IIT Madras' AI4Bharat Unveils IndicVoices Dataset Covering 22 Indian Languages

In a significant stride towards language inclusivity, AI4Bharat, a research lab at IIT Madras, has unveiled IndicVoices, an open-source speech dataset that encompasses 22 Indian languages. The launch took place on March 6.

What Happened: According to a report by Moneycontrol, AI4Bharat's primary objective with IndicVoices is to collect spontaneous speech in a range of Indian languages.

AI4Bharat plans to use IndicVoices to build IndicASR, the first Automatic Speech Recognition (ASR) model to cover all 22 languages listed in the Eighth Schedule of the Indian Constitution. ASR models transcribe spoken language into text.

The dataset comprises 7,348 hours of audio from 16,237 speakers spanning 145 Indian districts and 22 languages. An anonymous expert estimates the cost of collecting this dataset to be around Rs 30 crore.


Of the total 7,348 hours, AI4Bharat has already transcribed 1,639 hours, with a median of 73 hours per language. The lab has also published an open-source blueprint for data collection, which includes standardized protocols, centralized tools, and comprehensive transcription guidelines.

Bhashini, a government-backed organization, has contributed $5-6 million in funding to AI4Bharat for data collection and will use the open-source data collected.

“This (datasets) will lead us to 22 language models and further lead us to use cases which we are building up,” stated Amitabh Nag, Chief Executive Officer, Bhashini.

Why It Matters: The launch of IndicVoices is a significant step towards promoting language inclusivity in India. The dataset will not only aid in the development of the IndicASR model but also provide a comprehensive resource for researchers and developers working on language-based AI models.



Engineered by Benzinga Neuro, Edited by Shomik Sen Bhattacharjee

