Breaking Language Barriers: Introducing IndicLID – A Language Identification Breakthrough for Indian Languages
Did you know that India boasts 22 official languages, 122 major languages, and a whopping 1599 other languages? Despite this linguistic diversity, much of the content on the web is dominated by English.
To help address this, Bhasha-Abhijnaanam provides a language identification test set covering 22 Indic languages in both native script and Romanized text. The test set is a resource for researchers and developers who need to identify languages accurately across the many scripts used for Indic languages.
Yash Madhani (AI4Bharat), Mitesh M. Khapra (IIT Madras), and Microsoft’s Anoop Kunchukuttan came together to conduct a study – Bhasha-Abhijnaanam: Native-script and Romanized Language Identification for 22 Indic Languages.
Language Identification Models for Indian Languages: Filling the Gaps in Existing Tools
In this study, the main focus was on creating a language identifier specifically designed for the 22 languages listed in the Indian constitution. As digital technologies continue to advance, there is a growing need to make NLP tools such as translation, ASR, and conversational technologies accessible to the wider population. A reliable language identifier is crucial for developing language resources in low-resource languages.
However, existing language identification tools have limitations when it comes to Indian languages. They often fail to cover all 22 languages and lack support for detecting Romanized Indian language text, which is commonly used in social media and chats. Given the significant number of internet users in India, accurate and effective Romanized language identification models hold great potential in the NLP field, particularly in social media and chat applications. The team therefore took on the task of creating a language identifier specifically tailored for these 22 Indian languages.
Native script test set – Enhancing Language Coverage for Indian Languages
To expand the language coverage of existing datasets, the team curated a comprehensive native script test set. This test set encompasses 19 Indian languages and 11 scripts, incorporating data from the FLORES-200 dev-test and Dakshina sentence test set. The team also generated native text test sets for three additional languages (Bodo, Konkani, Dogri) and one script (Manipuri in Meetei Mayek script) that were not included in the previous datasets.
To ensure the accuracy and quality of the test samples, the team employed professional translators to translate English sentences sourced from Wikipedia into the respective languages. This approach is intended to keep the test set reliable and to minimize data noise.
Roman script test set – Evaluating Language Identification for Indian Languages
To evaluate the effectiveness of Roman-script language identification (LID) for 21 Indian languages, the team introduced a new benchmark test set. While the Dakshina romanized sentence test set already includes 11 of these languages, it contains short sentences consisting mainly of named entities and English loan words, which are not ideal for romanized text LID evaluation.
To address this limitation, the team manually validated the Dakshina test sets and filtered out approximately 7% of the sentences. For the remaining 10 languages, they created a benchmark test set by sampling sentences from IndicCorp and having annotators transliterate them into Roman script naturally, without strict guidelines. Annotators were instructed to skip any invalid sentences (wrong language, offensive, truncated, etc.).
Filtering the Romanized Dakshina Test Set
The Dakshina romanized sentence test set contains short sentences that mainly consist of named entities and English loan words, making them unsuitable for evaluating romanized text language identification (LID). Manual validation was conducted for the Dakshina test sets for the target languages.
Two criteria were used to flag potentially problematic sentences: sentences shorter than 5 words, and sentences for which the native-script LID model had low confidence (prediction score below 0.8). Native-language annotators then reviewed the flagged sentences, filtering out those made up of named entities and those where the language was hard to determine. Approximately 7% of the sentences were filtered out. Refer to Table 2 of the paper for more details on the filtering statistics.
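The flagging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` record and its `native_lid_score` field are hypothetical names, and the sketch assumes that meeting either criterion is enough to send a sentence to an annotator, which the text does not spell out.

```python
from dataclasses import dataclass
from typing import List

MIN_WORDS = 5          # sentences shorter than this are flagged
MIN_CONFIDENCE = 0.8   # native LID scores below this are flagged


@dataclass
class Sample:
    text: str                # the sentence being checked
    native_lid_score: float  # hypothetical field: native LID confidence


def needs_review(sample: Sample) -> bool:
    """Flag a sample if it is short OR the native LID model is unsure (assumed OR)."""
    too_short = len(sample.text.split()) < MIN_WORDS
    low_confidence = sample.native_lid_score < MIN_CONFIDENCE
    return too_short or low_confidence


def flag_for_review(samples: List[Sample]) -> List[Sample]:
    """Return the subset of samples that a human annotator should inspect."""
    return [s for s in samples if needs_review(s)]
```

Only the flagged subset goes to annotators, so the manual effort stays proportional to the number of suspicious sentences rather than the size of the test set.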
IndicLID: Language Classification for Indic Languages
IndicLID is a language classifier designed specifically for Indic languages, capable of predicting 47 language classes. This includes 24 classes for native scripts, 21 classes for Roman scripts, and additional classes for English and Others.
Three variants of the classifier were created: a fast linear classifier, a slower classifier fine-tuned from a pre-trained language model, and an ensemble model that balances speed and accuracy.
Training Dataset Creation
Native-Script Training Data: To create our training dataset, we gathered sentences from various sources such as IndicCorp, NLLB, Wikipedia, Vikaspedia, and internal sources. We ensured diversity and representation by sampling 100,000 sentences per language-script combination, maintaining a balanced distribution across these sources. For languages with fewer than 100,000 sentences, we employed oversampling techniques. The sentences were tokenized and normalized using the IndicNLP library with default settings.
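A minimal sketch of the per-class sampling, assuming the simplest form of oversampling (repeating sentences drawn with replacement). The function names and the class labels in the usage note are hypothetical, and the sketch leaves out the balancing across sources that the text mentions.

```python
import random
from typing import Dict, List

TARGET_PER_CLASS = 100_000  # sentences per language-script combination


def sample_class(sentences: List[str], target: int, rng: random.Random) -> List[str]:
    """Downsample large classes; oversample small ones with replacement."""
    if not sentences:
        return []
    if len(sentences) >= target:
        return rng.sample(sentences, target)
    # Keep every available sentence, then top up with random repeats.
    extra = rng.choices(sentences, k=target - len(sentences))
    return sentences + extra


def build_training_set(
    corpus: Dict[str, List[str]],
    target: int = TARGET_PER_CLASS,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Map each class label to exactly `target` sentences (if it has any)."""
    rng = random.Random(seed)
    return {label: sample_class(sents, target, rng) for label, sents in corpus.items()}
```

For example, `build_training_set({"hin_Deva": hindi_sentences, "brx_Deva": bodo_sentences})` would return 100,000 sentences for each label, with repeats in any class that started with fewer.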
Romanized Training Data: There is a scarcity of publicly available Romanized corpora for Indian languages. To address this, we utilized transliteration to generate synthetic Romanized data. We transliterated the native script training data into the Roman script using the multilingual IndicXlit transliteration model (Indic-to-En version). The quality of the transliteration was assessed based on the analysis provided by the authors of the IndicXlit model, ensuring the accuracy of the generated training data.
Linear Classifier
For the linear classifier, FastText, a lightweight and efficient model commonly used for language identification tasks, is utilized. FastText leverages character n-gram features, which provide subword information and allow the model to handle large-scale text data effectively.
This is particularly useful for distinguishing between languages with similar spellings or rare words. We trained separate classifiers for the native script (IndicLID-FTN) and Roman script (IndicLID-FTR). After experimentation, we determined that 8-dimensional word vector models were optimal in terms of both model size and accuracy.
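To make the idea of character n-gram features concrete, here is a small sketch in the style of FastText's subword features, which wrap each word in `<` and `>` boundary markers before slicing it into n-grams. The function name and the default n-gram lengths are illustrative assumptions, not the settings used for IndicLID, and the real library additionally hashes the n-grams into a fixed number of buckets.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 2, max_n: int = 4) -> List[str]:
    """Return all character n-grams of `word`, including boundary markers."""
    marked = f"<{word}>"  # mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams
```

Calling `char_ngrams("hai", 3, 3)` returns `["<ha", "hai", "ai>"]`. Because spelling variants of the same romanized word share most of their n-grams, a classifier built on these features can still recognise a word it has never seen in exactly that form.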
Pretrained LM-based Classifier
To improve performance on romanized text, models with larger capacities were explored. Specifically, pre-trained language models (LMs) were fine-tuned on the Romanized training dataset.
Three LMs were evaluated: XLM-R, IndicBERT-v2, and MuRIL.
IndicBERT-v2 and MuRIL are specifically designed for Indian languages, with MuRIL incorporating synthetic romanized data in its pre-training. The hyperparameters for the fine-tuning process can be found in Appendix B of the paper. Among these LMs, we selected the IndicBERT-based classifier (referred to as IndicLID-BERT) as it demonstrated strong performance on romanized text and offered broad language coverage.
Final Ensemble Classifier
Our IndicLID classifier is a pipeline consisting of multiple classifiers. Here’s how the pipeline works:
- Depending on the amount of Roman script in the input text, we choose either the native text or the Romanized linear classifier. We use IndicLID-FTR for text with more than 50% Roman characters.
- For romanized text, if IndicLID-FTR is confident in its prediction (probability of the predicted class > 0.6), we use its prediction. Otherwise, we redirect the request to the slower but more accurate IndicLID-BERT. This two-stage approach, with its 0.6 threshold, provides a good trade-off between classifier accuracy and inference speed.
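The routing logic of the pipeline can be sketched as follows. The three classifiers are passed in as plain callables, so nothing here depends on the actual IndicLID code. The way the share of Roman characters is measured (ASCII letters as a fraction of all alphabetic characters) is an assumption, as are all function and constant names.

```python
from typing import Callable, Tuple

Prediction = Tuple[str, float]          # (label, probability)
Classifier = Callable[[str], Prediction]

ROMAN_RATIO_THRESHOLD = 0.5     # more than 50% Roman characters -> Roman branch
FTR_CONFIDENCE_THRESHOLD = 0.6  # below or equal -> fall back to the LM classifier


def roman_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are ASCII letters (assumed measure)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.isascii()) / len(letters)


def identify(text: str, ftn: Classifier, ftr: Classifier, bert: Classifier) -> Prediction:
    """Route text through the native, Roman linear, or LM-based classifier."""
    if roman_ratio(text) <= ROMAN_RATIO_THRESHOLD:
        return ftn(text)                  # native-script linear classifier
    label, prob = ftr(text)               # fast Roman-script linear classifier
    if prob > FTR_CONFIDENCE_THRESHOLD:
        return label, prob
    return bert(text)                     # slower, more accurate fallback
```

The slower model only runs on the romanized inputs that the linear classifier is unsure about, which is how the pipeline keeps most of the linear model's speed.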
Results and Discussion
To ensure data separation, the Flores-200 test set (NLLB Team et al., 2022) and the Dakshina test set (Roark et al., 2020) were excluded when sampling native training samples from various sources. Moreover, it was ensured that the benchmark test set did not include any training samples. Great precautions were taken to avoid overlaps between the test and validation sets. For the creation of the Romanized training set, we simply transliterated the native training set. Since the Dakshina test set provided parallel sentences for the native and Roman test sets, there was no overlap between the Roman training and test sets.
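One simple way to enforce this kind of separation is an exact-match filter on lightly normalized text, sketched below. The text does not describe how the overlap check was actually carried out, so both the normalization (collapsing whitespace and case-folding) and the function names are assumptions.

```python
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Collapse whitespace and fold case so trivial variants compare equal."""
    return " ".join(sentence.split()).casefold()


def drop_test_overlap(train: Iterable[str], test: Iterable[str]) -> List[str]:
    """Return the training sentences that do not appear in the test set."""
    test_keys = {normalize(s) for s in test}
    return [s for s in train if normalize(s) not in test_keys]
```

An exact-match filter like this catches only identical sentences. Near-duplicates would need a fuzzier comparison.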
Native script language identification (LID): IndicLID-FTN is compared with the NLLB model (NLLB Team et al., 2022) and the CLD3 model. IndicLID-FTN performs comparably to or better than the other models in LID accuracy. Additionally, our model is 10 times faster and 4 times smaller than the NLLB model. We can further reduce the model’s size through model quantization (Joulin et al., 2016), which is a potential area for future work.
Roman script language identification (LID): IndicLID-BERT outperforms IndicLID-FTR significantly, although at a lower throughput. However, the ensemble model (IndicLID) maintains LID performance similar to that of IndicLID-BERT while achieving a 3x increase in throughput compared to IndicLID-BERT. To further improve the model throughput, future work can focus on creating distilled versions of the model.
LID confusion analysis: It is observed that the main source of confusion in language identification occurs between similar languages. For example, there are clusters of confusion between Hindi and closely related languages like Maithili, Urdu, and Punjabi, as well as between Konkani and Marathi, and Sindhi and Kashmiri. Improving the accuracy of Romanized language identification, especially for very similar languages, is an important area for improvement.
Impact of synthetic training data: To assess the impact of synthetic training data, we generate a machine-transliterated version of the romanized test set using IndicXlit. We compare the accuracy of language identification on the original test set and the synthetically generated test set.
The synthetic test set exhibits data characteristics closer to the training data compared to the original test set. Closing the gap between the training and test data distributions, either by incorporating original romanized data in the training set or by improving the generation of synthetic romanized data to better reflect the true data distribution, is crucial for enhancing model performance.
The confusion matrix provides further insights into the impact of synthetic training data. Hindi, for example, is often confused with languages like Nepali, Sanskrit, Marathi, and Konkani, which share the same native script (Devanagari). This could be attributed to the use of a multilingual transliteration model, which incorporates significant Hindi data, in creating the synthetic Romanized training data. Consequently, the synthetic Romanized forms of these languages may be more similar to Hindi than the original Romanized data is.
Impact of input length: Language identification exhibits higher confusion rates for shorter inputs (less than 10 words), while performance remains relatively stable for longer inputs.
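An analysis of this kind can be reproduced by bucketing test examples by word count and computing accuracy per bucket. The sketch below assumes examples are available as (text, gold label, predicted label) triples and uses the 10-word cutoff from the text. The function name and bucket names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_length(
    examples: Iterable[Tuple[str, str, str]],
    short_cutoff: int = 10,
) -> Dict[str, float]:
    """Compute accuracy separately for short and long inputs.

    Each example is a (text, gold_label, predicted_label) triple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, gold, pred in examples:
        bucket = "short" if len(text.split()) < short_cutoff else "long"
        total[bucket] += 1
        correct[bucket] += int(gold == pred)
    # Only buckets that actually received examples appear in the result.
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```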
Limitations
The language identification benchmark primarily consists of clean sentences that are grammatically correct and written in a single script. However, real-world data often contains noise, such as ungrammatical sentences, mixed scripts, code-mixing, and invalid characters. A more representative benchmark that includes such use cases would be beneficial. Nevertheless, this benchmark adequately serves the purpose of collecting clean monolingual corpora and serves as an initial step for languages lacking an existing language identification benchmark.
The use of synthetic training data introduces a performance gap due to differences in the distribution of training and test data. Acquiring original native romanized text and developing improved methods for generating romanized text are necessary to address this issue. It is important to note that the Romanized language identification model does not support Dogri since the IndicXlit transliteration model does not support Dogri. However, since Dogri is written in the Devanagari script, using the transliterator for Hindi, which shares the same script, may serve as a reasonable approximation for generating synthetic training data. Further exploration of this approach is planned for future research.
This work is limited to the 22 languages listed in the 8th schedule of the Indian constitution. Further work is required to expand the benchmark to include a broader range of widely used languages in India, considering that there are approximately 30 languages with more than a million speakers in the country.
Ethics Statement
The dataset annotations were conducted by native speakers of the languages from the Indian subcontinent who were employed and compensated with a competitive monthly salary. The remuneration was determined based on their expertise and experience, following the standards set by the government of our country. The dataset does not contain any harmful or offensive content. Annotators were fully informed that their annotations would be made publicly available and that no private information would be included in the annotations.
The proposed benchmark builds upon existing datasets and relevant works, which have been appropriately cited. The annotations were collected on a publicly accessible dataset and will be released to the public for future use. The IndicCorp dataset, which we annotated, has already been screened for offensive content. All datasets created as part of this project will be released under a CC-0 license, and the code and models will be released under an MIT license.
Conclusion
The Bhasha-Abhijnaanam benchmark and the IndicLID models will serve as a basis for building NLP resources for Indian languages, particularly extremely low-resource ones that are “left behind” in the NLP world today. The work takes the first steps towards LID of Romanized text, and our analysis reveals directions for future work.