1 code implementation • EACL (DravidianLangTech) 2021 • Debapriya Tula, Prathyush Potluri, Shreyas Ms, Sumanth Doddapaneni, Pranjal Sahu, Rohan Sukumaran, Parth Patwa
Our model is able to handle code-mixed data as well as instances where the script used is mixed (for instance, Tamil and Latin).
1 code implementation • 11 Mar 2024 • Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages.
no code implementations • 10 Jan 2024 • Sumanth Doddapaneni, Krishna Sayana, Ambarish Jash, Sukhdeep Sodhi, Dima Kuzmin
Modeling long histories plays a pivotal role in enhancing recommendation systems, allowing them to capture users' evolving preferences and thereby produce more precise and personalized recommendations.
2 code implementations • 25 May 2023 • Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan
Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India.
2 code implementations • 12 May 2023 • Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra
However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility.
1 code implementation • 10 May 2023 • Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, Jackie Chi Kit Cheung
We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages.
1 code implementation • 20 Dec 2022 • Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan
The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages.
1 code implementation • 11 Dec 2022 • Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature.
no code implementations • 26 Aug 2022 • Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark.
Optical Character Recognition (OCR) Self-Supervised Learning +3
no code implementations • 12 Mar 2022 • Shreya Goyal, Sumanth Doddapaneni, Mitesh M. Khapra, Balaraman Ravindran
In the past few years, it has become increasingly evident that deep neural networks are not resilient enough to withstand adversarial perturbations in input data, leaving them vulnerable to attack.
no code implementations • 12 Nov 2021 • Debapriya Tula, Shreyas Ms, Viswanatha Reddy, Pranjal Sahu, Sumanth Doddapaneni, Prathyush Potluri, Rohan Sukumaran, Parth Patwa
To summarize, our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual, and code-mixed setting.
no code implementations • 6 Nov 2021 • Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Second, using this raw speech data, we pretrain several variants of wav2vec-style models for 40 Indian languages.
no code implementations • 1 Jul 2021 • Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc.
Joint Multilingual Sentence Representations Multilingual text classification +4
1 code implementation • 12 Apr 2021 • Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra
We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences.
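As a rough illustration of steps (c) and (d) only, the sketch below pairs each source sentence with its most similar target sentence by cosine similarity and keeps the pair if the score clears a threshold. It is not the authors' pipeline: the vectors are hand-made toy values standing in for the output of a multilingual representation model, the exhaustive scan stands in for an approximate nearest neighbor index, and the function names and the 0.9 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """Return (src_index, tgt_index, score) for each source sentence whose
    best-matching target sentence scores at least `threshold`.

    Assumption: this exhaustive scan is a stand-in for an approximate
    nearest neighbor index, which a large collection would need.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        # Exact nearest neighbor: score every target and keep the best one.
        best_j, best_score = max(
            ((j, cosine(u, v)) for j, v in enumerate(tgt_vecs)),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy, hand-made vectors (hypothetical; not produced by any real model).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

pairs = mine_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The third source vector has no sufficiently close target, so it is dropped rather than forced into a low-quality pair; the threshold trades the amount of mined data against its precision.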