Sangraha is the largest high-quality, cleaned Indic-language pretraining dataset, containing 251B tokens in total across 22 languages, extracted from curated sources, existing multilingual corpora, and large-scale translations.