OSCAR

Introduced by Suárez et al. in A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
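
To make the "language classification and filtering" idea concrete, here is a minimal sketch of splitting raw text lines into per-language buckets. It is not the goclassy implementation: the function names, the toy stop-word classifier, and the threshold values are all assumptions chosen for illustration. A real pipeline would plug in a trained language-identification model in place of `toy_classify`.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical classifier interface: text -> (language code, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def toy_classify(text: str) -> Tuple[str, float]:
    """Stand-in classifier based on a few stop words (illustration only)."""
    words = text.lower().split()
    en = sum(w in {"the", "and", "of"} for w in words)
    fr = sum(w in {"le", "la", "et"} for w in words)
    if en == fr:
        return ("und", 0.0)  # undetermined: no evidence either way
    total = en + fr
    return ("en", en / total) if en > fr else ("fr", fr / total)


def split_by_language(
    lines: Iterable[str],
    classify: Classifier,
    min_confidence: float = 0.8,  # illustrative threshold
    min_chars: int = 100,         # illustrative threshold
) -> Dict[str, List[str]]:
    """Drop short or low-confidence lines and group the rest by language."""
    buckets: Dict[str, List[str]] = {}
    for line in lines:
        text = line.strip()
        if len(text) < min_chars:
            continue  # too short to classify reliably
        lang, confidence = classify(text)
        if confidence >= min_confidence:
            buckets.setdefault(lang, []).append(text)
    return buckets
```

Calling `split_by_language(lines, toy_classify, min_chars=10)` on a handful of short sentences returns a dictionary keyed by language code, with filtered-out lines simply absent.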

Dataset Loaders

huggingface/datasets (oscar)

18,412

huggingface/datasets (OSCAR-2109)

huggingface/datasets (OSCAR-2201)

huggingface/datasets (oscar-small)

huggingface/datasets (KoPI-CC)

huggingface/datasets (KoPI-CC_News)

huggingface/datasets (KoPI)

huggingface/datasets (oscar-mini)

huggingface/datasets (OSCAR-2019-Burmese-fix)

huggingface/datasets (OSCAR-2301)

huggingface/datasets (oscar-2301-hpc)

huggingface/datasets (0)
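
As a rough illustration of using one of the loaders listed above, the sketch below streams a single language subset of the original `oscar` dataset through the Hugging Face `datasets` library. It assumes the `unshuffled_<variant>_<lang>` configuration naming used by that loader. The helper names `oscar_config_name` and `stream_oscar` are hypothetical. Depending on the installed `datasets` version, script-based loaders such as this one may need extra options or may not be supported, so treat this as a starting point rather than a guaranteed recipe.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_fr'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def stream_oscar(lang: str, deduplicated: bool = True):
    """Return a streaming iterator over one language subset (needs network access)."""
    # Imported lazily so the helper above works without the third-party package.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(
        "oscar",
        oscar_config_name(lang, deduplicated),
        split="train",
        streaming=True,  # avoids downloading the full subset up front
    )
```

For example, `stream_oscar("gd")` would request the deduplicated Scottish Gaelic subset, and iterating over the result should yield records one at a time.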

Tasks

Language Modelling

Similar Datasets

ParaShoot

CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

IndicGLUE

mC4

License

CC0 1.0

Modalities

Texts

Languages

Scottish Gaelic

Egyptian Arabic

South Azerbaijani

Central Kurdish

Dimli (individual language)

Northern Frisian

Western Frisian

Karachay-Balkar

Norwegian Nynorsk

Occitan (post 1500)

Oriya (macrolanguage)

Western Panjabi

Waray (Philippines)

Malay (individual language)