CCQA is a new web-scale question-answering (QA) dataset for in-domain model pre-training, built from the Common Crawl project. Using the readily available schema.org annotations, around 130 million multilingual question-answer pairs were extracted, including about 60 million English data points.
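
To illustrate how schema.org annotations make question-answer pairs easy to recover from a web page, here is a minimal sketch. It assumes the page embeds its markup as JSON-LD with `Question` entries under `mainEntity`. The actual CCQA extraction pipeline is not described here and may work differently (for example, on other schema.org encodings). The names `JsonLdCollector`, `extract_qa_pairs`, and `SAMPLE_HTML` are hypothetical and used only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _as_list(value):
    """schema.org properties may hold one item or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def extract_qa_pairs(html):
    """Return (question, answer) text pairs found in JSON-LD Question markup."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for block in collector.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of failing the whole page
        for node in _as_list(doc):
            if not isinstance(node, dict):
                continue
            for question in _as_list(node.get("mainEntity")):
                if not isinstance(question, dict) or question.get("@type") != "Question":
                    continue
                for key in ("acceptedAnswer", "suggestedAnswer"):
                    for answer in _as_list(question.get(key)):
                        if isinstance(answer, dict) and "text" in answer:
                            pairs.append((question.get("name", ""), answer["text"]))
    return pairs


# Hypothetical page fragment, made up for this example.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage",
 "mainEntity": [{"@type": "Question", "name": "What is CCQA?",
   "acceptedAnswer": {"@type": "Answer",
                      "text": "A QA dataset built from Common Crawl."}}]}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for question, answer in extract_qa_pairs(SAMPLE_HTML):
        print(question, "->", answer)
```

The sketch uses only the Python standard library and skips malformed blocks, since markup found on crawled pages is often incomplete or invalid.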

License


  • Unknown