CCQA is a new web-scale question-answering (QA) dataset for in-domain model pre-training, built from the Common Crawl project. Using the readily available schema.org annotations, around 130 million multilingual question-answer pairs are extracted, including about 60 million English data points.
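To illustrate how schema.org annotations can yield question-answer pairs, here is a minimal sketch. It is not the CCQA authors' extraction pipeline. It assumes the page embeds its markup as JSON-LD inside a `<script type="application/ld+json">` tag and uses the schema.org `Question` properties `name`, `text`, `acceptedAnswer`, and `suggestedAnswer`. The helper names (`JsonLdCollector`, `extract_qa_pairs`) are hypothetical, and pages that use other encodings such as microdata attributes would need a different parser.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _as_list(value):
    """Normalize a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def extract_qa_pairs(html):
    """Return (question, answer) text pairs found in schema.org JSON-LD markup.

    Simplification (assumption): only top-level Question items or Questions
    under `mainEntity` (as in a QAPage) are considered.
    """
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for block in collector.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than fail the whole page
        for item in _as_list(doc):
            if not isinstance(item, dict):
                continue
            if item.get("@type") == "Question":
                candidates = [item]
            else:
                candidates = _as_list(item.get("mainEntity"))
            for q in candidates:
                if not isinstance(q, dict) or q.get("@type") != "Question":
                    continue
                question = q.get("name") or q.get("text")
                answers = _as_list(q.get("acceptedAnswer")) + _as_list(q.get("suggestedAnswer"))
                for a in answers:
                    if isinstance(a, dict) and question and a.get("text"):
                        pairs.append((question, a["text"]))
    return pairs
```

Calling `extract_qa_pairs(page_html)` on a page's HTML returns a list of `(question, answer)` tuples, with accepted answers listed before suggested ones. A web-scale pipeline would also need language identification, deduplication, and filtering, none of which are shown here.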