CLUECorpus2020 is a large-scale Chinese corpus that can be used directly for self-supervised learning, such as language model pre-training or language generation. It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.
Source: CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model