This dataset is used to train and evaluate models for the detection of machine-paraphrased text.
The training set consists of 200,767 paragraphs (98,282 original; 102,485 paraphrased) extracted from 8,024 English Wikipedia articles (4,012 original; 4,012 paraphrased using the SpinBot API).
The test set is divided into three subsets: one created from preprints of research papers on arXiv, one from graduation theses, and one from Wikipedia articles. Additionally, different machine-paraphrasing tools were used (SpinBot and SpinnerChief with two settings).
Test set sizes (number of paragraphs):

| Method | Subset | Original | Spun |
|---|---|---|---|
| SpinBot | arXiv | 20,966 | 20,867 |
| SpinBot | Theses | 5,226 | 3,463 |
| SpinBot | Wikipedia | 39,241 | 40,729 |
| SpinnerChief-4W | arXiv | 20,966 | 21,671 |
| SpinnerChief-4W | Theses | 2,379 | 2,941 |
| SpinnerChief-4W | Wikipedia | 39,241 | 39,618 |
| SpinnerChief-2W | arXiv | 20,966 | 21,719 |
| SpinnerChief-2W | Theses | 2,379 | 2,941 |
| SpinnerChief-2W | Wikipedia | 39,241 | 39,697 |
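The per-subset counts above can be aggregated with a short sketch. The numbers are copied verbatim from the test-set listing; the variable names and output format are illustrative, not part of the dataset's official tooling.

```python
# Per-method test-set counts (original, spun), taken from the table above.
test_counts = {
    "SpinBot": {
        "arXiv": (20966, 20867),
        "Theses": (5226, 3463),
        "Wikipedia": (39241, 40729),
    },
    "SpinnerChief-4W": {
        "arXiv": (20966, 21671),
        "Theses": (2379, 2941),
        "Wikipedia": (39241, 39618),
    },
    "SpinnerChief-2W": {
        "arXiv": (20966, 21719),
        "Theses": (2379, 2941),
        "Wikipedia": (39241, 39697),
    },
}

# Sum originals and spun paragraphs across the three subsets of each method.
for method, subsets in test_counts.items():
    orig = sum(o for o, _ in subsets.values())
    spun = sum(s for _, s in subsets.values())
    print(f"{method}: {orig} original, {spun} spun, {orig + spun} total")
```

Note that the original arXiv and Wikipedia paragraphs are shared across the three paraphrasing methods; only the spun counts (and the theses subset) differ.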