Machine Paraphrase Corpus (MPC)

Introduced by Wahle et al. in Identifying Machine-Paraphrased Plagiarism

This dataset is used to train and evaluate models for the detection of machine-paraphrased text.

The training set consists of 200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).

The test set is divided into three subsets: one created from preprints of research papers on arXiv, one from graduation theses, and one from Wikipedia articles. Each subset is provided in versions paraphrased with different machine-paraphrasing methods: SpinBot, SpinnerChief-4W, and SpinnerChief-2W.

Test sets:

SpinBot:
    arXiv      - Original - 20,966;  Spun - 20,867
    Theses     - Original -  5,226;  Spun -  3,463
    Wikipedia  - Original - 39,241;  Spun - 40,729

SpinnerChief-4W:
    arXiv      - Original - 20,966;  Spun - 21,671
    Theses     - Original -  2,379;  Spun -  2,941
    Wikipedia  - Original - 39,241;  Spun - 39,618

SpinnerChief-2W:
    arXiv      - Original - 20,966;  Spun - 21,719
    Theses     - Original -  2,379;  Spun -  2,941
    Wikipedia  - Original - 39,241;  Spun - 39,697
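
The class balance of each test subset affects how accuracy and F1 scores should be read, so it can help to tabulate it before evaluating a detector. The following is a minimal sketch that uses only the counts listed above. The names TEST_SETS, spun_share, and summarize are illustrative and are not part of the dataset or of any official loader.

```python
# Counts copied from the test-set listing above, as (original, spun).
# The dictionary layout and all names here are illustrative assumptions.
TEST_SETS = {
    "SpinBot": {
        "arXiv": (20966, 20867),
        "Theses": (5226, 3463),
        "Wikipedia": (39241, 40729),
    },
    "SpinnerChief-4W": {
        "arXiv": (20966, 21671),
        "Theses": (2379, 2941),
        "Wikipedia": (39241, 39618),
    },
    "SpinnerChief-2W": {
        "arXiv": (20966, 21719),
        "Theses": (2379, 2941),
        "Wikipedia": (39241, 39697),
    },
}


def spun_share(original, spun):
    """Return the fraction of examples in a subset that are machine-paraphrased."""
    return spun / (original + spun)


def summarize(test_sets):
    """Return one (tool, source, total, spun_share) row per test subset."""
    rows = []
    for tool, sources in test_sets.items():
        for source, (original, spun) in sources.items():
            rows.append((tool, source, original + spun, spun_share(original, spun)))
    return rows


if __name__ == "__main__":
    for tool, source, total, share in summarize(TEST_SETS):
        print(f"{tool:16s} {source:10s} total={total:6d} spun={share:.3f}")
```

A share near 0.5 means a subset is roughly balanced. A subset further from 0.5 has a stronger majority-class baseline, so accuracy alone says less about detector quality there.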
