Machine Paraphrase Corpus (MPC)

Introduced by Wahle et al. in Identifying Machine-Paraphrased Plagiarism

This dataset is used to train and evaluate models for the detection of machine-paraphrased text.

The training set consists of 200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).
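
The class balance implied by these paragraph counts can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `class_shares` and the constant names are made up for this example and are not part of the dataset or any loader.

```python
from typing import Tuple

# Paragraph counts copied from the training-set description above.
TRAIN_ORIGINAL = 98_282
TRAIN_PARAPHRASED = 102_485


def class_shares(original: int, paraphrased: int) -> Tuple[int, float, float]:
    """Return (total, share of original, share of paraphrased) paragraphs.

    Hypothetical helper for illustration only.
    """
    total = original + paraphrased
    return total, original / total, paraphrased / total


train_total, train_orig_share, train_para_share = class_shares(
    TRAIN_ORIGINAL, TRAIN_PARAPHRASED
)
print(
    f"total={train_total:,} "
    f"original={train_orig_share:.2%} "
    f"paraphrased={train_para_share:.2%}"
)
```

The two classes are close to balanced, with slightly more paraphrased than original paragraphs, so plain accuracy is a reasonable first metric on the training distribution.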

The test set is divided into three subsets: one created from preprints of research papers on arXiv, one from graduation theses, and one from Wikipedia articles. Each subset was paraphrased with several machine-paraphrasing tools (SpinBot, SpinnerChief-4W, and SpinnerChief-2W).

Test sets:

SpinBot:
    arXiv      - Original - 20,966;  Spun - 20,867
    Theses     - Original -  5,226;  Spun -  3,463
    Wikipedia  - Original - 39,241;  Spun - 40,729

SpinnerChief-4W:
    arXiv      - Original - 20,966;  Spun - 21,671
    Theses     - Original -  2,379;  Spun -  2,941
    Wikipedia  - Original - 39,241;  Spun - 39,618

SpinnerChief-2W:
    arXiv      - Original - 20,966;  Spun - 21,719
    Theses     - Original -  2,379;  Spun -  2,941
    Wikipedia  - Original - 39,241;  Spun - 39,697
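
To compare the test sets at a glance, the counts listed above can be put into a small table and summed per paraphrasing tool. The sketch below is illustrative only: `TEST_COUNTS`, `tool_totals`, and `spun_share` are hypothetical names, and the numbers are copied as listed without any correction.

```python
# (original, spun) paragraph counts per tool and source, as listed above.
TEST_COUNTS = {
    "SpinBot": {
        "arXiv": (20_966, 20_867),
        "Theses": (5_226, 3_463),
        "Wikipedia": (39_241, 40_729),
    },
    "SpinnerChief-4W": {
        "arXiv": (20_966, 21_671),
        "Theses": (2_379, 2_941),
        "Wikipedia": (39_241, 39_618),
    },
    "SpinnerChief-2W": {
        "arXiv": (20_966, 21_719),
        "Theses": (2_379, 2_941),
        "Wikipedia": (39_241, 39_697),
    },
}


def tool_totals(counts):
    """Total number of paragraphs (original + spun) per paraphrasing tool."""
    return {
        tool: sum(orig + spun for orig, spun in subsets.values())
        for tool, subsets in counts.items()
    }


def spun_share(counts, tool, source):
    """Fraction of spun paragraphs in one tool/source test subset."""
    orig, spun = counts[tool][source]
    return spun / (orig + spun)


totals = tool_totals(TEST_COUNTS)
for tool, total in totals.items():
    print(f"{tool}: {total:,} paragraphs")
```

Calling `spun_share(TEST_COUNTS, "SpinBot", "Theses")` gives the paraphrased fraction of a single subset, which helps when deciding whether a per-subset metric needs to account for class imbalance.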

Dataset Loaders

huggingface/datasets

Tasks

Paraphrase Identification

Similar Datasets

Senseval-2

CC-Stories

Autoencoder Paraphrase Dataset (AEPD)
