Papers with Code Dataset Licensing Guide
On Licenses
What is it?
Indexing datasets on Papers with Code helps to improve accessibility, reproducibility, and responsible
use of a particular dataset artefact. Licensing information is important for ensuring that a dataset is
referenced and used properly.
A dataset license should clearly describe how to properly and responsibly use a dataset. It’s intended
to help guide the user of the dataset in making informed decisions about how the dataset can and cannot
be used.
Why is it important?
Having a proper license attached to a dataset encourages the user of the dataset to make informed
decisions on how to use the dataset. In addition, it protects the creator of the dataset from
liabilities associated with the use of the dataset. Without a license, there is also a risk of malicious
or incorrect use of a dataset, whether intentional or not, that could lead to undesired
consequences.
Common Dataset Licenses
Types of Licenses
Similar to licenses used for software products, licenses applied to datasets vary in their restrictions
and permissions. Below are some of the main types of dataset licenses used in the field of machine
learning. The summaries of the CC licenses provided below are not a substitute for the licenses
themselves. Please refer to the official licensing guide for complete guidance.
Public domain (CC0) - CC0 is a public dedication tool, which allows creators to give up their
copyright and put their works into the worldwide public domain. CC0 allows reusers to distribute, remix,
adapt, and build upon the material in any medium or format, with no conditions.
CC BY - This license allows reusers to distribute, remix, adapt, and build upon the material in any
medium or format, so long as attribution is given to the creator. The license allows for commercial use.
CC BY-SA - This license allows reusers to distribute, remix, adapt, and build upon the material in any
medium or format, so long as attribution is given to the creator. The license allows for commercial use.
If you remix, adapt, or build upon the material, you must license the modified material under identical
terms.
CC BY-NC - This license allows reusers to distribute, remix, adapt, and build upon the material in any
medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.
CC BY-NC-SA - This license allows reusers to distribute, remix, adapt, and build upon the material in any
medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.
If you remix, adapt, or build upon the material, you must license the modified material under identical
terms.
CC BY-ND - This license allows reusers to copy and distribute the material in any medium or format in
unadapted form only, and only so long as attribution is given to the creator. The license allows for
commercial use.
CC BY-NC-ND - This license allows reusers to copy and distribute the material in any medium or format in
unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the
creator.
MIT - This license is primarily used for licensing software products. See official license details
here.
GPL - This license is a free, copyleft license for software and other kinds of works. See official
license details here.
Apache License, Version 2.0 - This license is primarily used for licensing software products. See
official license details here.
BSD-3-Clause - This license is primarily used for licensing software products. See official license
details here.
Custom - a non-standard license. Check the link to the license to learn more about the exact
terms of usage. In some cases we include a brief summary in parentheses. For example,
Custom (non-commercial) is a custom license stipulating non-commercial usage.
Unknown - refers to a dataset for which a license is unknown or hasn’t been annotated yet. Check
the associated paper and dataset website for more information.
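The permissions summarized above can be encoded as a small lookup table. The following is only a minimal sketch in Python: the names `LicenseTerms`, `CC_TERMS`, and `allows` are hypothetical and not part of any Papers with Code API, and the table merely restates the summaries in this guide, so it is not a substitute for the licenses themselves. Licenses that are not in the table (software licenses, Custom, Unknown) deliberately return `None`, meaning the terms have to be read directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical summary of a CC license, based on the descriptions above."""
    attribution: bool   # credit must be given to the creator (BY)
    commercial: bool    # commercial use is allowed (no NC)
    derivatives: bool   # remixing and adapting is allowed (no ND)
    share_alike: bool   # adaptations must use identical terms (SA)


# Restates the summaries in this guide; not an official source.
CC_TERMS = {
    "CC0":         LicenseTerms(False, True,  True,  False),
    "CC BY":       LicenseTerms(True,  True,  True,  False),
    "CC BY-SA":    LicenseTerms(True,  True,  True,  True),
    "CC BY-NC":    LicenseTerms(True,  False, True,  False),
    "CC BY-NC-SA": LicenseTerms(True,  False, True,  True),
    "CC BY-ND":    LicenseTerms(True,  True,  False, False),
    "CC BY-NC-ND": LicenseTerms(True,  False, False, False),
}


def allows(license_name: str, *, commercial: bool = False,
           adapt: bool = False) -> Optional[bool]:
    """Return True/False for a known CC license, or None if the license
    is not in the table and its terms must be read directly."""
    terms = CC_TERMS.get(license_name)
    if terms is None:
        return None
    if commercial and not terms.commercial:
        return False
    if adapt and not terms.derivatives:
        return False
    return True
```

For instance, `allows("CC BY-NC", commercial=True)` evaluates to `False`, while `allows("Custom (non-commercial)", commercial=True)` evaluates to `None` because custom terms cannot be inferred from the name alone.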
How to license your work?
- For CC licensing, the process is straightforward: you only need
to choose the license that fits your needs. The choice must be communicated clearly,
including a direct link to the license you have chosen. An example could be: "This work is
licensed under a CC BY 4.0 license." See official instructions here.
- For all other types of licenses used primarily for software
products, like MIT, GPL, and Apache License 2.0, the instructions vary. Find more information
about such licenses in the Open Source Initiative documentation.
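As an illustration of a notice that names the license and links to it directly, here is a minimal Python sketch. The helper `license_notice` and the table `CC_URL_CODES` are hypothetical names introduced for this example. The sketch assumes the 4.0 versions of the six CC licenses described in this guide, and it does not cover CC0 or software licenses.

```python
# Hypothetical mapping from the license names used in this guide to the
# path segment of the corresponding Creative Commons 4.0 license URL.
CC_URL_CODES = {
    "CC BY": "by",
    "CC BY-SA": "by-sa",
    "CC BY-NC": "by-nc",
    "CC BY-NC-SA": "by-nc-sa",
    "CC BY-ND": "by-nd",
    "CC BY-NC-ND": "by-nc-nd",
}


def license_notice(name: str) -> str:
    """Build a one-line notice with a direct link to the chosen license.

    Raises KeyError for any license that is not in CC_URL_CODES.
    """
    code = CC_URL_CODES[name]
    url = f"https://creativecommons.org/licenses/{code}/4.0/"
    return f"This work is licensed under a {name} 4.0 license: {url}"


print(license_notice("CC BY"))
```

The resulting line can be placed in a dataset README or on the dataset website so that the chosen license and its link always appear together.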
Important note
This is not an exhaustive list of licenses available for datasets. These refer to some of the most
commonly used licenses. The Creative Commons licenses described above refer to the 4.0 version. The
descriptions provided for each license above are not a substitute for the license. For the official
descriptions of each license you can refer to the provided links.
Common Pitfalls
Regarding dataset licensing, here are some common pitfalls that may occur:
- Annotation-only licenses. In some cases, especially in
Computer Vision, it’s common for only the annotations to be released under a free license,
while the images remain proprietary (e.g. owned by Flickr). Check the license terms to
understand what is covered and what is not.
- Code-only licenses. In some cases, only the code (e.g. to scrape
the data) is available. While the code might be open source and freely licensed, the resulting
data may still not be open and available for every use.
- Data ownership. Check the description of the dataset, as well as how it was
collected, to determine whether you have permission to use the dataset for your purposes.
Disclaimer
Papers with Code doesn’t make any warranties regarding any of the licensing information provided by the
datasets we index. Furthermore, Papers with Code disclaims liability for damages resulting from use of
the licensing information provided by the creators of the datasets.