The code that created this dataset is available at https://github.com/nitzanfarhi/SecurityPatchDetection, and the dataset can be reproduced by running:
python data_collection\create_dataset.py --all -o data_collection\data
Note that this dataset does not include the generated commit data because it is very large. The commit data can be generated separately by running:
python data_collection\create_dataset.py --commits -o data_collection\data
Each repository is identified by a name of the form <COMPANY_NAME>_<REPOSITORY_NAME>.
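
The naming scheme above can be handled with two small helpers. This is only a sketch: the function names `repo_dir_name` and `split_repo_dir_name` are hypothetical and not part of the dataset's code, and the split assumes the company part contains no underscore, so the first underscore is treated as the separator.

```python
def repo_dir_name(company: str, repository: str) -> str:
    """Build a <COMPANY_NAME>_<REPOSITORY_NAME> identifier (hypothetical helper)."""
    return f"{company}_{repository}"


def split_repo_dir_name(name: str) -> tuple[str, str]:
    """Split an identifier back into (company, repository).

    Assumption: the company part contains no underscore, so the first
    underscore is the separator. Repository names may contain underscores.
    """
    company, sep, repository = name.partition("_")
    if not sep or not company or not repository:
        raise ValueError(f"not a <COMPANY_NAME>_<REPOSITORY_NAME> identifier: {name!r}")
    return company, repository


if __name__ == "__main__":
    name = repo_dir_name("nitzanfarhi", "SecurityPatchDetection")
    print(name)
    print(split_repo_dir_name(name))
```

If a company name in the data does contain an underscore, the split is ambiguous and the mapping would have to come from the original repository list instead.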
This dataset is publicly available for researchers. If you use our dataset, please cite our related research paper, which outlines the details of the dataset and its underlying principles:
@article{farhi2023detecting,
  title={Detecting Security Patches via Behavioral Data in Code Repositories},
  author={Farhi, Nitzan and Koenigstein, Noam and Shavitt, Yuval},
  journal={arXiv preprint arXiv:2302.02112},
  year={2023}
}

Please also mention gharchive.org if you use their data.