Python Source Code De-Anonymization Using Nested Bigrams

An important issue in cybersecurity is the insertion or modification of code by individuals other than its original authors. This motivates research on authorship attribution of source code of unknown origin. We address the deficiencies of previously used feature extraction methods and propose a novel approach: Nested Bigrams. These features are easy to extract and carry substantial information about the interconnections between the nodes of the abstract syntax tree. We also show that, for a large number of authors, a Strongly Regularized Feed-forward Neural Network outperforms the Random Forest Classifier used in many code stylometric studies. We further propose a new ranking system for reducing the number of features; experiments show that it can reduce the feature set to 98 nested bigrams while maintaining a classification accuracy above 90 percent.
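The abstract does not spell out how a nested bigram is constructed, so the following is only a rough sketch under one assumption: that a bigram pairs the node type of an abstract syntax tree node with the node type of one of its direct children. The function name `parent_child_bigrams` and the sample snippet are hypothetical and are not taken from the paper. The sketch uses Python's standard `ast` module to count such pairs for a piece of source code.

```python
import ast
from collections import Counter


def parent_child_bigrams(source: str) -> Counter:
    """Count (parent type, child type) pairs over the AST of `source`.

    Assumption: this parent-child pairing is an illustrative stand-in for
    the paper's nested bigrams, not the authors' exact definition.
    """
    tree = ast.parse(source)
    counts = Counter()
    # ast.walk visits every node; iter_child_nodes yields its direct children.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


# Hypothetical sample input, used only to exercise the function.
sample = "def add(a, b):\n    return a + b\n"
features = parent_child_bigrams(sample)

# Show the most frequent pairs as a sparse feature vector would see them.
for bigram, count in features.most_common(5):
    print(bigram, count)
```

Counts of this kind could serve as the input vector to a classifier such as a random forest or a regularized feed-forward network, with one dimension per distinct pair observed in the training corpus.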
