Development of a methodology for identifying the authorship of binary and disassembled program codes based on an ensemble of modern natural language processing methods

Download article in PDF format

Authors: Kurtukova A. V., Romanov A. S., Shelupanov A. A.

Annotation: This article is part of a series of studies aimed at solving prob-lems of identifying the authorship of source code. The analysis of binary or disassembled code is a critical task in information security, software development, and computer forensics due to the need to protect intellectual property and copyright, as well as to identify the authors of malware. Any program is a machine code that can be disassembled (converted into text in assembly language) using specialized tools and analyzed for authorship by analogy with text in natural language. To solve this problem, the article proposes a technique based on the fastText ensemble, support vector machine (SVM) and the author-developed hy-brid neural network. The proposed methodology was evaluated on source codes in C and C++ languages, collected from the GitHub and Google Code Jam platforms, compiled into execut-able files and disassembled using reverse engineering tools. The average accuracy of identifying the author of disassembled code using the proposed method was more than 0.9. The technique was also tested on source codes, resulting in an average accura-cy of 0.96 in simple cases and more than 0.85 in complex cases (obfuscation, coding standards, etc.).

Viktor N. Maslennikov

Executive Secretary of the Editor’s Office

 Editor’s Office: 40 Lenina Prospect, Tomsk, 634050, Russia

  Phone / Fax: + 7 (3822) 51-21-21 / 51-43-02

  vnmas@tusur.ru

Subscription for updates