Proceedings of TUSUR University, 2023, vol. 26, no. 4

Development of a methodology for identifying the authorship of binary and disassembled program codes based on an ensemble of modern natural language processing methods

DOI: 10.21293/1818-0442-2023-26-4-53-60

Abstract: This article is part of a series of studies on identifying the authorship of source code. Analysis of binary or disassembled code is a critical task in information security, software development, and computer forensics: it is needed to protect intellectual property and copyright and to identify the authors of malware. Any program is ultimately machine code, which can be disassembled (converted into assembly-language text) with specialized tools and then analyzed for authorship by analogy with natural-language text. To solve this problem, the article proposes a technique based on an ensemble of fastText, a support vector machine (SVM), and a hybrid neural network developed by the authors. The proposed methodology was evaluated on C and C++ source code collected from GitHub and Google Code Jam, compiled into executable files, and disassembled with reverse engineering tools. The average accuracy of identifying the author of disassembled code with the proposed method exceeded 0.9. The technique was also tested on source code, yielding an average accuracy of 0.96 in simple cases and more than 0.85 in complex cases (obfuscation, coding standards, etc.).
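
The idea of treating disassembled code as text and combining several classifiers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which features are extracted or how the ensemble members are combined, so the mnemonic n-gram features, the averaging (soft-voting) rule, the function names mnemonic_ngrams and soft_vote, and the probability values below are all assumptions made for illustration.

```python
from collections import Counter


def mnemonic_ngrams(disassembly: str, n: int = 2) -> Counter:
    """Count n-grams of instruction mnemonics in a disassembly listing.

    Assumes one instruction per line, ';' comments, and labels ending in ':'.
    """
    mnemonics = []
    for line in disassembly.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":"):    # skip blank lines and labels
            continue
        mnemonics.append(line.split()[0].lower())
    return Counter(
        tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)
    )


def soft_vote(prob_vectors, weights=None):
    """Combine per-author probabilities from several models by weighted mean.

    prob_vectors: non-empty list of dicts mapping author name -> probability.
    Returns (predicted_author, combined_probabilities).
    """
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    total = sum(weights)
    authors = set().union(*prob_vectors)
    combined = {
        a: sum(w * p.get(a, 0.0) for w, p in zip(weights, prob_vectors)) / total
        for a in authors
    }
    return max(combined, key=combined.get), combined


# Hypothetical listing; real input would come from a disassembler.
listing = """
main:
    push rbp
    mov rbp, rsp
    mov eax, 0      ; return value
    pop rbp
    ret
"""
features = mnemonic_ngrams(listing)

# Hypothetical outputs of the three ensemble members for one sample.
fasttext_probs = {"alice": 0.6, "bob": 0.4}
svm_probs = {"alice": 0.3, "bob": 0.7}
hybrid_nn_probs = {"alice": 0.7, "bob": 0.3}
author, combined = soft_vote([fasttext_probs, svm_probs, hybrid_nn_probs])
```

Averaging probabilities is only one way to combine the members; weighted voting or a trained meta-classifier would fit the same interface through the weights argument or a replacement for soft_vote.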

Keywords: source code, machine learning, author, neural networks, ensemble, disassembler

Funding: This work was carried out with financial support from the Ministry of Science and Higher Education of the Russian Federation within the framework of the basic part of the state assignment for TUSUR for 2023–2025 (project No. FEWM-2023-0015).

For citation:
Kurtukova A. V., Romanov A. S., Shelupanov A. A. Development of a methodology for identifying the authorship of binary and disassembled program codes based on an ensemble of modern natural language processing methods. Doklady Tomskogo gosudarstvennogo universiteta sistem upravleniya i radioelektroniki, 2023, vol. 26, no. 4, pp. 53–60. DOI: 10.21293/1818-0442-2023-26-4-53-60

Authors and copyright holders:

Kurtukova A. V., Tomsk State University of Control Systems and Radioelectronics (Tomsk, Russia)
Romanov A. S., Tomsk State University of Control Systems and Radioelectronics (Tomsk, Russia)
Shelupanov A. A., Tomsk State University of Control Systems and Radioelectronics (Tomsk, Russia)
