Feature Set Formation and Comparative Analysis of Classification Algorithms for AI Generated Code Detection

DOI: 10.21293/1818-0442-2025-28-4-121-126

Abstract: The paper presents a comprehensive approach to constructing a feature space for detecting artificially generated Python source code. We developed the Algorithmic_Analyzer class to extract 27 features categorized into four groups: basic code metrics, structural patterns, keywords, and libraries. Additionally, lexical patterns are captured using word n-grams. Experiments with classical machine learning algorithms demonstrate that structural characteristics are significantly more important than lexical features. The study identifies the most informative features for artificial code detection and establishes that the XGBClassifier model achieves the best performance, with an average F1_macro score of 0.90.
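As an illustration of the kind of feature space the abstract describes, the following is a minimal sketch using only the Python standard library. It is not the Algorithmic_Analyzer class from the paper: the function names (extract_features, word_ngrams, macro_f1) and the specific features are assumptions chosen for illustration, since the abstract names only the four feature groups and not the 27 individual features. The macro-averaged F1 helper shows how a score such as F1_macro is commonly computed (unweighted mean of per-class F1).

```python
import ast
import re
from collections import Counter


def extract_features(source: str) -> dict:
    """Illustrative features in the four groups named in the abstract.

    The individual feature names are hypothetical, not the paper's.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    nodes = list(ast.walk(tree))

    # Collect top-level package names from import statements.
    imported = set()
    for node in nodes:
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    return {
        # Group 1: basic code metrics
        "n_lines": len(lines),
        "n_blank_lines": sum(1 for ln in lines if not ln.strip()),
        "n_comment_lines": sum(1 for ln in lines if ln.strip().startswith("#")),
        # Group 2: structural patterns (counted on the syntax tree)
        "n_functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "n_loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "n_branches": sum(isinstance(n, ast.If) for n in nodes),
        # Group 3: keywords
        "n_return": sum(isinstance(n, ast.Return) for n in nodes),
        "n_try": sum(isinstance(n, ast.Try) for n in nodes),
        # Group 4: libraries
        "n_imports": len(imported),
        "uses_sys": int("sys" in imported),
    }


def word_ngrams(source: str, n: int = 2) -> Counter:
    """Count word n-grams over identifier-like tokens (comments included)."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical input snippet, used only to exercise the functions above.
sample = '''import sys

def solve(n):
    # sum of even numbers below n
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total

print(solve(int(sys.stdin.readline())))
'''

feats = extract_features(sample)
bigrams = word_ngrams(sample, n=2)
print(feats)
print(bigrams.most_common(3))
```

In a full pipeline, feature dictionaries like these would be assembled into a numeric matrix, optionally concatenated with n-gram counts, and passed to a classifier such as XGBClassifier; that step is omitted here to keep the sketch dependency-free.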

Keywords: machine learning, source code, language models, feature analysis, code classification

For citation:
Bukina S. G., Harchenko S. S. Feature Set Formation and Comparative Analysis of Classification Algorithms for AI Generated Code Detection. Doklady Tomskogo gosudarstvennogo universiteta sistem upravleniya i radioelektroniki, 2025, vol. 28, no. 4, pp. 121–126. DOI: 10.21293/1818-0442-2025-28-4-121-126

Authors and copyright holders:

Bukina S. G., Tomsk State University of Control Systems and Radioelectronics (Tomsk, Russia)
Harchenko S. S., Tomsk State University of Control Systems and Radioelectronics (Tomsk, Russia)
