Dataset for Detecting AI-Generated Source Code


Authors: Bukina S. G., Harchenko S. S.

Abstract: Modern generative language models are widely used for automated source code generation, which makes methods for detecting such code necessary. However, building datasets for identifying machine-generated code remains a challenging task. This paper analyzes existing datasets and identifies their limitations. It then presents an original dataset of Python solutions to programming problems, some written by humans and some generated by state-of-the-art language models. The dataset is evaluated experimentally using machine learning methods. The results show that the proposed dataset is promising, while indicating that it needs further expansion or that new experiments are needed to identify the optimal model.

Keywords: code classification, dataset, language models, machine learning, source code

Editorial office address

Executive Secretary of the Editor’s Office

 Editor’s Office: 40 Lenina Prospect, Tomsk, 634050, Russia

  Phone / Fax: + 7 (3822) 701-582

  journal@tusur.ru

 
