Application of machine learning methods and feature selection based on a genetic algorithm in solving the problem of determining the authorship of a Russian-language text for cybersecurity


Authors: Kurtukova A. V., Romanov A. S., Fedotova A. M., Shelupanov A. A.

Abstract: The article examines approaches to determining the author of a natural-language text, along with their advantages and disadvantages. Identification is carried out using classical machine learning algorithms and neural network architectures (including fastText, CNN, LSTM and their hybrids, and BERT). The effectiveness of the models is evaluated on a dataset of social media texts. A separate experiment is devoted to feature selection using a genetic algorithm. An SVM trained on a set of 400 selected features improves accuracy by up to 10% for every number of candidate authors considered. Neural networks reach a classification accuracy of 96%, but in some cases their training time exceeds that of SVM and other classical machine learning methods. For SVM combined with the genetic algorithm, the average accuracy was 66%; for deep neural networks and fastText it was 73% and 68%, respectively.
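To make the feature selection step more concrete, the following is a minimal sketch of how a genetic algorithm can search for a fixed-size feature subset. It is not the authors' implementation: the abstract only states that a genetic algorithm selected a set of 400 features for the SVM. The function names (`random_mask`, `repair`, `select_features`), the operators (truncation selection, one-point crossover, bit-flip mutation, elitism) and all parameter values are assumptions chosen for illustration, and `toy_fitness` is a hypothetical stand-in for what would in practice be the cross-validated accuracy of a classifier trained on the selected features.

```python
import random
from typing import Callable, List

Mask = List[int]  # 1 = feature kept, 0 = feature dropped


def random_mask(n_features: int, n_selected: int, rng: random.Random) -> Mask:
    """Build a random mask with exactly n_selected features switched on."""
    chosen = set(rng.sample(range(n_features), n_selected))
    return [1 if i in chosen else 0 for i in range(n_features)]


def repair(mask: Mask, n_selected: int, rng: random.Random) -> Mask:
    """Flip randomly chosen bits until exactly n_selected features are on."""
    mask = list(mask)
    on = [i for i, bit in enumerate(mask) if bit]
    off = [i for i, bit in enumerate(mask) if not bit]
    rng.shuffle(on)
    rng.shuffle(off)
    while len(on) > n_selected:
        mask[on.pop()] = 0
    while len(on) < n_selected:
        idx = off.pop()
        mask[idx] = 1
        on.append(idx)
    return mask


def select_features(
    fitness: Callable[[Mask], float],
    n_features: int,
    n_selected: int,
    population_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.02,
    seed: int = 0,
) -> Mask:
    """Evolve feature masks and return the fittest one found."""
    rng = random.Random(seed)
    population = [
        random_mask(n_features, n_selected, rng) for _ in range(population_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]  # truncation selection
        children = [ranked[0]]  # elitism: carry the best mask over unchanged
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation, applied independently to every position.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(repair(child, n_selected, rng))
        population = children
    return max(population, key=fitness)


def toy_fitness(mask: Mask) -> float:
    """Hypothetical fitness: pretend only the first five features matter."""
    return float(sum(mask[:5]))


best = select_features(toy_fitness, n_features=30, n_selected=5)
selected_indices = [i for i, bit in enumerate(best) if bit]
```

In a setting like the one described in the abstract, `n_selected` would be 400 and the fitness function would train and score a classifier on the columns listed in `selected_indices`, which makes each fitness evaluation far more expensive than in this toy example.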

Keywords: authorship, text mining, machine learning, neural networks, deep learning, feature selection
