Application of machine learning methods and feature selection based on a genetic algorithm in solving the problem of determining the authorship of a Russian-language text for cybersecurity — Doklady Tomskogo gosudarstvennogo universiteta sistem upravleniya i radioelektroniki

Abstract: The article examines approaches to determining the author of a natural-language text, along with the advantages and disadvantages of these approaches. Identification is carried out using classical machine learning algorithms and neural network architectures (including fastText, CNN, LSTM and their hybrids, and BERT). Model performance is evaluated on a dataset of social media texts. A separate experiment is devoted to feature selection using a genetic algorithm. An SVM trained on the selected set of 400 features achieves an accuracy gain of up to 10% for all considered numbers of authors. Neural networks reach a classification accuracy of 96%, but in some cases their training time exceeds that of SVM and other classical machine learning methods. The average accuracy was 66% for SVM combined with the genetic algorithm, 73% for deep neural networks, and 68% for fastText.
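The genetic-algorithm feature selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `select_features`, its parameters, the truncation selection, the union-based crossover, and the toy fitness function are all assumptions made for illustration. In a setting like the one described, the fitness of a feature subset would presumably be the cross-validated accuracy of an SVM trained on those features, and `n_selected` would play the role of the 400-feature set.

```python
import random


def select_features(fitness, n_features, n_selected,
                    pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Search for a subset of `n_selected` feature indices that maximizes `fitness`.

    `fitness` is any callable mapping a frozenset of indices to a number.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    all_idx = range(n_features)

    def random_subset():
        return frozenset(rng.sample(all_idx, n_selected))

    def crossover(a, b):
        # The child draws its features from the union of both parents.
        pool = sorted(a | b)
        return frozenset(rng.sample(pool, n_selected))

    def mutate(subset):
        # Swap each feature for an unused one with probability mutation_rate,
        # so the subset size never changes.
        chosen = set(subset)
        for idx in sorted(subset):
            if rng.random() < mutation_rate:
                unused = [i for i in all_idx if i not in chosen]
                if unused:
                    chosen.remove(idx)
                    chosen.add(rng.choice(unused))
        return frozenset(chosen)

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        elite = ranked[:2]                 # best two survive unchanged
        children = [
            mutate(crossover(*rng.sample(parents, 2)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness)


# Toy fitness (an assumption): features 0..4 are "informative", the rest are noise.
INFORMATIVE = frozenset(range(5))


def toy_fitness(subset):
    return len(subset & INFORMATIVE)


best = select_features(toy_fitness, n_features=50, n_selected=5)
```

Replacing `toy_fitness` with a function that trains and scores a classifier on the chosen columns turns this into a wrapper-style selector. Each fitness call then costs one training run, which is why small populations and few generations are used in this sketch.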

Keywords: authorship, text mining, machine learning, neural networks, deep learning, feature selection

For citation:
Kurtukova A. V., Romanov A. S., Fedotova A. M., Shelupanov A. A. Application of machine learning methods and feature selection based on a genetic algorithm in solving the problem of determining the authorship of a Russian-language text for cybersecurity. Doklady Tomskogo gosudarstvennogo universiteta sistem upravleniya i radioelektroniki, 2022, vol. 25, no. 1, pp. 79–85. DOI: 10.21293/1818-0442-2021-25-1-79-85