Semantic clustering technique to identify the signs of extremism in text information
DOI: 10.21293/1818-0442-2024-27-4-141-149
Abstract: This article presents a method for the semantic clustering of texts, intended for use in solving information security problems. In particular, the technique can be used to identify false and malicious text information prohibited by the legislation of the Russian Federation, propaganda of extremism and terrorism, and calls to incite ethnic, religious and other hatred and discord. The proposed method is based on the BERTopic framework with experimentally selected parameters and algorithms. The method was evaluated on data sets containing calls for extremism, terrorism, and incitement of religious and ethnic hatred. The quality metrics used were the silhouette coefficient, the Calinski–Harabasz index, and the Davies–Bouldin index. Based on the results obtained, the HDBSCAN algorithm with the Euclidean metric was chosen for clustering, the LaBSE model was selected for text representation, and the UMAP algorithm with the Jaccard metric was used for dimensionality reduction. This BERTopic configuration achieved average scores of 0.68 for the silhouette coefficient, 0.36 for the Davies–Bouldin index, and 136.49 for the Calinski–Harabasz index.
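To make the three quality metrics named in the abstract concrete, here is a minimal pure-Python sketch of how each one can be computed for a labelled set of points with the Euclidean distance. It is an illustration only, not the article's evaluation code: the toy 2D points stand in for reduced text embeddings, and all function and variable names are hypothetical. In practice the same quantities are available in scikit-learn as `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score`.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_label(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(pts) for coords in zip(*pts))


def silhouette(points, labels):
    """Mean silhouette coefficient: higher (closer to 1) is better."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def calinski_harabasz(points, labels):
    """Ratio of between- to within-cluster dispersion: higher is better."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(
        len(pts) * euclidean(centroid(pts), overall) ** 2
        for pts in clusters.values()
    )
    within = sum(
        euclidean(p, centroid(pts)) ** 2
        for pts in clusters.values()
        for p in pts
    )
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    """Mean worst-case cluster similarity: lower is better."""
    clusters = group_by_label(points, labels)
    cents = {key: centroid(pts) for key, pts in clusters.items()}
    scatter = {
        key: sum(euclidean(p, cents[key]) for p in pts) / len(pts)
        for key, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data (assumption): two well-separated square clusters.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

sil = silhouette(points, labels)
ch = calinski_harabasz(points, labels)
db = davies_bouldin(points, labels)
print(f"silhouette={sil:.3f}  Calinski-Harabasz={ch:.2f}  Davies-Bouldin={db:.3f}")
```

The three scores point in different directions: the silhouette coefficient and the Calinski–Harabasz index increase as clusters become more compact and better separated, while the Davies–Bouldin index decreases, which is why a low value such as the reported 0.36 is a favourable result.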
Keywords: BERTopic, information security, clustering, semantics
Authors and copyright holders:
—
For citation:
Romanov A. S. Semantic clustering technique to identify the signs of extremism in text information. Doklady Tomskogo gosudarstvennogo universiteta sistem upravleniya i radioelektroniki, 2024, vol. 27, no. 4, pp. 141–149. DOI: 10.21293/1818-0442-2024-27-4-141-149
Executive Secretary of the Editor’s Office
Editor’s Office: 40 Lenina Prospect, Tomsk, 634050, Russia
Phone / Fax: + 7 (3822) 701-582
Viktor N. Maslennikov
Executive Secretary of the Editor’s Office
Editor’s Office: 40 Lenina Prospect, Tomsk, 634050, Russia
Phone / Fax: + 7 (3822) 51-21-21 / 51-43-02