Semantic clustering technique to identify the signs of extremism in text information


Authors: Romanov A. S.

Abstract: This article presents a method for semantic clustering of texts intended for information security tasks. In particular, the technique can be used to identify false and malicious text content prohibited by the legislation of the Russian Federation, including propaganda of extremism and terrorism and calls to incite ethnic, religious, and other hatred and discord. The proposed method is based on the BERTopic framework with experimentally selected parameters and algorithms. It was evaluated on data sets containing calls for extremism and terrorism and incitement of religious and ethnic hatred. The quality metrics were the silhouette coefficient, the Calinski–Harabasz index, and the Davies–Bouldin index. Based on the results, the HDBSCAN algorithm with the Euclidean metric was chosen for clustering, the LaBSE model for text representation, and the UMAP algorithm with the Jaccard metric for dimensionality reduction. This BERTopic configuration yielded average scores of 0.68 for the silhouette coefficient, 0.36 for the Davies–Bouldin index, and 136.49 for the Calinski–Harabasz index.
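
The three quality metrics named in the abstract can be illustrated with a minimal sketch in pure Python. This is not the author's evaluation code: the toy 2-D points, the labels, and all function names below are assumptions made for illustration, and the Euclidean distance is used throughout. Higher silhouette and Calinski–Harabasz values and lower Davies–Bouldin values indicate more compact, better separated clusters. The sketch treats every label as a cluster, so points that HDBSCAN marks as noise (label -1) would need to be filtered out beforehand.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def group_by_label(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    return clusters

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (range -1..1)."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * euclidean(c, overall) ** 2
        within += sum(euclidean(p, c) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(points, labels):
    """Mean over clusters of the worst-case similarity to another cluster."""
    clusters = group_by_label(points, labels)
    cents = {k: centroid(m) for k, m in clusters.items()}
    # scatter: mean distance of the members to their own centroid
    scatter = {k: sum(euclidean(p, cents[k]) for p in m) / len(m)
               for k, m in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)

# Toy data (hypothetical): two well-separated unit squares in the plane.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

sil = silhouette(points, labels)
ch = calinski_harabasz(points, labels)
db = davies_bouldin(points, labels)
print(f"silhouette={sil:.3f}  Calinski-Harabasz={ch:.2f}  Davies-Bouldin={db:.3f}")
```

In practice the points would be the reduced text embeddings and the labels the cluster assignments produced by the pipeline. scikit-learn offers the same three metrics as `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` in `sklearn.metrics`, which are preferable for data sets of realistic size.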

Keywords: bertopic, information security, clustering, semantics

Editorial office address

Executive Secretary of the Editor’s Office

 Editor’s Office: 40 Lenina Prospect, Tomsk, 634050, Russia

  Phone / Fax: + 7 (3822) 701-582

  journal@tusur.ru

