Random forest based pseudorandom sequences classification algorithm
Authors: Kozachok A. V., Spirin A. A., Golembiovskaya O. M.
Abstract: The number of confidential data leaks caused by insiders has increased recently. Because modern DLP systems cannot detect or prevent the leakage of information that has been encrypted or compressed, an algorithm is proposed for classifying pseudorandom sequences produced by data encryption and compression algorithms. The classifier is built with the random forest algorithm. The feature space consists of the frequencies of occurrence of 9-bit binary subsequences together with statistical characteristics of the byte distribution of each sequence. The presented algorithm achieved an accuracy of 0.99 in classifying pseudorandom sequences. The proposed algorithm can improve existing DLP systems by increasing the accuracy with which encrypted and compressed data are classified.
Keywords: statistical analysis of data, machine learning, classification of binary sequences, DLP systems, protection against information leakage
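The feature space described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say whether the 9-bit subsequences overlap or which byte statistics are used, so the sliding window with a 1-bit step and the choice of mean, variance, and Shannon entropy are assumptions made here for illustration. The function names are hypothetical as well.

```python
import math
from collections import Counter
from typing import List

WINDOW_BITS = 9  # subsequence length named in the abstract


def bit_window_frequencies(data: bytes, width: int = WINDOW_BITS) -> List[float]:
    """Relative frequency of every width-bit pattern in data.

    Assumption: overlapping windows that advance one bit at a time.
    """
    counts = [0] * (1 << width)
    mask = (1 << width) - 1
    window = 0
    seen_bits = 0
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            seen_bits += 1
            if seen_bits >= width:  # count only once a full window is filled
                counts[window] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def byte_statistics(data: bytes) -> List[float]:
    """Mean, variance, and Shannon entropy of the byte values.

    Assumption: these three statistics stand in for the unspecified
    "statistical characteristics of the byte distribution".
    """
    n = len(data)
    if n == 0:
        return [0.0, 0.0, 0.0]
    mean = sum(data) / n
    variance = sum((b - mean) ** 2 for b in data) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
    return [mean, variance, entropy]


def extract_features(data: bytes) -> List[float]:
    """Concatenate the 512 pattern frequencies with the byte statistics."""
    return bit_window_frequencies(data) + byte_statistics(data)
```

Under these assumptions each sequence maps to a vector of 512 + 3 = 515 values, which could then be passed to any random forest implementation for training and classification.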