About Methods for Classifying Hidden Language Concepts in Specialized Texts Involving Pseudoinverse, Clustering and Data Grouping
This paper discusses the problems of analyzing hidden language concepts in Ukrainian-language scientific texts using text mining, dimensionality reduction, feature grouping, and linear classifiers.
A corpus of scientific texts and dictionaries, together with lists of stop words and affixes, was compiled for processing specialized texts. The resulting texts were analyzed and converted into a term frequency-inverse document frequency (TF-IDF) feature representation. To process the feature vectors, we propose using dimensionality reduction methods, in particular an algorithm for the synthesis of linear systems and the Karhunen-Loève transform, together with a feature grouping method, t-distributed stochastic neighbor embedding (t-SNE). A series of experiments was performed on test examples, in particular to determine the information density of a text and to classify specialized texts by keywords using the random sample consensus (RANSAC) method. A method for classifying hidden language concepts based on clustering (K-means) was proposed. As a result of the experiment, the structure of a classifier of hidden language concepts in structured texts was obtained; it achieved a relatively high recognition accuracy (97–99%) using the following classification algorithms: decision trees and extreme gradient boosting machines. The stability of the proposed method was investigated by perturbing the original data with a variational autoencoder. Test runs showed that a sparse autoencoder reduces the mean squared error but narrows the separation band, which affects the convergence of the classification algorithm.
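The first and last stages of the pipeline described above (TF-IDF features followed by K-means clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenized toy documents, the helper names `tfidf`, `sq_dist` and `kmeans`, the unsmoothed TF-IDF variant, and the deterministic centroid initialization are all assumptions made for illustration, and the dimensionality reduction steps are omitted.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of TF-IDF weights for tokenized documents.

    Assumed variant: tf = count / document length, idf = ln(N / df), no smoothing.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        rows.append([(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already tokenized documents on two topics.
docs = [
    ["matrix", "pseudoinverse", "matrix", "system"],
    ["cluster", "neighbor", "cluster", "embedding"],
    ["pseudoinverse", "system", "linear", "matrix"],
    ["embedding", "cluster", "neighbor", "point"],
]

vocab, features = tfidf(docs)
labels = kmeans(features, k=2)
print(labels)  # [0, 1, 0, 1]
```

On this toy input the two topics fall into separate clusters. With real texts, stop-word and affix removal would precede the TF-IDF step, and random or k-means++ initialization would normally replace the fixed seeding used here.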
In further research, we propose to apply other methods for analyzing structured texts and to explore ways of improving the separability of specialized texts with similar authorial styles but different topics using the proposed set of parameters.
Keywords: text processing, language concepts, pseudoinverse, clustering, data grouping methods.
Cite as: Krak I., Kulias A., Petrovych V., Kuznetsov V. About Methods for Classifying Hidden Language Concepts in Specialized Texts Involving Pseudoinverse, Clustering and Data Grouping. Cybernetics and Computer Technologies. 2021. 2. P. 68–75. (in Ukrainian) https://doi.org/10.34229/2707-451X.21.2.7
