2021, issue 2, p. 68-75
Received 24.02.2021; Revised 12.03.2021; Accepted 24.06.2021
Published 30.06.2021; First Online 01.07.2021
https://doi.org/10.34229/2707-451X.21.2.7
About Methods for Classifying Hidden Language Concepts in Specialized Texts Involving Pseudoinverse, Clustering and Data Grouping
Iurii Krak 1,2 *, Anatoliy Kulias 1, Valentina Petrovych 1 *, Vladyslav Kuznetsov 1 *
1 V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine, Kyiv
2 Taras Shevchenko National University of Kyiv, Ukraine
This paper discusses the analysis of hidden language concepts in Ukrainian-language scientific texts using text mining, dimensionality reduction, feature grouping, and linear classifiers.
A corpus of scientific texts and dictionaries, together with lists of stop words and affixes, was compiled for processing specialized texts. The resulting texts were analyzed and converted into a term frequency-inverse document frequency (TF-IDF) feature representation. To process the feature vectors, we propose dimensionality reduction methods, in particular the algorithm for the synthesis of linear systems and the Karhunen-Loève transform, together with a feature grouping method, t-distributed stochastic neighbor embedding (t-SNE). A series of experiments was performed on test examples, in particular to determine the information density of a text and to classify specialized texts by keywords using the random sample consensus (RANSAC) method. A method for classifying hidden language concepts based on clustering (K-means) was proposed. As a result of the experiments, a classifier structure for hidden language concepts in structured texts was obtained; it achieved a relatively high recognition accuracy (97–99%) with decision trees and extreme gradient boosting as classification algorithms. The stability of the proposed method was investigated by perturbing the original data with a variational autoencoder. Test runs showed that a sparse autoencoder reduces the mean squared error but also narrows the separation margin, which affects the convergence of the classification algorithm.
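As a rough illustration of two steps named above, the TF-IDF representation and K-means clustering, the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the function names `tfidf` and `kmeans`, the smoothed IDF variant, the L2 row normalization, the caller-supplied initial centers, and the four toy English documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows): one L2-normalized TF-IDF row per tokenized document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant; the abstract does not specify one).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        row = [counts[t] / length * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centers, n_iter=20):
    """Lloyd's algorithm with caller-supplied initial centers (deterministic)."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy corpus (hypothetical): two documents per topic, no shared terms across topics.
docs = [
    "matrix pseudoinverse linear system".split(),
    "pseudoinverse matrix projection linear".split(),
    "cluster centroid distance grouping".split(),
    "grouping cluster neighbor distance".split(),
]
vocab, X = tfidf(docs)
labels, centers = kmeans(X, init_centers=[X[0], X[2]])
print(labels)
```

With these toy documents, the first two and the last two fall into separate clusters because the two topics share no vocabulary. On a real corpus the feature vectors would normally pass through a dimensionality reduction step before clustering, as the abstract describes.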
In further research, we propose to apply other methods for analyzing structured texts and to explore ways of improving the separability of specialized texts with similar authorial styles but different topics using the proposed set of parameters.
Keywords: text processing, language concepts, pseudoinverse, clustering, data grouping methods.
Cite as: Krak I., Kulias A., Petrovych V., Kuznetsov V. About Methods for Classifying Hidden Language Concepts in Specialized Texts Involving Pseudoinverse, Clustering and Data Grouping. Cybernetics and Computer Technologies. 2021. 2. P. 68–75. (in Ukrainian) https://doi.org/10.34229/2707-451X.21.2.7
ISSN 2707-451X (Online)
ISSN 2707-4501 (Print)