Complexity Approach to the Texts Categorization Problem
V.B. Balakirsky (University of Essen, Germany)
We consider the categorization problem, where a category is a set of texts grouped together on the basis of expert opinions. The problem is formulated as discovering rules that represent the training set of texts as a union of given collections, each labeled with the name of a category. Applying these rules to an unknown text identifies the categories that contain it. We restrict our attention to the situation in which categorization must be based on the statistics of the text rather than on the text itself. The presented algorithm constructs a modified dictionary for a fixed category using ``the complexity approach''. For an unknown text, the verifier checks how much the constructed modification of the dictionary reduces the text statistics. Numerical results are given for 9 categories of the Reuters-21578 dataset.
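The abstract does not specify how the dictionary is modified or how the reduction is measured, so the following is only a minimal sketch of the general idea under simplifying assumptions. Here "text statistics" are taken to be word counts, the per-category dictionary is the set of words seen in at least `min_texts` training texts, and the "reduction" is the fraction of the total word count accounted for by dictionary words. The function names, the `min_texts` rule, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def word_stats(text):
    """Reduce a text to word-count statistics (assumed form of the statistics)."""
    return Counter(text.lower().split())


def build_dictionary(training_stats, min_texts=2):
    """Hypothetical rule: keep words occurring in at least `min_texts`
    training texts of the category."""
    doc_freq = Counter()
    for stats in training_stats:
        doc_freq.update(stats.keys())  # count each word once per text
    return {word for word, n in doc_freq.items() if n >= min_texts}


def reduction(stats, dictionary):
    """Fraction of the total word count covered by dictionary words
    (an assumed stand-in for how much the dictionary reduces the statistics)."""
    total = sum(stats.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in stats.items() if word in dictionary)
    return covered / total


def verify(stats, dictionary, threshold=0.5):
    """Accept the text for the category if the reduction reaches the
    (hypothetical) threshold."""
    return reduction(stats, dictionary) >= threshold
```

In this sketch a verifier for one category would be used as `verify(word_stats(unknown_text), dictionary)`, with one dictionary built per category from that category's training statistics, so a text can be accepted by several categories at once.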