Wordnet-Based Serbian Text Categorization
Abstract
A Serbian text categorization technique based on the Serbian wordnet is presented. The author is guided by the hypothesis that the morphological, syntactic and semantic information contained in lexical resources can improve the categorization of text documents in Serbian, a morphologically rich language. The experiments are carried out on the Ebart-3 corpus, a collection of Serbian newspaper articles divided into three categories: Economics, Politics and Sport. The method relies on a list of representative synsets from the Serbian wordnet for each category and on a category assignment function defined in terms of these lists. Representative synsets are selected according to a significance weight that measures how strongly a synset is associated with the category under consideration. The inflection problem in Serbian is handled by means of the system of morphological dictionaries for Serbian. The technique is evaluated using micro- and macro-averaged Precision, Recall and F1 measures. For comparison, a second technique based on wordnet-encoded semantic domains is also developed. In that technique, the representative list for a category is not a set of well-chosen synsets but consists of all synsets that belong to the semantic domains corresponding to the category. The results show that the technique based on well-chosen synsets outperforms the technique based on semantic domains, even though the main motivation for enriching wordnet with semantic domains is to make it more successful in natural language processing tasks, especially text categorization.
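As a rough illustration of the pipeline outlined above, the following Python sketch assigns a category from per-category lists of representative synsets and then computes micro- and macro-averaged Precision, Recall and F1. It is not the author's implementation. The abstract does not define the category assignment function or the significance weight, so the sketch assumes the simplest possible assignment rule (the category whose list matches the most synset occurrences in the document wins) and takes the representative lists as given. The synset identifiers, the toy documents and the function names `assign_category` and `averaged_scores` are hypothetical, and the sketch assumes that lemmatization with the morphological dictionaries and the mapping of words to synsets have already been done.

```python
from collections import Counter


def assign_category(doc_synsets, representative):
    """Pick the category whose representative synsets occur most often.

    Assumed rule, not taken from the paper: score = number of synset
    occurrences in the document that belong to the category's list.
    Ties (including the all-zero case) go to the first category listed.
    """
    counts = Counter(doc_synsets)
    scores = {
        category: sum(counts[s] for s in synsets)
        for category, synsets in representative.items()
    }
    return max(scores, key=scores.get)


def _prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with 0.0 on empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def averaged_scores(gold, predicted, categories):
    """Return (micro, macro) triples of (precision, recall, F1).

    Micro-averaging pools the counts over all categories; macro-averaging
    takes the unweighted mean of the per-category scores.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # wrongly assigned to category p
            fn[g] += 1  # missed for category g
    per_category = [_prf(tp[c], fp[c], fn[c]) for c in categories]
    macro = tuple(sum(s[i] for s in per_category) / len(categories) for i in range(3))
    micro = _prf(
        sum(tp[c] for c in categories),
        sum(fp[c] for c in categories),
        sum(fn[c] for c in categories),
    )
    return micro, macro


# Hypothetical representative lists; real ones would come from the Serbian wordnet.
representative = {
    "Economics": {"syn:bank", "syn:inflation"},
    "Politics": {"syn:parliament", "syn:election"},
    "Sport": {"syn:match", "syn:goal"},
}

# Toy documents, already reduced to synset identifiers.
docs = [
    ["syn:bank", "syn:inflation", "syn:goal"],
    ["syn:election", "syn:parliament"],
    ["syn:match", "syn:goal", "syn:bank"],
    ["syn:bank", "syn:election", "syn:election"],
]
gold = ["Economics", "Politics", "Sport", "Economics"]

predicted = [assign_category(d, representative) for d in docs]
micro, macro = averaged_scores(gold, predicted, list(representative))
print(predicted)
print("micro P/R/F1:", micro)
print("macro P/R/F1:", macro)
```

The semantic-domain variant described above would fit the same sketch by replacing each representative list with the set of all synsets in the corresponding domains. In single-label classification such as this, the micro-averaged Precision, Recall and F1 coincide, which is why the macro-averaged values are the more informative of the two when categories differ in size.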
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.