Enhancing Social News Media in Bulgarian with Natural Language Proc

  • Valentin Zhikov Ontotext AD
  • Ivelina Nikolova IICT, Bulgarian Academy of Sciences and Ontotext AD
  • Laura Toloşi Ontotext AD
  • Yavor Ivanov Xenium Ltd.
  • Borislav Popov Ontotext AD
  • Georgi Georgiev Ontotext AD

Abstract

In this work we introduce a system based on natural language processing techniques whose aim is to enhance social news media in Bulgarian. It solves the task of multi-class, multi-label classification of documents. We apply the algorithms to a collection of media articles from Svejo.net, a popular Bulgarian web resource comprising user-generated content. Our algorithms are one-versus-all classification methods widely used in the computational linguistics community. We describe the algorithms and the features employed, and we evaluate the impact of the features on the performance of the models. In doing so, we show that knowledge about the user and user behavior can greatly improve performance. Moreover, although our document collection is generated entirely by social media users, the quality of the classification results is comparable to that of previously reported studies. We also address the task of automatic keyword and keyphrase extraction from unstructured text, and adapt it to the needs of Svejo.net for the induction of 'themes'. Themes are defined as text snippets that summarize the essence of an article. We evaluate the performance of several generic methods for keyword and keyphrase extraction on a corpus of articles in Bulgarian. The methods that we discuss rely on widely accepted information retrieval and machine learning techniques and are language-independent. We also consider the effect of a stemmer component on keyphrase extraction accuracy. The satisfactory performance of our models, in spite of the limited linguistic knowledge incorporated in them, recommends them as a baseline for keyword and keyphrase extraction for the Bulgarian language.
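
To make the one-versus-all, multi-label setup mentioned in the abstract concrete, the following is a minimal sketch only. It is not the system described in the paper: the perceptron base learner, the bag-of-words features, the function names and the toy English documents are all assumptions chosen for illustration. One binary model is trained per label, and a document receives every label whose model scores it positively.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; the paper's actual features are not shown here.
    return text.lower().split()


def train_one_vs_all(docs, label_sets, epochs=10):
    """Train one binary perceptron per label; returns {label: weights}."""
    labels = sorted(set().union(*label_sets))
    models = {}
    for label in labels:
        weights = defaultdict(float)
        for _ in range(epochs):
            for text, gold in zip(docs, label_sets):
                feats = tokenize(text) + ["<bias>"]
                score = sum(weights[f] for f in feats)
                # This label versus all others: +1 if the document has it, else -1.
                target = 1 if label in gold else -1
                if target * score <= 0:  # misclassified (or undecided): update
                    for f in feats:
                        weights[f] += target
        models[label] = weights
    return models


def predict(models, text):
    """Return every label whose binary model gives a positive score."""
    feats = tokenize(text) + ["<bias>"]
    return {
        label
        for label, weights in models.items()
        if sum(weights.get(f, 0.0) for f in feats) > 0
    }


# Hypothetical toy data: each document may carry several labels.
docs = [
    "football match goal",
    "election parliament vote",
    "football election scandal",
    "parliament vote budget",
]
label_sets = [{"sport"}, {"politics"}, {"sport", "politics"}, {"politics"}]

models = train_one_vs_all(docs, label_sets)
print(predict(models, "football election scandal"))
```

Because each label is decided independently, a document can receive several labels or none, which is what distinguishes this multi-label scheme from picking the single highest-scoring class.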

Published
2024-03-04
How to Cite
ZHIKOV, Valentin et al. Enhancing Social News Media in Bulgarian with Natural Language Proc. Infotheca - Journal for Digital Humanities, [S.l.], v. 13, n. 2, p. 6-18, mar. 2024. ISSN 2217-9461. Available at: <https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/391>. Date accessed: 22 jan. 2025.