INFOTHECA - Journal of Informatics and Librarianship

Vol. 13, No. 2, December 2012

Valentin Zhikov

Ontotext AD

Ivelina Nikolova

IICT, Bulgarian Academy of Sciences and Ontotext AD

Laura Toloşi

Ontotext AD

Yavor Ivanov

Xenium Ltd.

Borislav Popov

Ontotext AD

Georgi Georgiev

Ontotext AD

ENHANCING SOCIAL NEWS MEDIA IN BULGARIAN WITH NATURAL LANGUAGE PROCESSING

UDK: 811.163.2’322.2

Keywords: natural language processing, machine learning, language agnostic approaches, keyword extraction, text classification

Abstract: In this work we introduce a system based on natural language processing techniques whose aim is to enhance social news media in Bulgarian. It solves the task of multi-class, multi-label classification of documents. We apply the algorithms to a collection of media articles from Svejo.net, a popular Bulgarian web resource comprising user-generated content. Our algorithms are one-versus-all classification methods widely used in the computational linguistics community. We describe the algorithms and the features employed, and we evaluate the impact of the features on the performance of the models. We thereby show that knowledge about the user and user behavior can greatly improve performance. Moreover, although our document collection is generated entirely by social media users, the quality of the classification results is comparable to that of previously reported studies. We also address the task of automatic keyword and keyphrase extraction from unstructured text, and adapt it to the needs of Svejo.net for the induction of 'themes'. Themes are defined as text snippets that summarize the essence of an article. We evaluate the performance of several generic methods for keyword and keyphrase extraction on a corpus of articles in Bulgarian. The methods that we discuss rely on widely accepted information retrieval and machine learning techniques and are language independent. We also consider the effect of a stemmer component on the keyphrase extraction accuracy. The satisfactory performance of our models, in spite of the limited linguistic knowledge incorporated in them, recommends them as a baseline for keyword and keyphrase extraction for the Bulgarian language.
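To make the one-versus-all idea mentioned in the abstract concrete, the following minimal sketch trains one binary classifier per label and predicts every label whose classifier fires, which is what allows a document to receive several labels at once. The abstract does not name the base learner or the features, so the perceptron, the whitespace bag-of-words tokens, the toy English documents, and all function names here are assumptions made purely for illustration, not the authors' implementation.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real features.
    return text.lower().split()


def train_one_vs_all(docs, label_sets, epochs=10):
    """Train one binary perceptron per label (one-versus-all)."""
    labels = sorted(set().union(*label_sets))
    models = {}
    for label in labels:
        weights = defaultdict(float)
        bias = 0.0
        for _ in range(epochs):
            for text, doc_labels in zip(docs, label_sets):
                # Documents carrying this label are positives, all others negatives.
                target = 1 if label in doc_labels else -1
                tokens = tokenize(text)
                score = bias + sum(weights[t] for t in tokens)
                if target * score <= 0:  # misclassified or on the boundary
                    for t in tokens:
                        weights[t] += target
                    bias += target
        models[label] = (dict(weights), bias)
    return models


def predict_labels(models, text):
    """Return every label whose binary classifier scores above zero."""
    tokens = tokenize(text)
    predicted = set()
    for label, (weights, bias) in models.items():
        if bias + sum(weights.get(t, 0.0) for t in tokens) > 0:
            predicted.add(label)
    return predicted


# Hypothetical toy corpus: each document may carry several labels.
docs = [
    "football match goal",
    "election parliament vote",
    "football election scandal",
]
label_sets = [{"sport"}, {"politics"}, {"sport", "politics"}]

models = train_one_vs_all(docs, label_sets)
for text in docs:
    print(text, "->", sorted(predict_labels(models, text)))
```

Because each label is decided independently, the third toy document can be assigned both labels; a single multi-class classifier that picks exactly one winner could not do that.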

REVIEWS

Georg Rehm, Hans Uszkoreit
STUDY BY EUROPE’S LEADING LANGUAGE TECHNOLOGY EXPERTS WARNS MOST EUROPEAN LANGUAGES UNLIKELY TO SURVIVE IN THE DIGITAL AGE

ARTICLE

Valentin Zhikov, Ivelina Nikolova, Laura Toloşi, Yavor Ivanov, Borislav Popov, Georgi Georgiev
ENHANCING SOCIAL NEWS MEDIA IN BULGARIAN WITH NATURAL LANGUAGE PROCESSING

Mihaela Colhon
ACQUIRING SYNTACTIC TRANSLATION RULES FROM A PARALLEL TREEBANK

Vesna Pajić, Stasa Vujičić Stanković, Milos Pajić
TRANSDUCERS FOR ANNOTATING WEATHER INFORMATION IN METEOROLOGICAL TEXTS IN SERBIAN

Zoran Ristović
FROM CORPUS TO CLASSROOM: THE USE OF ALIGNED CORPORA IN ENGLISH LANGUAGE TEACHING
