Vol. 13, No. 2, December 2012

Valentin Zhikov
Ontotext AD
 
Ivelina Nikolova
IICT, Bulgarian Academy of Sciences and Ontotext AD 
 
Laura Toloşi
Ontotext AD
 
Yavor Ivanov
Xenium Ltd.
 
Borislav Popov
Ontotext AD
 
Georgi Georgiev
Ontotext AD
 
ENHANCING SOCIAL NEWS MEDIA IN BULGARIAN WITH NATURAL LANGUAGE PROCESSING
 
UDK: 811.163.2’322.2
Keywords: natural language processing, machine learning, language agnostic approaches, keyword extraction, text classifica
Abstract: In this work we introduce a system based on natural language processing techniques whose aim is to enhance social news media in Bulgarian. It solves the task of multi-class, multi-label classification of documents. We apply the algorithms to a collection of media articles from Svejo.net, a popular Bulgarian web resource comprising user-generated content. Our algorithms are one-versus-all classification methods widely used in the computational linguistics community. We describe the algorithms and the features employed, and we evaluate the impact of the features on the performance of the models. We thereby show that knowledge about the user and user behavior can greatly improve performance. Moreover, although our document collection is generated entirely by social media users, the quality of the classification results is comparable to that of previously reported studies. We also address the task of automatic keyword and keyphrase extraction from unstructured text, and adapt it to the needs of Svejo.net for the induction of ‘themes’. Themes are defined as text snippets that summarize the essence of an article. We evaluate the performance of several generic methods for keyword and keyphrase extraction on a corpus of articles in Bulgarian. The methods that we discuss rely on widely accepted information retrieval and machine learning techniques and are language independent. We also consider the effect of a stemmer component on keyphrase extraction accuracy. The satisfactory performance of our models, despite the limited linguistic knowledge incorporated in them, recommends them as a baseline for keyword and keyphrase extraction for the Bulgarian language.

REVIEWS

Georg Rehm, Hans Uszkoreit
STUDY BY EUROPE’S LEADING LANGUAGE TECHNOLOGY EXPERTS WARNS MOST EUROPEAN LANGUAGES UNLIKELY TO SURVIVE IN THE DIGITAL AGE

ARTICLE

Valentin Zhikov, Ivelina Nikolova, Laura Toloşi, Yavor Ivanov, Borislav Popov, Georgi Georgiev
ENHANCING SOCIAL NEWS MEDIA IN BULGARIAN WITH NATURAL LANGUAGE PROCESSING

Mihaela Colhon
ACQUIRING SYNTACTIC TRANSLATION RULES FROM A PARALLEL TREEBANK

Vesna Pajić, Stasa Vujičić Stanković, Milos Pajić
TRANSDUCERS FOR ANNOTATING WEATHER INFORMATION IN METEOROLOGICAL TEXTS IN SERBIAN

Zoran Ristović
FROM CORPUS TO CLASSROOM: THE USE OF ALIGNED CORPORA IN ENGLISH LANGUAGE TEACHING

REVIEWS

Jelena Mitrović
ANOTHER EXPERIENCE IN E-LEARNING

Biljana Lazić
DIGITAL HUMANITIES 2012: HOW TO BECOME A VOLUNTEER AND ATTEND A CONFERENCE

Gordana Nedeljkov
THE THIRD EUROPEAN SUMMER SCHOOL IN DIGITAL HUMANITIES “CULTURE AND TECHNOLOGY” – ONE PARTICIPANT’S EXPERIENCE

CONFERENCE

Predrag Djukić
IFLA WORLD LIBRARY AND INFORMATION CONGRESS: 78TH IFLA GENERAL CONFERENCE AND ASSEMBLY

REVIEWS

Darja Kovrlija, Valentina Tasić, Suzana Topalović
MULTIMEDIA DOCUMENT “PRIEST ĆIRA AND PRIEST SPIRA” THE UNITED KNOWLEDGE ON THE COMMON STUDENT’S PROJECT