Application of a Structural Support Vector Machine method to N-gram based text classification in Serbian

Jovana Kovacevic University of Belgrade
Jelena Graovac University of Belgrade

DOI: https://doi.org/10.18485/infotheca.2016.16.1_2.1

Abstract

The paper presents the results of classifying a hierarchically organized document corpus in Serbian using the Support Vector Machine (SVM) method. Two techniques derived from the SVM with structural output have been applied: multiclass flat classification and hierarchical classification. The joint representation of a document and the class, or hierarchy of classes, to which the document belongs is specific to this form of the SVM method and is based on byte n-grams of different lengths. Four tf-idf statistics have been used to define the significance of an n-gram for a specific document. The techniques and statistics described have been tested on a hierarchically structured subset of the Ebart corpus of newspaper texts. The results obtained for the two types of classifier are similar for the corpus as a whole, while the hierarchical classifier performs better for most specific classes with a small number of texts.
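As an illustration of the document representation mentioned in the abstract, the following is a minimal sketch of byte n-gram extraction with one common tf-idf weighting. The abstract does not specify the four tf-idf statistics or the n-gram lengths used, so the formula (relative frequency times log of inverse document frequency), the function names `byte_ngrams` and `tfidf`, and the default lengths 2 to 4 are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def byte_ngrams(text, n_min=2, n_max=4):
    """Count byte n-grams of lengths n_min..n_max in the UTF-8 bytes of text.

    The length range is an illustrative assumption, not taken from the paper.
    """
    data = text.encode("utf-8")
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def tfidf(docs, n_min=2, n_max=4):
    """Weight each document's byte n-grams by tf * log(N / df).

    This is one standard tf-idf variant, assumed here for illustration;
    the paper's four statistics may differ.
    """
    per_doc = [byte_ngrams(d, n_min, n_max) for d in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())

    n_docs = len(docs)
    weighted = []
    for counts in per_doc:
        total = sum(counts.values())
        weighted.append({
            gram: (count / total) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        })
    return weighted


if __name__ == "__main__":
    # Serbian letters such as "č" occupy two bytes in UTF-8, so byte
    # n-grams can cut across character boundaries.
    for gram, weight in tfidf(["čitanje knjiga", "pisanje knjiga"])[0].items():
        print(gram, round(weight, 4))
```

Working on bytes rather than characters avoids any tokenization or language-specific preprocessing, which is one reason byte n-grams are convenient for morphologically rich languages; n-grams that occur in every document receive weight zero under this particular formula.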

Published

2015-12-17

How to Cite

KOVACEVIC, Jovana; GRAOVAC, Jelena. Application of a Structural Support Vector Machine method to N-gram based text classification in Serbian. Infotheca - Journal for Digital Humanities, [S.l.], v. 16, n. 1-2, dec. 2015. ISSN 2217-9461. Available at: <https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2016.16.1_2.1_en>. Date accessed: 14 feb. 2026. doi: https://doi.org/10.18485/infotheca.2016.16.1_2.1.

Issue

Vol 16 No 1-2 (2016): Infotheca - Journal for Digital Humanities

Section

Articles

Keywords

hierarchical text classification, Support Vector Machine Method, Ebart corpus

Faculty of Philology, University of Belgrade
University Library „Svetozar Marković“
Association of Libraries of the Universities of Serbia
