Application of a Structural Support Vector Machine method to N-gram based text classification in Serbian

  • Jovana Kovacevic University of Belgrade
  • Jelena Graovac University of Belgrade

Abstract

The paper presents classification results for a hierarchically organized document corpus in Serbian, obtained using the Support Vector Machine (SVM) method. Two techniques derived from the SVM with structural output have been applied: multiclass flat classification and hierarchical classification. The joint representation of a document and the class, or hierarchy of classes, to which it belongs is specific to this form of the SVM method and is based on byte n-grams of different lengths. Four tf-idf statistics have been used to define the significance of an n-gram for a specific document. The techniques and statistics described have been tested on a hierarchically structured subset of the Ebart corpus of newspaper texts. The results obtained for the two types of classifiers are similar for the corpus as a whole, while the hierarchical classifier performs better for the most specific classes with a small number of texts.
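To make the byte n-gram representation concrete, the following is a minimal Python sketch, not the authors' implementation. It extracts overlapping byte n-grams from the UTF-8 encoding of each text and weights them with one common tf-idf variant (relative frequency times log(N/df)). The abstract does not say which four tf-idf statistics were used, so this formula, the UTF-8 encoding, and the function names byte_ngrams and tf_idf are assumptions made for illustration only.

```python
import math
from collections import Counter


def byte_ngrams(text, n):
    """Count all overlapping byte n-grams of length n in the UTF-8 bytes of text."""
    data = text.encode("utf-8")  # assumption: texts are encoded as UTF-8
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def tf_idf(docs, n):
    """Weight each n-gram per document by (count / total n-grams) * log(N / df).

    This is one common tf-idf variant, chosen for illustration; it is not
    necessarily one of the four statistics used in the paper.
    """
    counts = [byte_ngrams(d, n) for d in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    total_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {g: (k / total) * math.log(total_docs / df[g]) for g, k in c.items()}
        )
    return weights


if __name__ == "__main__":
    docs = ["Beograd je glavni grad", "Novi Sad je grad"]
    for w in tf_idf(docs, 3):
        # Show the three highest-weighted byte 3-grams of each document.
        print(sorted(w.items(), key=lambda kv: -kv[1])[:3])
```

Working on bytes rather than characters means that a Serbian letter such as "š", which occupies two bytes in UTF-8, contributes two byte unigrams, so no language-specific tokenization is needed.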
Published
2015-12-17
How to Cite
KOVACEVIC, Jovana; GRAOVAC, Jelena. Application of a Structural Support Vector Machine method to N-gram based text classification in Serbian. Infotheca - Journal for Digital Humanities, [S.l.], v. 16, n. 1-2, dec. 2015. ISSN 2217-9461. Available at: <https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2016.16.1_2.1_en>. Date accessed: 22 oct. 2017. doi: https://doi.org/10.18485/infotheca.2016.16.1_2.1.

Keywords

hierarchical text classification, Support Vector Machine Method, Ebart corpus