Annotating the Corpus of Contemporary Serbian

Miloš Utvić University of Belgrade, Faculty of Philology, Department of Library and Information Science

Abstract

This article describes the stages (preparation and implementation) in annotating the 113-million-word Corpus of Contemporary Serbian. Annotation was carried out on several levels. Bibliographic information is attached to each corpus text. A part-of-speech (PoS) tagset was prepared, based on the electronic morphological dictionary of Serbian, together with a dictionary of possible annotations adapted for TreeTagger, the PoS tagging system. The corpus was then automatically annotated morphosyntactically with TreeTagger, i.e. part-of-speech and lemma information was attached to each corpus word form. TreeTagger was trained on INTERA, a manually tagged one-million-word corpus. The annotation procedure was evaluated using ten-fold cross-validation.
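The ten-fold cross-validation mentioned above can be illustrated with a minimal Python sketch. It is not the article's actual procedure: the function names, the toy sentences, and the tags `N` and `V` are hypothetical, and a simple most-frequent-tag baseline stands in for TreeTagger, whose training and tagging steps are not reproduced here.

```python
from collections import Counter, defaultdict


def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; fold i tests on every k-th sentence starting at i."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_baseline(train):
    """Hypothetical stand-in tagger (NOT TreeTagger): most frequent tag per word form."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen word forms
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda word: lexicon.get(word, default_tag)


def cross_validate(sentences, k=10):
    """Return (mean accuracy, per-fold accuracies) over k folds."""
    accuracies = []
    for train, test in k_fold_splits(sentences, k):
        tagger = train_baseline(train)
        tokens = [(w, t) for sentence in test for w, t in sentence]
        correct = sum(tagger(w) == t for w, t in tokens)
        accuracies.append(correct / len(tokens))
    return sum(accuracies) / k, accuracies


# Toy manually tagged data (hypothetical; a real run would use the training corpus).
toy_corpus = [
    [("pas", "N"), ("laje", "V")],
    [("mačka", "N"), ("spava", "V")],
] * 10

mean_accuracy, fold_accuracies = cross_validate(toy_corpus, k=10)
print(mean_accuracy, fold_accuracies)
```

Each sentence lands in exactly one test fold, so every token is tagged once by a model that never saw it during training; the mean over the ten folds is the reported estimate.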

Published

2024-03-06

How to Cite

UTVIĆ, Miloš. Annotating the Corpus of Contemporary Serbian. Infotheca - Journal for Digital Humanities, [S.l.], v. 12, n. 2, p. 36-47, mar. 2024. ISSN 2217-9461. Available at: <https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/435>. Date accessed: 14 july 2026.

Issue

Vol 12 No 2 (2011): Infotheca - Journal of Informatics and Librarianship

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

		Faculty of Philology, University of Belgrade
		University Library „Svetozar Marković“
		Association of Libraries of the Universities of Serbia
