Creating a Synthetic Evaluation Dataset for the Serbian SentiWordNet Using Large Language Models

  • Саша Петалинкар, University of Belgrade

Abstract

This study presents the creation of a synthetic evaluation dataset for the Serbian SentiWordNet using Large Language Models (LLMs), specifically the Mistral model. Sentiment analysis resources for Serbian are scarce, and the dataset is intended to support the evaluation and improvement of sentiment analysis tools for the language. Sentiment polarity values from the English SentiWordNet were automatically mapped to the Serbian WordNet via the Inter-Lingual Index (ILI). An evaluation set was then created so that these mapped values could be refined to better fit Serbian. First, 500 synsets from the Serbian WordNet were selected based on their alignment with the senti-pol-sr lexicon and the values mapped from SentiWordNet. These synsets were classified for sentiment polarity by the Mistral model. A balanced subset of 75 synsets was then randomly extracted, further refined for sentiment gradation, and manually reviewed. The findings indicate high model reliability: approximately 93.3% of responses met the established acceptability criteria, which suggests that LLMs such as Mistral can be effective for automating sentiment analysis in languages with limited resources.
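The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The synset identifiers, the three class labels (positive, negative, objective), the equal per-class draw, and all function names are assumptions made for illustration; the abstract states only that the subset of 75 was balanced and randomly extracted. The last function shows the arithmetic behind the reported figure: 93.3% of 75 responses corresponds to 70 acceptable responses.

```python
import random
from collections import defaultdict


def map_polarity_via_ili(english_scores, serbian_ili):
    """Copy (positive, negative) scores to Serbian synsets that share an ILI id.

    english_scores: hypothetical dict, ILI id -> (pos, neg) from English SentiWordNet
    serbian_ili:    hypothetical dict, Serbian synset id -> ILI id
    """
    return {
        sr_id: english_scores[ili]
        for sr_id, ili in serbian_ili.items()
        if ili in english_scores  # synsets without a mapped score are skipped
    }


def balanced_sample(labels, per_class, seed=0):
    """Randomly draw the same number of synsets from each predicted class.

    labels: dict, synset id -> class label assigned by the model (assumed labels)
    """
    by_class = defaultdict(list)
    for synset_id, label in sorted(labels.items()):
        by_class[label].append(synset_id)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(by_class[label], per_class))
    return sample


def acceptability_rate(accepted, total):
    """Percentage of reviewed responses that met the acceptability criteria."""
    return 100.0 * accepted / total
```

Under the assumption of three classes, calling `balanced_sample(labels, per_class=25)` on the 500 classified synsets would yield a subset of 75, and `acceptability_rate(70, 75)` gives roughly 93.3.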

Published
2025-03-17
How to Cite
ПЕТАЛИНКАР, Саша. Израда синтетичког евалуативног скупа података за Српски SentiWordNet користећи велике језичке моделе. Infotheca - Journal for Digital Humanities, [S.l.], v. 24, n. 1, p. 53-70, mar. 2025. ISSN 2217-9461. Available at: <https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2024.24.1.3_sr>. Date accessed: 10 apr. 2025.