Creating a Synthetic Evaluation Dataset for Serbian SentiWordNet Using Large Language Models
Abstract
This study presents a synthetic evaluation dataset for Serbian SentiWordNet, created with Large Language Models (LLMs) and, in particular, the Mistral model. Sentiment analysis resources for Serbian are scarce, and the dataset is intended to support the evaluation and improvement of sentiment analysis tools for the language. In Serbian WordNet, sentiment polarity values were mapped automatically from the English SentiWordNet through the Inter-Lingual Index (ILI). An evaluation set was created so that these values can be refined to better fit the Serbian language context. First, 500 synsets were selected from Serbian WordNet based on their alignment with the senti-pol-sr lexicon and the values mapped from SentiWordNet, and the Mistral model classified their sentiment polarity. A balanced subset of 75 synsets was then randomly extracted, subjected to finer sentiment gradation, and manually reviewed. Approximately 93.3% of the responses met the established acceptability criteria, indicating a high degree of model reliability. This result suggests that LLMs such as Mistral can effectively automate sentiment analysis tasks for languages with limited resources, with potential for broader application in under-represented linguistic contexts.
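
The two data-preparation steps described above, ILI-based polarity mapping and balanced random sampling, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the identifiers (`srp-1`, `ili-1`), the toy scores, the three-way labelling rule in `polarity_label`, and the per-class sample size are all assumptions made for illustration, and the LLM classification step is omitted.

```python
import random
from collections import defaultdict


def map_polarity_via_ili(serbian_synsets, english_scores):
    """Copy (positive, negative) scores onto Serbian synsets that share an ILI id."""
    return {
        synset_id: english_scores[ili]
        for synset_id, ili in serbian_synsets.items()
        if ili in english_scores  # synsets without an English match are skipped
    }


def polarity_label(pos, neg):
    """Coarse label from a score pair (illustrative rule, not from the paper)."""
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "objective"


def balanced_sample(mapped, per_class, seed=0):
    """Randomly draw the same number of synsets from each polarity class."""
    by_class = defaultdict(list)
    for synset_id, (pos, neg) in sorted(mapped.items()):
        by_class[polarity_label(pos, neg)].append(synset_id)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        label: sorted(rng.sample(ids, per_class))
        for label, ids in by_class.items()
    }


# Hypothetical toy data: ILI id -> (positive, negative) score.
english_scores = {
    "ili-1": (0.75, 0.0), "ili-2": (0.5, 0.125),
    "ili-3": (0.0, 0.625), "ili-4": (0.125, 0.5),
    "ili-5": (0.0, 0.0), "ili-6": (0.25, 0.25),
}
# Hypothetical Serbian synsets keyed to ILI ids; srp-7 has no English match.
serbian_synsets = {f"srp-{n}": f"ili-{n}" for n in range(1, 8)}

mapped = map_polarity_via_ili(serbian_synsets, english_scores)
subset = balanced_sample(mapped, per_class=1)
```

Sampling per class rather than from the pooled set is what keeps the subset balanced; with real data, `per_class` would be chosen so that the class samples add up to the desired subset size.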

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.