Creation of Persian dataset for sentiment analysis in texts published in social networks
Subject Areas : AI and Robotics
Keywords: sentiment analysis, social media, Twitter, Persian dataset, text processing, data labeling
Abstract :
Sentiment analysis is the automatic detection of the sentiment expressed in social media posts such as text, images or video. It has become increasingly important in recent years because of the high volume of user-generated content on the Internet and the need for businesses and organizations to understand public opinion about their products or services. The accuracy and reliability of sentiment analysis algorithms depend on the quality of the dataset used for training and testing, so preparing a suitable dataset is essential to the success of sentiment analysis models. To this end, this paper presents a dataset for author sentiment analysis built from textual Twitter posts. Twitter was chosen as the data source for its popularity and diverse range of users. Further reasons were the informal, colloquial language of tweets, the presence of ambiguity, metaphor and irony, and the limit on text length. The tweets were labeled on the localized crowdsourcing platform of the ParsiAzma lab. Each tweet was labeled by three people, and the final label was decided by majority vote. The dataset has no subject restrictions and was labeled entirely by humans. It contains more than 5000 tweets, including 1948 positive tweets, 3021 negative tweets, and 284 neutral tweets. Sentiment is annotated at the document level, based on the overall feeling of the author of the text.
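The majority-vote step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `majority_label` and the label strings are hypothetical, and the abstract does not say how a three-way disagreement (one vote per class) is resolved, so this sketch simply returns `None` in that case.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label given by a strict majority of annotators.

    Returns None when no label has more than half of the votes
    (an assumption; the source does not describe tie handling).
    """
    if not votes:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical annotations: three votes per tweet, as in the abstract.
annotations = {
    "tweet_1": ["positive", "negative", "positive"],
    "tweet_2": ["negative", "negative", "negative"],
    "tweet_3": ["positive", "negative", "neutral"],
}

final_labels = {tid: majority_label(v) for tid, v in annotations.items()}
```

With three annotators, any label that receives at least two votes wins; only a tweet whose three votes all differ is left unresolved under this assumption.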