Creation of a Persian dataset for sentiment analysis of texts published on social networks
Keywords: sentiment analysis, social media, Twitter, Persian dataset, text processing, data labeling
Abstract:
Sentiment analysis is the automatic detection of sentiments expressed in social media posts such as text, images, or video. It has become increasingly important in recent years because of the high volume of user-generated content on the Internet and the need of businesses and organizations to understand public opinion about their products or services. The accuracy and reliability of sentiment analysis algorithms depend on the quality of the dataset used for training and testing, so preparing a suitable dataset is essential to the success of sentiment analysis models. To this end, this paper presents a Persian dataset for author sentiment analysis built from textual Twitter posts. Twitter was chosen as the data source because of its popularity and diverse range of users. Further reasons were the informal and colloquial language of tweets, the presence of ambiguity, metaphor, and irony, and the limit on text length. The localized crowdsourcing platform of the ParsiAzma lab was used for tagging the tweets. Each tweet was tagged by three people, and the final tag was decided by majority vote. The dataset has no subject restrictions and was labeled entirely by humans. It contains more than 5000 tweets, including 1948 positive tweets, 3021 negative tweets, and 284 neutral tweets. Sentiment is annotated at the document level, based on the overall feeling of the author of the text.
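The majority-vote step described in the abstract can be illustrated with a minimal Python sketch. This is not the ParsiAzma platform's actual implementation: the function name `majority_label`, the example tweet IDs, and the treatment of three-way disagreements (returned as `None`, i.e. unresolved) are assumptions made here for illustration, since the abstract does not say how ties are handled. The second part only recomputes class shares from the three counts reported in the abstract.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(tags: Sequence[str]) -> Optional[str]:
    """Return the tag chosen by a strict majority of annotators.

    Assumption: if no tag has a strict majority (e.g. three annotators
    give three different tags), the tweet is left unresolved (None).
    """
    if not tags:
        return None
    label, count = Counter(tags).most_common(1)[0]
    return label if count > len(tags) / 2 else None


# Hypothetical annotations: three tags per tweet, as in the abstract.
annotations = {
    "tweet_1": ["positive", "positive", "negative"],
    "tweet_2": ["negative", "negative", "negative"],
    "tweet_3": ["positive", "negative", "neutral"],  # no majority
}
final_labels = {tid: majority_label(tags) for tid, tags in annotations.items()}

# Class counts as reported in the abstract.
reported_counts = {"positive": 1948, "negative": 3021, "neutral": 284}
total = sum(reported_counts.values())
shares = {label: round(100 * n / total, 1) for label, n in reported_counts.items()}

if __name__ == "__main__":
    print(final_labels)
    print(total, shares)
```

The recomputed shares show that the neutral class is a small minority of the three reported classes, which is worth keeping in mind when choosing evaluation metrics for models trained on this data.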