Towards Stacking Ensemble-based Fine-Grained Hostile Class Classification (FGHCC) of Hindi Posts

© 2023 by IJETT Journal
Volume-71 Issue-10
Year of Publication : 2023
Author : Ankita Sharma, Udayan Ghose
DOI : 10.14445/22315381/IJETT-V71I10P218

How to Cite?

Ankita Sharma, Udayan Ghose, "Towards Stacking Ensemble-based Fine-Grained Hostile Class Classification (FGHCC) of Hindi Posts," International Journal of Engineering Trends and Technology, vol. 71, no. 10, pp. 191-204, 2023. Crossref, https://doi.org/10.14445/22315381/IJETT-V71I10P218

Abstract
Hostile Online Content (HOC) has surged in recent years, and detecting and classifying it on Online Social Platforms (OSPs) has become an important research area for curbing the toxicity of those platforms. Numerous efforts have addressed this issue in resource-affluent languages. Detecting and classifying hostile content in Hindi remains challenging because of the nature of the language and its constrained resources, such as the lack of adequate multilabel hostile datasets. Hindi online content (OC) has grown rapidly since the emergence of the UTF-8 standard, and malicious Hindi OC has grown with it, so there is a pressing need to classify and curb hostile Hindi content on OSPs. This paper treats Fine-Grained Hostile Class Classification (FGHCC) in Hindi (Devanagari script) as a multilabel problem, since significant overlap exists among the hostile classes. The work uses the Hindi Hostility Dataset and focuses exclusively on FGHCC because the task is emerging and existing research on it is scarce. A two-tiered stacking ensemble of classifiers is introduced. It combines problem transformation methods (PTMs) with several Machine Learning Models (MLMs), namely Gaussian Naive Bayes (GNB), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), and Stochastic Gradient Descent (SGD), using TF-IDF-weighted unigrams as features. The experimental results show that the proposed two-layered stacking ensemble achieves the highest weighted F1 score, 0.60, outperforming the same MLMs used individually under the One-vs-Rest (OVR), Binary Relevance (BR), Classifier Chains (CC), and Label Powerset (LPS) transformation approaches. The proposed model also performs competitively with more complex models reported in the literature, which indicates its efficacy in detecting fine-grained hostile classes in a resource-constrained scenario.
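The sketch below illustrates two of the problem transformation methods named in the abstract (Binary Relevance and Label Powerset) and a support-weighted F1 score. It is not the authors' implementation. The label names, the toy data, and every function name are assumptions made for illustration, and the weighted F1 is assumed to be the per-label F1 averaged with weights proportional to each label's number of true instances.

```python
from collections import Counter

# Illustrative label names (an assumption; the abstract does not list them).
LABELS = ["defamation", "fake", "hate", "offensive"]


def binary_relevance(y):
    """Binary Relevance: one independent 0/1 target vector per label."""
    return {lab: [1 if lab in labels else 0 for labels in y] for lab in LABELS}


def label_powerset(y):
    """Label Powerset: each distinct label combination becomes one class id."""
    classes = {}
    ids = []
    for labels in y:
        key = tuple(sorted(labels))
        # Assign the next free id the first time a combination is seen.
        ids.append(classes.setdefault(key, len(classes)))
    return ids, classes


def f1_for_label(y_true, y_pred, label):
    """F1 of a single label, computed from TP/FP/FN counts over all posts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights equal to each label's support."""
    support = Counter(lab for labels in y_true for lab in labels)
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(support[lab] * f1_for_label(y_true, y_pred, lab)
               for lab in LABELS) / total


# Toy multilabel targets: each post carries a set of hostile classes.
y_true = [{"fake"}, {"hate", "offensive"}, {"defamation"}, {"hate"},
          {"hate", "offensive"}]
y_pred = [{"fake"}, {"hate", "offensive"}, {"offensive"}, {"hate"}, {"hate"}]

br_targets = binary_relevance(y_true)
lp_ids, lp_classes = label_powerset(y_true)
score = weighted_f1(y_true, y_pred)
```

Under Binary Relevance each label gets its own binary classifier, so overlap between classes is ignored during training. Label Powerset keeps the overlap by treating every observed combination as a single class, at the cost of many rare classes when combinations are sparse.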

Keywords
Hindi, Hostile posts, Machine learning, Multilabel text classification, Stacking ensemble.
