KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement

Samarth Garg, Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra

TL;DR

KTCR tackles the evolving challenge of implicit hate speech detection by introducing a knowledge transfer–driven concept refinement framework that distills and refines implicit-hate concepts via a teacher–student setup, autoencoder-based activation mapping, prototype alignment, and a concept loss guided by Concept Activation Vectors. It combines Degree of Explicitness (DoE)-based data selection with augmentation of concept and random sets, enabling robust cross-dataset generalization while preserving explicit hate detection. Experiments on the Wiki, EA, and CH datasets show improvements in F1 and AUC over DoE baselines and ablations, and activation and gradient analyses indicate that the model's internal representations of implicit patterns are refined. The approach offers interpretable, adaptable hate-speech detection that remains effective as new hate forms emerge, with potential applicability to related socially sensitive tasks.

Abstract

The constant shifts in social and political contexts, driven by emerging social movements and political events, lead to new forms of hate content and previously unrecognized hate patterns that machine learning models may not have captured. Some recent literature proposes data augmentation-based techniques to enrich existing hate datasets by incorporating samples that reveal new implicit hate patterns. This approach aims to improve the model's performance on out-of-domain implicit hate instances. However, adding further samples for augmentation has been observed to decrease the model's performance. In this work, we propose a Knowledge Transfer-driven Concept Refinement method that distills and refines the concepts related to implicit hate samples through novel prototype alignment and concept losses, alongside data augmentation based on concept activation vectors. Experiments with several publicly available datasets show that incorporating additional implicit samples reflecting new hate patterns through concept refinement enhances the model's performance, surpassing baseline results while maintaining cross-dataset generalization capabilities.
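
The abstract relies on concept activation vectors (CAVs) without defining them. As a rough illustration only, the sketch below computes a CAV as the normalized difference between the mean activation of a concept set and that of a random set, then scores gradients against it. This mean-difference form is a common simplification and is an assumption here, not the paper's stated procedure; the function names `compute_cav` and `concept_sensitivity`, the activation dimension, and the synthetic data are all hypothetical.

```python
import numpy as np


def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Return a unit vector pointing from the random-set mean to the concept-set mean.

    Assumption: a mean-difference CAV, used here in place of a trained
    linear classifier's normal vector.
    """
    direction = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("concept and random activations have identical means")
    return direction / norm


def concept_sensitivity(grad_wrt_acts: np.ndarray, cav: np.ndarray) -> np.ndarray:
    """Directional derivative of the model output along the CAV, one value per example."""
    return grad_wrt_acts @ cav


# Synthetic stand-ins for layer activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim = 8
shift = np.zeros(dim)
shift[0] = 3.0  # the "concept" lives along the first activation dimension
concept_acts = rng.normal(size=(200, dim)) + shift
random_acts = rng.normal(size=(200, dim))

cav = compute_cav(concept_acts, random_acts)

# Stand-in gradients of a model output with respect to the same layer.
grads = rng.normal(size=(5, dim))
scores = concept_sensitivity(grads, cav)
```

A positive score means that moving the activation along the concept direction would increase the model output for that example. A concept loss could penalize or reward such scores, but the exact form used by KTCR is not given in this section.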

Paper Structure

This paper contains 23 sections, 4 equations, 4 figures, 5 tables, 2 algorithms.

Figures (4)

  • Figure 1: Comparison of the performance of traditional classifiers and the proposed KTCR method. Traditional classifiers often correctly identify explicit hate speech (e.g., "Go back to your country!") but fail to detect implicit hate speech (e.g., "They always bring disease"). The KTCR method, however, successfully identifies both explicit and implicit hate speech as hate, improving overall detection accuracy.
  • Figure 2: Overview of the Knowledge Transfer via Concept Refinement (KTCR) framework. The teacher model ($\mathbf{M_T}$), trained on explicit hate samples, guides the student model ($\mathbf{M_S}$) to refine its understanding of implicit hate through autoencoder-based activation mapping, prototype alignment, and concept loss. The autoencoder maps the teacher's activations ($g_t$) to the student's activations ($g_s$), while prototype alignment is used to refine the learned representations.
  • Figure 3: Activation and Gradient Norms Before Applying KTCR
  • Figure 4: Activation and Gradient Norms After Applying KTCR