Table of Contents
Fetching ...

Soft-Label Integration for Robust Toxicity Classification

Zelei Cheng, Xian Wu, Jiahao Yu, Shuo Han, Xin-Qiang Cai, Xinyu Xing

TL;DR

This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution risk.

Abstract

Toxicity classification in textual content remains a significant problem. Data with labels from a single annotator fall short of capturing the diversity of human perspectives. Therefore, there is a growing need to incorporate crowdsourced annotations for training an effective toxicity classifier. Additionally, the standard approach to training a classifier using empirical risk minimization (ERM) may fail to address the potential shifts between the training set and testing set due to exploiting spurious correlations. This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution (OOD) risk. We theoretically prove the convergence of our bi-level optimization algorithm. Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy, confirming its effectiveness in leveraging crowdsourced annotations to achieve more effective and robust toxicity classification.

Soft-Label Integration for Robust Toxicity Classification

TL;DR

This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution risk.

Abstract

Toxicity classification in textual content remains a significant problem. Data with labels from a single annotator fall short of capturing the diversity of human perspectives. Therefore, there is a growing need to incorporate crowdsourced annotations for training an effective toxicity classifier. Additionally, the standard approach to training a classifier using empirical risk minimization (ERM) may fail to address the potential shifts between the training set and testing set due to exploiting spurious correlations. This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution (OOD) risk. We theoretically prove the convergence of our bi-level optimization algorithm. Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy, confirming its effectiveness in leveraging crowdsourced annotations to achieve more effective and robust toxicity classification.

Paper Structure

This paper contains 28 sections, 3 theorems, 28 equations, 8 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.4

Under Assumption assumption:smooth and Assumption assumption:lower_bound_grad and setting the step size $\mu \leq \frac{2k}{L}$, our bi-level optimization algorithm can ensure that the risk function $\mathcal{R}$ monotonically decreases with respect to the time step $t$, i.e., The equality in Eqn. equation:convergence holds if the gradient of the risk function $\mathcal{R}$ with respect to $w$ bec

Figures (8)

  • Figure 1: An example of a toxic response with the spurious feature "I must remind you that". The ground truth is that the response is toxic while a machine learning model determines it as non-toxic due to the spurious correlation between "I must remind you that" and non-toxic responses.
  • Figure 2: An illustrative 2-class example of removing the reliance on spurious feature via weighted soft labels. Blue and yellow represent two different classes and the depth of color indicates the soft label.
  • Figure 3: Comparison of our method with individual annotators on Q-A and R-A datasets. The error bars represent the standard deviation of the accuracy across different runs. Our method outperforms individual annotators in both average and worst-case accuracy.
  • Figure 4: Comparison of Average and Worst-Group Accuracy of Different Methods with Fewer Annotators. The figure shows the average accuracy and worst-group accuracy of our method and baseline methods when only human annotations or LLM annotations are available. Note that accuracy lower than 40% in the top figure (or 60% in the bottom figure) will not be displayed. Our method outperforms all baseline methods with fewer annotators.
  • Figure 5: Annotation examples of the question dataset. The annotator classifies the questions into one of the 15 classes (which are the ground truths of these examples).
  • ...and 3 more figures

Theorems & Definitions (6)

  • Theorem 3.4: Convergence
  • Theorem 3.5: Convergence rate
  • Lemma 1: sohrab2003basic
  • proof
  • proof
  • proof