Token-Level Marginalization for Multi-Label LLM Classifiers

Anjaneya Praharaj, Jaykumar Kasundra

TL;DR

Generative LLMs used for multi-label content safety classification do not expose per-label confidence scores, which hinders thresholding and error analysis. The authors propose a token-level probability framework that derives category-level confidence from autoregressive outputs using conditional, joint, and marginal estimation strategies, coupled with constrained decoding. On a synthetic, rigorously annotated dataset, the marginal probability approach yields the strongest F1 and AUC, outperforming the conditional and joint methods and the baselines. The results indicate that interpretable, calibrated confidence can be extracted from generative models and that the framework generalizes to instruction-tuned variants, enabling finer-grained moderation decisions.

Abstract

This paper addresses the critical challenge of deriving interpretable confidence scores from generative large language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and to evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: We explore Conditional, Joint, and Marginal probability-based approaches to estimate model confidence. The category labels (e.g., S1, S3, etc.) correspond to classes defined in the LLaMA Guard taxonomy and are treated as tokens for simplicity.
  • Figure 2: An overview of the synthetic data generation pipeline used for generating the evaluation data. The models employed in this process include Qwen/QwQ-32B [qwq_huihui], Meta-Llama/Llama-3.3-70B-Instruct [llama3_huihui], and Microsoft/Phi-3-mini-128k-Instruct [dolphin_phi3]. Abliterated versions of these models were utilized to enable the generation of unsafe and offensive content.
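
The three estimation strategies named in the Figure 1 caption can be illustrated on a toy example. The sketch below is not the authors' implementation. The definitions it uses are assumptions made for illustration: the conditional score is the probability of a category token given the tokens decoded before it, the joint score is the probability of the decoded prefix up to and including that token, and the marginal score is the total probability of all output sequences that contain the category. The distribution `toy_dist`, the helper names, and the probabilities are hypothetical. As in the caption, each category (S1, S3) is treated as a single token.

```python
from typing import Dict, Tuple

Seq = Tuple[str, ...]

# Hypothetical distribution over complete outputs of a guard-style model.
# The empty tuple stands for a "safe" output with no category listed.
toy_dist: Dict[Seq, float] = {
    ("S1", "S3"): 0.5,
    ("S1",): 0.2,
    ("S3",): 0.1,
    (): 0.2,
}


def prefix_prob(dist: Dict[Seq, float], prefix: Seq) -> float:
    """Probability that the generated output starts with `prefix`."""
    n = len(prefix)
    return sum(p for seq, p in dist.items() if seq[:n] == prefix)


def conditional_scores(dist: Dict[Seq, float], decoded: Seq) -> Dict[str, float]:
    """Assumed definition: P(label at step t | labels decoded before step t)."""
    return {
        label: prefix_prob(dist, decoded[: t + 1]) / prefix_prob(dist, decoded[:t])
        for t, label in enumerate(decoded)
    }


def joint_scores(dist: Dict[Seq, float], decoded: Seq) -> Dict[str, float]:
    """Assumed definition: probability of the decoded prefix ending at the label."""
    return {
        label: prefix_prob(dist, decoded[: t + 1])
        for t, label in enumerate(decoded)
    }


def marginal_scores(dist: Dict[Seq, float], labels: Seq) -> Dict[str, float]:
    """Assumed definition: total probability of all outputs containing the label."""
    return {
        label: sum(p for seq, p in dist.items() if label in seq)
        for label in labels
    }


decoded: Seq = ("S1", "S3")  # the output the model actually produced
print("conditional:", conditional_scores(toy_dist, decoded))
print("joint:      ", joint_scores(toy_dist, decoded))
print("marginal:   ", marginal_scores(toy_dist, decoded))
```

In this toy case the three scores for S3 differ: roughly 0.71 conditional, 0.5 joint, and 0.6 marginal. The marginal score also counts the output that lists S3 alone, which the other two ignore because they follow only the decoded path. With a real model the full output distribution is not enumerable, so these quantities would have to be computed or approximated from token logits, which this sketch does not attempt.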