A Unified Evaluation Framework for Multi-Annotator Tendency Learning

Liyun Zhang, Fengkai Liu, Xuanmeng Sha, Bowen Wang, Hong Liu, Zheng Lian

TL;DR

The paper addresses the lack of principled evaluation for Individual Tendency Learning (ITL) in multi-annotator settings. It proposes a unified framework with two metrics: Difference of Inter-annotator Consistency (DIC) to measure how well models preserve annotator tendency structures, and Behavior Alignment Explainability (BAE) to assess whether explanations reflect true annotator behaviors via Multidimensional Scaling (MDS). The framework is validated on AMER and STREET datasets across four ITL models, with QuMAB achieving the best performance on both tendency capture (lowest DIC) and explanatory alignment (highest BAE), and ablation studies confirming metric sensitivity. This work enables principled comparisons of ITL methods and sets the stage for incorporating richer behavioral signals into evaluation in the future.

Abstract

Recent work in multi-annotator learning has shifted focus from Consensus-oriented Learning (CoL), which aggregates multiple annotations into a single ground-truth prediction, to Individual Tendency Learning (ITL), which models annotator-specific labeling behavior patterns (i.e., tendencies) and provides explanations for understanding annotator decisions. However, no evaluation framework currently exists to assess whether ITL methods truly capture individual tendencies and provide meaningful behavioral explanations. To address this gap, we propose the first unified evaluation framework with two novel metrics: (1) Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies by comparing predicted inter-annotator similarity structures with the ground truth; (2) Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived similarity structures with ground-truth labeling similarity structures via Multidimensional Scaling (MDS). Extensive experiments validate the effectiveness of the proposed evaluation framework.
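
To make the DIC idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes that pairwise annotator similarity is measured with Cohen's kappa and that the ground-truth and predicted similarity matrices are compared with an unnormalized Frobenius norm, as the figure captions on this page suggest; the paper's exact definition, including any normalization, may differ. The function names `cohen_kappa`, `consistency_matrix`, and `dic` are illustrative.

```python
import numpy as np


def cohen_kappa(a, b):
    """Cohen's kappa between two label sequences over the same samples."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_o = np.mean(a == b)  # observed agreement
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in labels)
    if np.isclose(p_e, 1.0):
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1.0 - p_e)


def consistency_matrix(labels):
    """Map (n_annotators, n_samples) labels to a pairwise kappa matrix."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    K = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            K[i, j] = K[j, i] = cohen_kappa(labels[i], labels[j])
    return K


def dic(gt_labels, pred_labels):
    """Assumed DIC: Frobenius norm of the difference between the two matrices."""
    diff = consistency_matrix(gt_labels) - consistency_matrix(pred_labels)
    return float(np.linalg.norm(diff, ord="fro"))


# Toy data: annotators 0 and 1 agree, annotator 2 always disagrees with them.
gt = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
# A model that collapses every annotator onto the same predictions.
pred = [[0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1]]
print(dic(gt, gt), dic(gt, pred))
```

Under these assumptions, a model that reproduces the ground-truth agreement structure scores 0, while a model that erases the disagreement of one annotator is penalized, which matches the "lower is better" reading of DIC.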

Paper Structure

This paper contains 13 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Top: The shift in focus from Consensus-oriented Learning (CoL) to Individual Tendency Learning (ITL). A video sample $V$ is processed by multi-annotator models. CoL aggregates annotators' predictions into a single ground-truth prediction, whereas ITL models each annotator's labeling behavior pattern (i.e., tendency) and explains annotator decisions through the differing attention changes along video frames. Bottom: An evaluation framework for ITL. Difference of Inter-annotator Consistency (DIC) quantifies how well the model captures annotator tendencies by comparing the structure of predicted inter-annotator similarities with the ground truth. Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived similarity structures with ground-truth labeling similarity structures via Multidimensional Scaling (MDS).
  • Figure 2: Proposed evaluation framework for inter-annotator behavioral analysis. (a) Difference of Inter-annotator Consistency (DIC) quantifies how well a model preserves annotator tendencies by comparing ground-truth and predicted similarity matrices using the Frobenius norm. (b) Behavior Alignment Explainability (BAE) assesses whether model explanations capture true inter-annotator behavioral structures using a Multidimensional Scaling (MDS) projection. BAE is computed at two complementary levels: feature-level, based on learned annotator representations, and region-level, based on attention-derived focus regions (for attention-based models). Both levels measure alignment against the ground-truth consistency matrix.
  • Figure 3: Visualization of the Difference of Inter-annotator Consistency (DIC) via similarity matrices computed with Cohen’s kappa coefficient on the STREET dataset (safety perspective, 10 annotators). Darker colors indicate stronger agreement. Four representative models are compared with the ground truth. Lower DIC scores reflect better preservation of the underlying consistency structure. The vertical color bar denotes the similarity scale, ranging from 0 (no agreement) to 1 (perfect agreement).
  • Figure 4: Visualization of Behavior Alignment Explainability (BAE) via 2D projections of annotator representations using Multidimensional Scaling (MDS) on the STREET dataset (safety perspective). Alignment with the ground truth improves progressively as BAE scores increase. QuMAB-region provides a complementary view by incorporating attention-over-region patterns. Each point denotes an annotator, and proximity indicates higher behavioral similarity. Points of the same color denote clusters of annotators with strong agreement ($\kappa > 0.6$).
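
The 2D annotator projections described for Figure 4 can be approximated with a short sketch. This page does not say which MDS variant the paper uses, so the code below uses classical (Torgerson) MDS as a stand-in, and it does not compute the BAE score itself. The helper names `classical_mds` and `pairwise_distances` are illustrative, and turning a similarity matrix into dissimilarities as `1 - similarity` is an assumption, not something stated in the source.

```python
import numpy as np


def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: embed n points from an (n, n) distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]    # keep the `dim` largest
    vals = np.clip(eigvals[idx], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, idx] * np.sqrt(vals)


def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical similarity matrix for three annotators (values in [0, 1]).
S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
D = 1.0 - S                   # assumed similarity-to-dissimilarity conversion
coords = classical_mds(D, dim=2)
print(coords.shape)           # one 2D point per annotator
```

In such a projection, annotators with high mutual similarity land close together, which is the property the figure relies on when it reads proximity as behavioral similarity.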