Same Words, Different Judgments: Modality Effects on Preference Alignment

Aaron Broukhim, Nadir Weibel, Eshin Jolly

TL;DR

A controlled cross-modal study of human and synthetic preference annotations compares text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, and the study presents the first ICC-based reliability characterization in the preference annotation literature for either modality.

Abstract

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) $\approx$ .80) at $\sim$9 raters -- the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.

Paper Structure (27 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Interface for the audio preference-selection task. It shows the conversation history, the most recent query, and two responses. Raters score each response from 1 to 100 on response quality (the content of the answer) and audio quality (the clarity and naturalness of speech). We also collect discrete preferences, with the option to declare a tie, and an optional free-response justification.
  • Figure 2: Cumulative distribution of the mean per-prompt rating gap on trials where raters declared a preference (A or B rather than Tie).
  • Figure 3: Cross-modality agreement between audio and text evaluations as a function of raw rating gap threshold. The blue line shows agreement percentage when both modalities produce a decisive winner (excluding ties), while the red line shows agreement when tie-versus-decisive outcomes are counted as disagreements. Green bars indicate the number of prompts where both modalities had a decisive winner at each threshold.
  • Figure 4: ICC(2, k) as a function of the number of raters, derived from variance components of a crossed random-effects model. Shaded bands indicate conservative interpretation thresholds per Koo and Li (2016). The observed average number of raters per stimulus was $k \approx 9.2$ (audio) and $k \approx 8.9$ (text).
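
The relationship plotted in Figure 4 can be illustrated with a minimal sketch. It assumes the standard variance-components form of the average-measures, absolute-agreement coefficient, ICC(2, k) = σ²_stimulus / (σ²_stimulus + (σ²_rater + σ²_residual) / k); the paper's exact model specification is not given here. The function names and the variance values are hypothetical, chosen only for illustration, and are not results from the study.

```python
def icc2k(var_stimulus: float, var_rater: float, var_residual: float, k: int) -> float:
    """Average-measures ICC(2, k) from variance components.

    Assumes a crossed random-effects decomposition into stimulus, rater,
    and residual variance (a standard textbook form, not necessarily the
    authors' exact model).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if var_stimulus <= 0:
        raise ValueError("stimulus variance must be positive")
    noise = var_rater + var_residual
    return var_stimulus / (var_stimulus + noise / k)


def raters_needed(var_stimulus: float, var_rater: float, var_residual: float,
                  target: float, max_k: int = 1000) -> int:
    """Smallest k whose ICC(2, k) reaches `target`, searched up to max_k."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    for k in range(1, max_k + 1):
        if icc2k(var_stimulus, var_rater, var_residual, k) >= target:
            return k
    raise ValueError("target not reached within max_k raters")


# Hypothetical variance components, for illustration only.
VAR_STIMULUS, VAR_RATER, VAR_RESIDUAL = 1.0, 0.5, 1.5

# Reliability curve over rater counts, analogous in shape to Figure 4.
curve = {k: icc2k(VAR_STIMULUS, VAR_RATER, VAR_RESIDUAL, k) for k in range(1, 16)}
for k, value in curve.items():
    print(f"k={k:2d}  ICC(2,k)={value:.3f}")
```

Because the rater and residual variance are divided by k, the curve rises steeply for the first few raters and then flattens, which is why a reliability target is typically reached at a moderate panel size rather than requiring dozens of raters per stimulus.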