TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Deqing Fu; Tong Xiao; Rui Wang; Wang Zhu; Pengchuan Zhang; Guan Pang; Robin Jia; Lawrence Chen

TL;DR

TLDR introduces a token-level reward mechanism for vision-language models that provides per-token feedback $\gamma(e_k|m,p,d) \in [0,1]$, addressing the granularity gap of traditional binary rewards. A perturbation-based synthetic data pipeline generates hard negatives and token-level labels, enabling on-policy RLHF-like optimization and improved visual grounding. Empirical results show TLDR improves token-, sentence-, and response-level metrics over naive binary rewards, supports effective self-correction, and enables hallucination evaluation; it also accelerates human annotation by about 3x. The work demonstrates that token-level supervision can be automatically leveraged to tune backbone VLMs via token-level likelihood optimization, offering a practical path toward more grounded, safe, and data-efficient vision-language systems.

Abstract

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain coarse and carry minimal information. Notably, existing reward models mimic human annotations by assigning a single binary label to any text, no matter how long the text is. In multimodal language models, which must process both images and text, a naive reward model may learn implicit biases toward text and become less grounded in images. In this paper, we propose a $\textbf{T}$oken-$\textbf{L}$evel $\textbf{D}$etective $\textbf{R}$eward Model ($\textbf{TLDR}$) that provides fine-grained annotations for each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels, which we use to train TLDR models. We then show that TLDR models are useful both for helping off-the-shelf models self-correct their generations and as a hallucination evaluation tool. We show that training a TLDR model automatically performs token-level likelihood optimization and can significantly improve the base model's performance. Finally, we show that TLDR models can speed up human annotation by a factor of 3, making it easier to acquire a broader range of high-quality vision language data.

Paper Structure (31 sections, 9 equations, 6 figures, 11 tables)


Introduction
Related Work
Reinforcement Learning from Human Feedback and Reward Model.
Synthetic Data and Hard Negative Mining.
Large Vision Language Models and Evaluation.
Problem Setup
Token-Level Accuracy.
Sentence-Level Accuracy.
Response-Level Accuracy.
Synthetic Data Generation
Experiments
Training TLDR Models
Model Architecture.
Training.
Evaluation.
...and 16 more sections

Figures (6)

Figure 1: Token-Level Detective Reward (TLDR) Model. It can be used for hallucination detection and to prompt models to self-correct based on these detections. TLDR can also speed up human annotation when fixing slightly mistaken image captions, helping to create high-quality vision language data.
Figure 2: TLDR Model Architecture. For any instance with image $m$, prompt $p$, and response $d$, all three are passed together into the large VLM backbone $f$ without the language model head $\ell$. A shared reward model head $h$ is then applied to every token $e_k$ of the response $d$ to produce a binary prediction $\gamma(e_k)$ that determines whether $e_k$ is a good token or a bad token.
Figure 3: Level of Granularity in Hallucination Rate. Using the example from Figure 1, we can compute token-level hallucination rates following the token-level hallucination rate equation. Tokens are then grouped into sentences, which are separated by periods. A sentence with at least one bad token is highlighted as a bad sentence, and the sentence-level hallucination rate of a response is the proportion of bad sentences. Similarly, if a response contains at least one bad token, the entire response is a bad one. Hallucination rates are averaged over an entire evaluation set to determine the overall hallucination rates of a model.
Figure 4: The TLDR model can guide existing VLMs to self-correct their hallucinations when generating captions for images from WinoGround (Thrush et al., 2022).
Figure 5: There is a strong linear correlation between model performance (evaluated by MMMU) and the negative log hallucination rate $-\log(\mathcal{H}_T)$, which is a proxy for the model's negative log-likelihood of producing a correct token. The $p$-value of this linear correlation is 3.458e-4.
...and 1 more figures
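
The three granularities described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes each response comes with one label per token (1 for a good token, 0 for a bad one) and that a sentence ends at any token whose text ends with a period. The function names (`split_sentences`, `hallucination_rates`, `average_rates`) and the whitespace-level tokens are hypothetical choices for illustration, not the paper's implementation, and the token-level rate is taken here to be the fraction of bad tokens, which is an assumption about the paper's equation.

```python
from typing import List, Sequence, Tuple


def split_sentences(tokens: Sequence[str], labels: Sequence[int]) -> List[List[int]]:
    """Group per-token labels into sentences, closing a sentence at each period."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(label)
        if token.strip().endswith("."):
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a closing period form a final sentence
        sentences.append(current)
    return sentences


def hallucination_rates(tokens: Sequence[str], labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (token-level, sentence-level, response-level) rates for one response.

    Assumption: labels use 1 for a good token and 0 for a bad token.
    """
    if len(tokens) != len(labels) or not tokens:
        raise ValueError("tokens and labels must be non-empty and the same length")
    # Token level: fraction of bad tokens in the response.
    token_rate = sum(1 for label in labels if label == 0) / len(labels)
    # Sentence level: a sentence is bad if it contains at least one bad token.
    sentences = split_sentences(tokens, labels)
    sentence_rate = sum(1 for s in sentences if 0 in s) / len(sentences)
    # Response level: the whole response is bad if any token is bad.
    response_rate = 1.0 if 0 in labels else 0.0
    return token_rate, sentence_rate, response_rate


def average_rates(examples: Sequence[Tuple[Sequence[str], Sequence[int]]]) -> Tuple[float, ...]:
    """Average the three rates over an evaluation set of (tokens, labels) pairs."""
    per_example = [hallucination_rates(t, l) for t, l in examples]
    n = len(per_example)
    return tuple(sum(r[i] for r in per_example) / n for i in range(3))


tokens = ["A", "dog", "sits", ".", "It", "is", "red", "."]
labels = [1, 0, 1, 1, 1, 1, 1, 1]  # "dog" is marked as a bad token
print(hallucination_rates(tokens, labels))  # (0.125, 0.5, 1.0)
```

In this toy example a single bad token out of eight gives a token-level rate of 0.125, but it marks one of the two sentences and the whole response as bad, which shows how the coarser levels amplify a single error.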
