The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould

TL;DR

The paper shows that the logit distribution of the first output token in large vision-language models (LVLMs) carries a strong signal about whether the model should respond to a prompt at all, which enables a simple, data-efficient decoding technique guided by linear probing. Evaluating multiple LVLMs on three tasks (unanswerable VQA, jailbreaking, deception) and comparing against CLIP baselines, it finds that the first-token logits encode hidden knowledge that degrades in the logits of subsequent tokens. Linear probing on the first token also improves three further tasks (uncertainty in math solving, hallucination mitigation, image classification), and finetuning on the same training data, while beneficial, still falls short of linear probing on these tasks. The approach offers a lightweight safety augmentation that can complement or substitute for heavy retraining, and it points to potential bias in existing datasets, since CLIP alone already provides a strong signal on these tasks.

Abstract

Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layers of LVLMs. We demonstrate that the logit distributions of the first tokens contain sufficient information to determine whether to respond to the instructions, including recognizing unanswerable visual questions, defending against jailbreaking attacks, and identifying deceptive questions. Such hidden knowledge is gradually lost in logits of subsequent tokens during response generation. Then, we illustrate a simple decoding strategy at the generation of the first token, effectively improving the generated content. In experiments, we find a few interesting insights: First, the CLIP model already contains a strong signal for solving these tasks, which indicates potential bias in the existing datasets. Second, we observe performance improvement by utilizing the first logit distributions on three additional tasks, including indicating uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply finetuning LVLMs improves models' performance but is still inferior to linear probing on these tasks.
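
The linear probing idea described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the "logit vectors" are synthetic Gaussian data (a real setup would take the first-token logits from an LVLM), the probe is a plain logistic regression trained by batch gradient descent, and the names `train_probe`, `predict_proba`, and `make_synthetic_logits` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that the prompt belongs to class 1 (e.g. 'should not answer')."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def make_synthetic_logits(n, dim, rng):
    """ASSUMPTION: synthetic stand-in for first-token logit vectors.

    Class 1 shifts the mean of the first four dimensions up, class 0 down,
    so the two classes are linearly separable with a little noise.
    """
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for j in range(4):
            x[j] += 1.5 if y == 1 else -1.5
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_synthetic_logits(200, 16, rng)
test_x, test_y = make_synthetic_logits(100, 16, rng)

w, b = train_probe(train_x, train_y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in test_x]
accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

The probe only needs one forward pass up to the first generated token, which is why the approach is cheap compared with finetuning. In practice a library classifier would likely replace the hand-written gradient descent loop.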

Paper Structure (27 sections, 3 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Scenarios where LVLMs may make undesirable responses. The first tokens are emphasized in italics for clarity. Note that although the first tokens are usually nondescript words like "the" and "1", we find that the logit vectors of them are actually very informative for determining the proper responses.
  • Figure 2: Illustration of a possible application by using our linear probing method. Given a text prompt and an image, we take the logit vector of the first token and feed it into the linear probing module (logistic regression for binary classification and Linear Discriminant Analysis for multi-way classification). The classifiers are trained on different tasks, such as answerable vs. unanswerable, correct vs. incorrect answers, etc.
  • Figure 3: Further analysis using linear probing. The y-axes represent the AUC or ACC differences between different settings and the first logit distribution. (a-b) The logit distributions of subsequent tokens show sub-optimal performance compared to the first token, while the last token shows competitive results. "<E>" is the special token indicating the end of generation. (c-d) We also train linear probing modules on the hidden states of the first generated tokens. The middle hidden states show better or comparable performance, whereas the last hidden states are usually sub-optimal.
  • Figure 4: A simple decoding strategy is to substitute the first token with a manually designed template, based on the results of linear probing. For tasks with short answers, such as image classification and answering yes-or-no questions, a straightforward candidate answer can be returned. When models are faced with unanswerable or deceptive questions, or jailbreaking attacks, we use various templates (colored in orange) ending with "because", and ask LVLMs to complete the responses.
  • Figure 5: Method comparison for predicting whether the answer to a math problem is correct. "SelfEval" means asking an LVLM to determine whether its own answer to a problem is correct. "SelfEval+LP" means using linear probing on the first logit distribution of self-evaluation. "OE+LP" indicates linear probing on the first logit distribution when LVLMs start generating the original answers. From (a) to (f): LLaVA, InstructBLIP, mPLUG-Owl, LLaMA-Adapter, MMGPT and MiniGPT4. We find "SelfEval+LP" and "OE+LP" have similar performance, and both are better than "SelfEval".
  • ...and 2 more figures
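
The decoding strategy in the Figure 4 caption (replace the first token with a template when the probe flags the prompt) might be sketched as follows. Everything here is an assumption for illustration: the template wording is invented (the caption only states that the templates end with "because"), `guided_decode` and its `threshold` parameter are hypothetical, and `fake_generate` is a stub standing in for a real LVLM generation call.

```python
from typing import Callable

# ASSUMPTION: illustrative templates. The source only says that the
# templates end with "because" and that the LVLM completes them.
TEMPLATES = {
    "unanswerable": "Sorry, I cannot answer this question because",
    "jailbreak": "Sorry, I cannot help with this request because",
    "deceptive": "The premise of this question is incorrect because",
}


def guided_decode(
    prompt: str,
    task: str,
    probe_score: float,
    generate: Callable[[str, str], str],
    threshold: float = 0.5,
) -> str:
    """Return a response, forcing a template prefix when the probe fires.

    probe_score is the linear probe's probability that the model should
    not answer directly; generate(prompt, prefix) continues the response
    after the given prefix.
    """
    if probe_score >= threshold:
        prefix = TEMPLATES[task]
        # The model completes the sentence started by the template.
        return prefix + " " + generate(prompt, prefix)
    # Probe sees no problem: decode as usual with no forced prefix.
    return generate(prompt, "")


def fake_generate(prompt: str, prefix: str) -> str:
    # Stub standing in for an LVLM; a real model would condition on
    # the image, the prompt, and the forced prefix.
    if prefix:
        return "the image does not contain the object you asked about."
    return "A red bus."


flagged = guided_decode("What colour is the cat?", "unanswerable", 0.9, fake_generate)
normal = guided_decode("What is in the image?", "unanswerable", 0.1, fake_generate)
print(flagged)
print(normal)
```

For short-answer tasks such as image classification or yes-or-no questions, the caption notes that a candidate answer can be returned directly, so the template step would not be needed there.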