ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele

Abstract

Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
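
As a concrete illustration of the guidance signal described above, the following minimal sketch scores a set of candidate captions against an image with a pre-trained CLIP model and keeps the best-aligned one as the self-supervision target. The checkpoint name and the `select_pseudo_label` helper are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch: CLIP-based selection of a pseudo-label caption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def select_pseudo_label(image: Image.Image, candidates: list[str]) -> str:
    """Return the candidate caption best aligned with the image under CLIP."""
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    # logits_per_image has shape (1, n_candidates): one alignment score
    # per candidate caption for the single test image.
    logits = model(**inputs).logits_per_image
    return candidates[logits.argmax(dim=-1).item()]
```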

Paper Structure

This paper contains 14 sections, 8 equations, 12 figures, 14 tables, and 1 algorithm.

Figures (12)

  • Figure 1: ClipTTT improves robustness and reduces hallucinations under corruptions. Left: LVLMs suffer from hallucinations. The issue worsens under corruption, indicating that degradation leads to unreliable generation. Right: We benchmark hallucinations with CHAIR [rohrbach2018object] across the 15 corruptions adopted in [hendrycks2019benchmarking]. Naïvely applying CLIP for test-time response selection may even degrade performance. In contrast, performing ClipTTT at test time consistently reduces the hallucinations of the vanilla models ($\text{CHAIR}_\text{S}\downarrow$ and $\text{CHAIR}_\text{I}\downarrow$).
  • Figure 2: Left: Hallucinations increase with corruption severity, indicating that stronger degradations reduce generation reliability (example corruption: Zoom Blur). Right: Captions with higher image-text alignment (CLIP Score) tend to be more factually correct. Results are averaged over 15 corruption types. Both experiments use LLaVA-v1.5-7B [liu2024improved].
  • Figure 3: Overview of our ClipTTT framework. For each single corrupted test input, we employ a student-teacher framework for on-the-fly adaptation. (1) The Teacher model generates $n$ diverse caption candidates via sampling. (2) An external CLIP model scores each candidate, and the one with the highest visual-semantic alignment is selected as the pseudo-label. (3) The Student model is trained for one step on this pseudo-label, with gradients updating only its parameter-efficient LoRA weights. (4) The Teacher's LoRA weights are then updated via an exponential moving average (EMA) of the Student's, ensuring a stable yet progressively improving training target. (A code sketch of this loop follows this list.)
  • Figure 4: Left: CLIP Score distribution comparison. Kernel density estimates of CLIP Scores for captions of Clean images (upper bound), corrupted images with baseline Greedy decoding, and corrupted images after applying Our Method (ClipTTT). Our method shifts the degraded distribution significantly towards the clean distribution. Right: Improvement trend by baseline score. The average improvement provided by ClipTTT is higher when the initial baseline caption has a low-to-medium quality score. The method adaptively reduces its intervention as the baseline quality improves.
  • Figure 5: Left: Ablation on corruption severity. Sev. 1 and Sev. 3 CHAIR scores for Greedy Decoding, training-free test-time methods (VAP [zhang2025poison], VCD [leng2024mitigating]), and ClipTTT. Right: Inference time vs. performance tradeoff. Tradeoff curves of ClipTTT at different iteration counts, compared against other test-time approaches.
  • ...and 7 more figures
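
Putting the four stages of Figure 3 together, here is a hedged sketch of a single ClipTTT adaptation step in PyTorch. `teacher` and `student` stand for two LVLM copies carrying LoRA adapters, and `generate` and `caption_loss` are hypothetical interfaces (the paper does not publish this exact API); `select_pseudo_label` is the CLIP helper sketched after the abstract. Only the LoRA parameters receive gradient and EMA updates.

```python
import torch

def clip_ttt_step(student, teacher, image, prompt, optimizer,
                  n_candidates=5, ema_decay=0.99):
    """One on-the-fly adaptation step for a single corrupted test image."""
    # (1) The teacher samples n diverse caption candidates.
    with torch.no_grad():
        candidates = [teacher.generate(image, prompt, do_sample=True)
                      for _ in range(n_candidates)]

    # (2) An external CLIP model picks the best-aligned candidate
    #     as the pseudo-label (see select_pseudo_label above).
    pseudo_label = select_pseudo_label(image, candidates)

    # (3) One gradient step on the student; the optimizer is assumed
    #     to be built over the student's LoRA parameters only.
    loss = student.caption_loss(image, prompt, target=pseudo_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # (4) EMA update of the teacher's LoRA weights from the student's,
    #     keeping the teacher a stable but slowly improving target.
    with torch.no_grad():
        student_params = dict(student.named_parameters())
        for name, t_param in teacher.named_parameters():
            if "lora" in name:  # adapt only parameter-efficient weights
                t_param.mul_(ema_decay).add_(student_params[name],
                                             alpha=1.0 - ema_decay)

    return pseudo_label, loss.item()
```

The number of candidates and the EMA decay above are placeholder values; the paper's own hyperparameters may differ.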