Q-Tacit: Image Quality Assessment via Latent Visual Reasoning

Yuxuan Jiang, Yixuan Li, Hanwei Zhu, Siyue Teng, Fan Zhang, David Bull

Abstract

Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, "Is natural language the ideal space for quality reasoning?" and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.

Paper Structure (28 sections, 5 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Motivation and overview of Q-Tacit. We compare Q-Tacit and Q-Insight [li2025qinsight], which perform quality reasoning in a latent space and a text space, respectively. <|lvr_start|> and <|lvr_end|> encapsulate the latent reasoning process, and <|lvr|> is a placeholder for a latent slot token that allocates one latent reasoning step. Q-Tacit excels at image quality scoring on out-of-distribution datasets (Right) while requiring only about 10% of the visible tokens used by Q-Insight (Left).
  • Figure 2: Overview of Q-Tacit. We introduce a compact latent quality-reasoning segment wrapped by the special tokens <|lvr_start|> and <|lvr_end|>, where the model propagates latent embeddings (i.e., internal hidden states) as a compact quality space instead of generating textual rationales. Once a stopping criterion is met, the model directly outputs the final quality score between <answer> and </answer>.
  • Figure 3: Example training data for Q-Tacit. The left part shows the data mixture used to construct the latent space in Stage I. The Bboxes field specifies the ROI used to supervise latent reconstruction. <lvr> is a placeholder that indicates where the latent segment should be inserted; in practice it expands to a latent span of the form <|lvr_start|> <|lvr|> ... <|lvr|> <|lvr_end|>. The right part shows the quality-aligned calibration data for latent quality reasoning in Stage II.
  • Figure 4: Token-space coupling and attention weight distribution during quality scoring: Q-Insight vs. Q-Tacit. Left/Right: t-SNE visualization of visual and reasoning tokens for Q-Insight/Q-Tacit. Middle: normalized attention weights over the image and reasoning tokens after the quality score is generated. Q-Tacit allocates a similar share of attention to reasoning with only 11 reasoning tokens (vs. 112 for Q-Insight).
  • Figure 5: Sensitivity to localized distortions. We compare Q-Insight, VisualQuality-R1, and Q-Tacit by reporting predicted quality scores for each image (w/o distorted patch, top) and its counterpart with a localized corruption (w/ distorted patch, bottom). Blur, noise, and compression are injected within the red box (left to right). Q-Tacit consistently lowers its predicted scores in response to the local distortion.
  • ...and 1 more figure
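
The placeholder expansion described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_lvr`, the `num_slots` parameter, and the example prompt are hypothetical, and the number of latent slots used in practice is not stated here. Only the token strings are taken from the captions.

```python
# Special tokens as written in the figure captions.
LVR_START = "<|lvr_start|>"
LVR_SLOT = "<|lvr|>"
LVR_END = "<|lvr_end|>"
PLACEHOLDER = "<lvr>"  # marks where the latent segment is inserted


def expand_lvr(text: str, num_slots: int) -> str:
    """Replace each <lvr> placeholder with a latent span.

    num_slots is an assumed, illustrative parameter: each <|lvr|> slot
    allocates one latent reasoning step.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")
    span = LVR_START + LVR_SLOT * num_slots + LVR_END
    return text.replace(PLACEHOLDER, span)


# Hypothetical training sample with one placeholder.
sample = "Rate the quality of this image. <lvr> <answer>3.8</answer>"
expanded = expand_lvr(sample, num_slots=4)
print(expanded)
```

Because `<lvr>` and `<|lvr|>` are distinct strings, running the expansion on already-expanded text leaves it unchanged.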