Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment

Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida

TL;DR

Vision-language BIQA models often produce contradictory reasoning and unstable quality predictions. The authors analyze how final quality scores relate to basic visual features, inspect middle-layer inference with a logit-lens approach, and introduce a two-stage tuning strategy that grounds quality inference in fundamental visual descriptions. The approach yields reduced prediction instability and consistent SRCC/PLCC gains across SPAQ, KONIQ, LIVE, and CSIQ datasets, while providing deeper insights into inference dynamics via attention and latent-output analyses. This work advances interpretable, human-aligned BIQA by separating visual perception from quality inference and demonstrating tangible improvements in robustness and reasoning coherence.

Abstract

Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
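The metrics named in the abstract can be illustrated with a minimal sketch. SRCC is the Pearson correlation computed on ranks, and PLCC is the Pearson correlation on raw scores. The `instability_rate` function is a hypothetical definition (the share of images whose repeated queries do not all return the same quality label); the abstract does not state how the authors measure instability, so treat it as an assumption. All names and the toy data are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """PLCC: Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def instability_rate(repeated_labels):
    """ASSUMED definition: fraction of images whose repeated answers disagree."""
    unstable = sum(1 for labels in repeated_labels if len(set(labels)) > 1)
    return unstable / len(repeated_labels)


# Toy data (made up for illustration, not from the paper).
predicted = [1.0, 4.0, 9.0, 16.0]
human_mos = [1.0, 2.0, 3.0, 4.0]
answers = [["good", "good"], ["good", "poor"], ["fair", "fair"], ["poor", "poor"]]

srcc = spearman(predicted, human_mos)
plcc = pearson(predicted, human_mos)
rate = instability_rate(answers)
```

In this toy case the predictions are monotonic in the human scores but not linear, so SRCC is perfect while PLCC is slightly lower. That gap is why the two metrics are usually reported together.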

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview. The VLM-based BIQA model [qinstruct] exhibits contradictory reasoning and generates inconsistent answers across repeated queries. Motivated by this observation, we investigate how the visual features contribute to the final quality predictions. In addition, we inspect the VLM reasoning process by visualizing information dynamics within the decoder's intermediate layers. Based on our findings, we propose a two-stage tuning method that enables the model to produce more reasonable quality assessments while enhancing the stability of its predictions.
  • Figure 2: Both textual and visual inputs are projected into a shared semantic embedding space. The language and visual embeddings, denoted $\{L_{embed}, V_{embed}\}$, each have 4096 dimensions. Generating a single token requires passing through 32 decoder layers, across which the model sequentially refines its prediction. The latent embeddings can be decoded directly to visualize the intermediate processing; we refer to this as the latent output. Attention represents the relation between the current token and its preceding context. In BIQA tasks, the context consists of basic visual features such as degradations or color.
  • Figure 3: Latent Output. Decoding the latents from intermediate layers of the BIQA model [qinstruct], we visualize the four most probable candidates for the output token. At the 30th layer, the most probable token is "poor", while the final output token is "good". Such a switch is unremarkable for an LLM but does not match human reasoning. This image is from the CSIQ dataset [csiq] and has a quality score of 0.416 (where 1.0 represents the highest quality).
  • Figure 4:
  • Figure 5: Inference Visualization. This figure shows the inference of the $\langle quality\rangle$ token. The attention maps are computed and averaged across 720 samples, as described in Section 4.3.1. The left panel shows the one-stage tuning model [qinstruct, mplug-owl2], whose input prompt includes 65 image tokens. From token 71 onward, the generated tokens are primarily visual descriptions of the image. The right panel shows the inference process of the two-stage tuning model, where all tokens correspond to image quality descriptions. Compared to the one-stage tuning model, which focuses on a limited set of visual embedding tokens, the two-stage tuning model exhibits a broader attention distribution across important description-related tokens.
  • ...and 1 more figure
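
The latent-output decoding described in the captions for Figures 2 and 3 follows the logit-lens idea: project each intermediate layer's hidden state through the output (unembedding) matrix and read off the most probable tokens. The sketch below is a toy illustration under stated assumptions, not the authors' implementation. The two-word vocabulary, the 2-dimensional hidden states, and the identity unembedding matrix are all made up, and the final normalization layer that real decoders typically apply before unembedding is omitted for brevity.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed, vocab, k=2):
    """Return the top-k (token, probability) pairs for each layer.

    hidden_states: one hidden vector per decoder layer (toy values here).
    unembed: one weight row per vocabulary token (hypothetical matrix).
    """
    per_layer = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        ranked = sorted(zip(vocab, probs), key=lambda t: t[1], reverse=True)
        per_layer.append(ranked[:k])
    return per_layer


# Toy setup: an earlier layer leans toward "poor", the last layer toward "good".
vocab = ["poor", "good"]
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[2.0, 1.0], [1.0, 3.0]]

candidates = logit_lens(hidden_states, unembed, vocab)
top_tokens = [layer[0][0] for layer in candidates]
```

Here `top_tokens` changes between layers, a small-scale analogue of the late-layer switch from "poor" to "good" reported for Figure 3.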