Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment
Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida
TL;DR
Vision-language BIQA models often produce contradictory reasoning and unstable quality predictions. The authors analyze how final quality scores relate to basic visual features, inspect intermediate-layer inference with a logit-lens approach, and introduce a two-stage tuning strategy that grounds quality inference in fundamental visual descriptions. The approach reduces prediction instability and yields consistent SRCC/PLCC gains on the SPAQ, KONIQ, LIVE, and CSIQ datasets, while attention and latent-output analyses give further insight into inference dynamics. By separating visual perception from quality inference, the work moves BIQA toward more interpretable, human-aligned reasoning with improved robustness and coherence.
Abstract
Recent progress in blind image quality assessment (BIQA) has been driven by vision-language models (VLMs), whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can fluctuate during inference; neither behavior is aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both the stability and the reliability of the inference process.
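For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SRCC (Spearman rank-order correlation) and PLCC (Pearson linear correlation) between predicted scores and human opinion scores can be computed. It is not the authors' evaluation code: the function names and the toy data are illustrative assumptions, and the nonlinear score mapping that some BIQA evaluations apply before PLCC is omitted here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: model predictions and mean opinion scores (MOS).
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
mos = [1.0, 4.0, 9.0, 16.0, 25.0]

srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
print(f"SRCC={srcc:.4f}  PLCC={plcc:.4f}")
```

In this toy case the predictions order the images exactly as the opinion scores do, so SRCC is at its maximum, while PLCC is lower because the relationship is monotonic but not linear. SRCC therefore reflects ranking agreement and PLCC reflects linear agreement, which is why the two are reported together.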
