Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

Roy H. Jennings, Genady Paikin, Roy Shaul, Evgeny Soloveichik

TL;DR

This work challenges prevailing MLLM-fine-tuning strategies for image-based regression by showing that preset vocabularies and generic prompts offer no advantage over image-only training. It introduces Regression via Transformer-Based Classification (RvTC), a simple, scalable bin-based regression framework that converts regression into classification and benefits from increasing bin counts. Crucially, data-specific prompts containing image-relevant semantic information significantly unlock cross-modal reasoning, boosting performance beyond image-only baselines and achieving state-of-the-art results on AVA and AGIQA-3k across multiple backbones. The findings demonstrate robust semantic understanding in MLLMs for regression tasks and highlight the importance of semantic prompt design for practical cross-modal grounding.

Abstract

Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., "How would you rate this image?"), assuming this mimics human rating behavior. Our analysis reveals that these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts substantially improves our already state-of-the-art image-only baseline. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information, surpassing mere statistical biases. We validate RvTC across two different MLLM architectures, demonstrating consistent improvements and method generalizability.
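The bin-based idea described in the abstract, turning a continuous score into a class label and back, can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform bin layout, the argmax decoding rule, and the function names `score_to_bin`, `bin_center`, and `decode_argmax` are assumptions made here for illustration, and the 1 to 10 score range in the usage note is only an example.

```python
def score_to_bin(score, lo, hi, num_bins):
    """Map a continuous score in [lo, hi] to a class index in [0, num_bins - 1].

    Assumes equal-width bins; the paper's actual binning may differ.
    """
    width = (hi - lo) / num_bins
    idx = int((score - lo) / width)
    # Clamp so that score == hi lands in the last bin rather than overflowing.
    return min(max(idx, 0), num_bins - 1)


def bin_center(idx, lo, hi, num_bins):
    """Return the midpoint of bin `idx`, used as the regression output."""
    width = (hi - lo) / num_bins
    return lo + (idx + 0.5) * width


def decode_argmax(logits, lo, hi):
    """Turn classifier logits (one per bin) into a score via the top bin's center.

    Argmax decoding is an assumption for this sketch, not a claim about RvTC.
    """
    num_bins = len(logits)
    best = max(range(num_bins), key=lambda i: logits[i])
    return bin_center(best, lo, hi, num_bins)
```

For example, with nine bins over a 1 to 10 range, `score_to_bin(5.5, 1.0, 10.0, 9)` gives class 4, whose center is 5.5. Under this layout the round-trip error is at most half a bin width, which is one way to see why increasing the bin count reduces discretization error.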

Paper Structure

This paper contains 15 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Existing MLLM methods using preset vocabulary and generic prompts (left) achieve 0.82 correlation on AVA. Our image-only RvTC model (center) exceeds this with state-of-the-art correlation of 0.83. Integrating data-specific prompts (e.g., "Outdoor Macro Shot") during fine-tuning (right) unlocks MLLM cross-modal reasoning, yielding a new state-of-the-art 0.90.
  • Figure 2: Effect of bin count on quantization noise, measured as (SRCC + PLCC)/2 between AVA ground-truth mean opinion score (MOS) values and their corresponding bin centers.
  • Figure 3: Performance analysis of RvTC on AVA, comparing image-only RvTC (red) with RvTC+, which incorporates image titles (blue): per-challenge average MOS predictions (top) and intra-challenge correlation of predictions (bottom). The results imply that the model is leveraging cross-modal features.
  • Figure 4: Performance of image-only RvTC (top) and RvTC+ with challenge titles (bottom) on AVA for different numbers of bins and training lengths.
  • Figure 5: Performance of RvTC fine-tuned on AGIQA-3k when evaluated with the original prompt and with an alternative prompt.
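
The quantization-noise measure named in the Figure 2 caption, (SRCC + PLCC)/2 between ground-truth scores and their bin centers, can be sketched in plain Python. This is a hedged illustration only: the equal-width bins, the helper names, and the synthetic uniformly distributed scores are assumptions introduced here, and no AVA data or paper results are reproduced.

```python
import math
import random


def pearson(x, y):
    """Pearson linear correlation (PLCC); returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def quantize(score, num_bins, lo, hi):
    """Replace a score by the center of its equal-width bin (assumed layout)."""
    width = (hi - lo) / num_bins
    idx = min(max(int((score - lo) / width), 0), num_bins - 1)
    return lo + (idx + 0.5) * width


def quantization_score(scores, num_bins, lo, hi):
    """(SRCC + PLCC) / 2 between the scores and their bin centers."""
    centers = [quantize(s, num_bins, lo, hi) for s in scores]
    return (spearman(scores, centers) + pearson(scores, centers)) / 2


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in scores on a 1 to 10 scale; NOT real AVA annotations.
    scores = [rng.uniform(1.0, 10.0) for _ in range(500)]
    for bins in (2, 5, 10, 50, 100):
        print(bins, round(quantization_score(scores, bins, 1.0, 10.0), 4))
```

A value near 1 means the bin centers preserve both the ordering and the linear relationship of the original scores, so binning alone places little ceiling on the correlation a classifier can reach.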