Reconstructive Visual Instruction Tuning

Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang

TL;DR

Reconstructive Visual Instruction Tuning (ROSS) introduces vision-centric supervision by reconstructing input images via a denoising-based latent-token objective, enabling LMMs to better preserve visual detail and reduce hallucinations. By comparing pixel and latent reconstruction targets and adopting a diffusion-based denoiser with a continuous tokenizer, ROSS achieves robust improvements across benchmarks with a single visual encoder. The approach outperforms extrinsic expert ensembles in many settings and demonstrates transfer to depth perception tasks, suggesting strong generalization. Overall, ROSS shows that intrinsic, reconstruction-based supervision can enhance fine-grained multimodal comprehension while remaining efficient and scalable.

Abstract

This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS also supervises the visual outputs of LMMs by having them reconstruct input images. By doing so, it capitalizes on the inherent richness and detail present within the input images themselves, which are often lost under pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding direct regression of exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. Compared with state-of-the-art alternatives that rely on extrinsic assistance by aggregating multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
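
As a rough illustration of the denoising objective mentioned in the abstract, the following pure-Python sketch noises a vector of latent tokens and scores a denoiser by the mean squared error of its noise prediction. The linear beta schedule, the function names (`linear_alpha_bar`, `add_noise`, `denoising_loss`), and the flat-list token representation are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule itself is an assumption; the source does not specify one.
    """
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(z0, eps, alpha_bar_t):
    """Forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def denoising_loss(denoiser, z0, condition, t, alpha_bar, rng):
    """MSE between the sampled noise and the denoiser's predicted noise.

    `denoiser(z_t, t, condition)` is a hypothetical callable standing in for
    the noise predictor; `condition` stands in for the LMM's visual outputs.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = add_noise(z0, eps, alpha_bar[t])
    eps_hat = denoiser(zt, t, condition)
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(z0)
```

In this sketch, a denoiser whose condition carries enough information about the clean tokens can drive the loss toward zero, which is the intuition behind using reconstruction as a training signal for the visual outputs.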

Paper Structure

This paper contains 20 sections, 8 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Conceptual comparison between different pipelines. (a) Typical visual instruction tuning approaches (liu2023visual; liu2024improved) follow an LLM-centric design that solely leverages text supervision. (b) Aggregated visual instruction tuning alternatives (tong2024cambrian; tong2024eyes) leverage extrinsic assistance by combining several visual experts, which requires careful selection of those experts. (c) Our Ross, with a single visual encoder, e.g., CLIP (radford2021learning) or SigLIP (zhai2023sigmoid), adds vision-centric reconstructive supervision as intrinsic activation. In this way, LMMs are required to preserve every detail of input images, thereby enhancing multimodal comprehension capabilities and reducing hallucinations.
  • Figure 2: Overview of Ross. Given an input image and its corresponding text, Ross aims to supervise visual outputs by reconstruction.
  • Figure 3: Variants of Ross$^{\text{R}}$, where regression objectives are computed either on raw RGB values, in (a) and (c), or in a specific latent space determined by $\mathcal{F}$, in (b). We adopt MSE as $\mathcal{M}$ for pixel regression in (a) and (c), and cosine similarity for latent regression in (b).
  • Figure 4: Illustration of (a) the training procedure of Ross$^{\text{D}}$ and (b) the detailed architecture of the denoiser $\mathcal{J}_{\pi}$. (a) Ross$^{\text{D}}$ introduces visual guidance by denoising fine-grained visual tokens $\bm{z}_0$ conditioned on visual outputs $\bm{x}_{i \leq N}$. (b) The denoiser takes noisy tokens $\bm{z}_t$, the current timestep $t$, and conditions $\bm{x}_{i \leq N}$ as inputs and outputs the predicted noise $\hat{\epsilon}_t$. Each denoiser block consists of three linear projection layers and a standard self-attention block (vaswani2017attention).
  • Figure 5: Pixel Regression vs. Latent Regression. The teacher tokenizer $\mathcal{F}$ for Ross$^{\text{R}}$-Latent is the encoder of a continuous VAE (kingma2013auto) provided by rombach2022high, while its decoder serves as $\mathcal{F}^{-1}$ for Ross$^{\text{R}}$-Latent2Pixel. Our vision-centric reconstructive supervision surpasses the visual instruction tuning baseline in most cases. Among the three regression variants, Ross$^{\text{R}}$-Latent performs the best, as it avoids explicitly regressing redundant raw RGB values.
  • ...and 8 more figures
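
The two measurements $\mathcal{M}$ named in the Figure 3 caption (MSE for pixel regression, cosine similarity for latent regression) can be sketched in plain Python as follows. Turning cosine similarity into a loss as `1 - cos` is an assumption, as are the function names and the flat-list representation of targets; the source only states which measurement is used in each variant.

```python
import math


def mse(pred, target):
    """Mean squared error, as used here for the pixel-regression variants."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(pred, target, eps=1e-8):
    """1 - cosine similarity, an assumed loss form for latent regression.

    `eps` guards against division by zero when either vector is all zeros.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / max(norm, eps)
```

One difference worth noting: `mse` penalizes any deviation in magnitude, while `cosine_distance` depends only on direction, so a prediction that is a positive multiple of the target incurs no cosine penalty.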