Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara

TL;DR

This work tackles the visual perception gap in multimodal large language models by injecting a self-supervised visual learning signal, I-JEPA, into the visual-language alignment pipeline (LLaVA). By freezing vision encoders as context and target providers and placing a shallow predictor within the LLM, JARVIS learns latent visual regularities beyond captions, via a masked predictive loss balanced with caption-based training. Across multiple LLMs and visual encoders, it achieves consistent gains on vision-centric benchmarks (notably CVBench3D) while preserving general cognitive capabilities, and benefits further from scaling target encoders. The approach demonstrates the value of self-supervised visual supervision for improving MLLM visual reasoning in a resource-efficient, plug-in manner suitable for existing training pipelines.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
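
The abstract describes a masked predictive objective trained alongside the usual caption-based loss. The snippet below is a minimal sketch of how such an objective could be combined with a language-modeling loss. It is not the authors' implementation. The function names, the squared-error distance, and the `weight` balancing factor are all assumptions made for illustration; the actual choices are in the linked repository.

```python
import numpy as np


def masked_predictive_loss(predicted, target, target_mask):
    """Mean squared error computed only over patches inside the target blocks.

    predicted:   (num_patches, dim) embeddings from the predictor
                 (in the paper, the early layers of the LLM).
    target:      (num_patches, dim) embeddings from the frozen target encoder.
    target_mask: (num_patches,) booleans, True where a patch belongs to a
                 target block. The squared-error distance is an assumption.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    target_mask = np.asarray(target_mask, dtype=bool)
    if not target_mask.any():
        return 0.0  # no target patches selected, nothing to predict
    diff = predicted[target_mask] - target[target_mask]
    return float(np.mean(diff ** 2))


def combined_loss(caption_loss, predictive_loss, weight=1.0):
    """Caption loss plus a weighted predictive term (weight is hypothetical)."""
    return caption_loss + weight * predictive_loss


# Toy example: 4 patches with 2-dimensional embeddings, 2 of them masked.
predicted = np.zeros((4, 2))
target = np.ones((4, 2))
mask = np.array([True, False, True, False])
visual_term = masked_predictive_loss(predicted, target, mask)
total = combined_loss(caption_loss=2.0, predictive_loss=visual_term, weight=0.5)
```

Because the target encoder is frozen, only the predictor side would receive gradients from the visual term in a real training loop; the sketch above covers the loss arithmetic only.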

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison of LLaVA (top-left), a baseline that aligns the output of a selected layer with the output of a target encoder (top-right), and JARVIS, which aligns the outputs using a masked predictive loss (bottom-left). We also report the results of JARVIS and LLaVA across three vision benchmarks (bottom-right).
  • Figure 2: Overview of our JARVIS method. We leverage a single context block to predict the representations of multiple target blocks through a masked predictive loss, aligning the predicted embeddings of the LLM with the outputs of a target encoder.
  • Figure 3: Visualization of the attention mask implementation. Left: pseudo-code outlining the sequential steps used to compute the attention mask. Right: graphical illustration showing the effect of each step, highlighting how the mask modifies token attention.
  • Figure 4: Qualitative comparison of three training methods for MLLMs with Qwen2-7B [team2024qwen2]. We show samples from CVBench2D [tong2024cambrian] (Count and Relative, 1st and 2nd columns), CVBench3D [tong2024cambrian] (Depth and Distance, 3rd column), and Blink [fu2024blink] (last column).
  • Figure 5: Effect of scaling the target visual encoder on Vision-Centric benchmarks.
  • ...and 3 more figures