Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna

TL;DR

Perception Tokens introduce intrinsic visual representations as auxiliary reasoning steps for multimodal language models, enabling depth estimation and bounding-box reasoning without relying on external tools. The Aurora training framework tokenizes depth maps via a VQVAE and encodes bounding boxes as structured tokens, then distills these representations into the MLM's reasoning process through multi-task curriculum learning. Empirical results show state-of-the-art performance on 3D relative depth (BLINK and HardBLINK) and 2D counting (CV-Bench, SEED-Bench, BLINK), with notable generalization across tasks. The approach improves interpretability and scalability of visual reasoning in MLMs, offering a path toward broader perceptual capabilities without extensive finetuning or auxiliary modules.

Abstract

Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet MLMs cannot produce intermediate depth maps or boxes to reason over. Finetuning MLMs on relevant data doesn't generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps, into a tokenized format, and represents bounding boxes as tokens; both are then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CV-Bench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves on relative depth: over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
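
The tokenization step mentioned in the abstract can be illustrated with a minimal sketch of the quantization at the heart of a VQVAE: each latent vector is replaced by the index of its nearest codebook entry, and the indices are written out as discrete tokens. This is not the authors' implementation. The function names, the toy codebook, and the `<DEPTH_START>`, `<DEPTH_i>`, `<DEPTH_END>` token strings are assumptions made for illustration, and the encoder that would produce the latents from a depth map is omitted.

```python
from typing import List, Sequence


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def depth_latents_to_tokens(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[str]:
    """Quantize each latent vector and render the indices as token strings.

    The token names are hypothetical placeholders, not the paper's vocabulary.
    """
    ids = [nearest_code(v, codebook) for v in latents]
    return ["<DEPTH_START>"] + [f"<DEPTH_{i}>" for i in ids] + ["<DEPTH_END>"]


# Toy example: three 2-D latent vectors and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]
print(depth_latents_to_tokens(latents, codebook))
```

In a setup like this, the resulting token sequence could be placed in the model's output before the final answer, so that the depth map plays the role of an intermediate reasoning step.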

Paper Structure

This paper contains 42 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: We introduce Perception Tokens, intermediate reasoning tokens that allow MLMs to go beyond using language in reasoning. With it, we develop Aurora, a framework that trains multimodal language models to leverage visual perception tokens, allowing them to use depth estimation and bounding box predictions while reasoning.
  • Figure 2: We demonstrate relative depth estimation and counting questions where LLaVA fails. In contrast, by learning to utilize visual perception tokens as intermediate reasoning steps, LLaVA-Aurora successfully completes these tasks requiring perceptual understanding.
  • Figure 3: The overall Aurora training framework. We first learn visual perception tokens using VQVAE. We then finetune MLMs with a multi-task training approach where we distill intrinsic image representations (e.g., depth map) into MLMs by training them to decode the visual tokens as intermediate reasoning steps towards completing the tasks.
  • Figure 4: Depth maps generated by Aurora are imperfect but resemble the ground truths from Depth Anything (Yang et al., 2024).
  • Figure 5: Qualitative comparison of predicted depth maps with and without reconstruction loss.
  • ...and 2 more figures