
Token Warping Helps MLLMs Look from Nearby Viewpoints

Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
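The backward token warping described above (define a dense grid on the target view, then retrieve one source-view token per grid point) can be sketched in a few lines. The sketch below is an illustrative assumption about the mechanism, not the paper's actual implementation: `backward_warp_tokens` and its interface are hypothetical names, and it uses a simple nearest-neighbor fetch with a pinhole camera model at token resolution.

```python
import numpy as np

def backward_warp_tokens(src_tokens, tgt_depth, K, R, t):
    """Hedged sketch of backward token warping: for each cell of a dense
    grid on the TARGET view, unproject with target-view depth, transform
    into the source camera frame, reproject, and fetch the nearest
    source-view token intact (no new pixels are synthesized).

    src_tokens : (H, W, D) source-view token grid (e.g. ViT patch tokens)
    tgt_depth  : (H, W) depth at token resolution, in the target view
    K          : (3, 3) camera intrinsics at token resolution
    R, t       : rotation (3, 3) and translation (3,), target -> source
    """
    H, W, D = src_tokens.shape
    # Pixel-center coordinates of every target grid cell.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3)
    # Unproject target grid points to 3D using target-view depth.
    pts = (pix @ np.linalg.inv(K).T) * tgt_depth[..., None]   # (H, W, 3)
    # Move points into the source camera frame and reproject.
    pts_src = pts @ R.T + t
    proj = pts_src @ K.T
    uv = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)
    # Nearest-neighbor fetch: one intact source token per target cell.
    j = np.clip(np.round(uv[..., 0] - 0.5).astype(int), 0, W - 1)
    i = np.clip(np.round(uv[..., 1] - 0.5).astype(int), 0, H - 1)
    return src_tokens[i, j]
```

Because whole tokens are fetched rather than resampled pixels, each retrieved token stays semantically intact; under an identity pose (R = I, t = 0) the warp returns the source grid unchanged.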

Paper Structure

This paper contains 48 sections, 14 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Viewpoint Change via Token Warping. We explore token warping as a means of enabling viewpoint changes for MLLMs and find that backward token warping can reliably transfer source image content to novel viewpoints without synthesizing new pixels.
  • Figure 2: Image Tokenization in MLLMs. MLLMs process images by dividing them into fixed-size patches, embedding each patch, and passing them through a vision encoder (e.g., ViT) to obtain image tokens.
  • Figure 3: Limitations of Pixel-Wise Warping. Pixel-wise warping to a target viewpoint often introduces local distortions and semantic degradation. In both forward (top) and backward (bottom) warping, the book from the source view appears significantly distorted after transformation (in the red box).
  • Figure 4: Pixel-Wise vs. Token Warping. Comparison of inverse warping strategies. (A) Pixel-wise warping retrieves pixels for each target coordinate, but patchifying the warped image introduces local distortions, resulting in degraded MLLM understanding. (B) Token warping directly retrieves intact tokens (or patches) from the source view, preserving semantics and improving viewpoint-aware perception.
  • Figure 5: Fetching Position Noise Sensitivity. Through a toy experiment on CV-Bench-2D (Tong et al., 2024), where we emulate local positional perturbations and degradation introduced by warping, we find that token representations in MLLMs are highly robust to noise in the image positions from which tokens are fetched. This suggests that tokens are well suited for representing viewpoint changes.
  • ...and 8 more figures