ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

Bozhou Li, Wentao Zhang

TL;DR

The paper analyzes the bottlenecks of dynamic high-resolution adaptation in vision–language models caused by RoPE's long-range decay and the growth of token-to-token distances. It introduces ID-Align, a position ID remapping technique that assigns high-resolution image tokens the same IDs as their corresponding thumbnail tokens, with an interpolation-based mapping to prevent runaway ID magnitudes. The authors provide theoretical and empirical analysis of RoPE's long-range behavior and demonstrate that ID-Align improves cross-resolution correspondence and text–image interaction, achieving notable gains on benchmarks such as MMBench. This approach enhances fine-grained perception and global-context integration in multimodal reasoning, offering a practical method to bolster VLMs under dynamic high-resolution regimes.

Abstract

Currently, a prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail tokens while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
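The ID inheritance described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `id_align_position_ids`, the row-major token layout, the floor-based cell lookup (standing in for the paper's interpolation-based mapping), and the choice of where the following text resumes are all assumptions made for the example. Separator and newline tokens are ignored.

```python
def id_align_position_ids(thumb_h, thumb_w, hi_h, hi_w, start_id=0):
    """Hypothetical sketch: give each high-resolution token the position ID
    of the thumbnail token that covers the same image region.

    Returns (thumb_ids, hi_ids, next_id), all in row-major order.
    """
    # Thumbnail tokens receive ordinary consecutive IDs.
    thumb_ids = [
        start_id + r * thumb_w + c
        for r in range(thumb_h)
        for c in range(thumb_w)
    ]

    # Each high-resolution token looks up the thumbnail cell it falls into
    # (assumed floor mapping) and inherits that cell's ID.
    hi_ids = []
    for r in range(hi_h):
        tr = r * thumb_h // hi_h
        for c in range(hi_w):
            tc = c * thumb_w // hi_w
            hi_ids.append(start_id + tr * thumb_w + tc)

    # Assumption: later tokens continue right after the thumbnail's ID range,
    # so the high-resolution tokens do not inflate the positional indices.
    next_id = start_id + thumb_h * thumb_w
    return thumb_ids, hi_ids, next_id


thumb_ids, hi_ids, next_id = id_align_position_ids(2, 2, 4, 4, start_id=5)
print(thumb_ids, hi_ids, next_id)
```

Under these assumptions, a 2×2 thumbnail with a 4×4 high-resolution grid consumes only four position IDs, whereas purely sequential numbering would consume twenty. Text that follows the image therefore stays closer, in ID terms, to every image token.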

Paper Structure

This paper contains 27 sections, 24 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Intuitive presentation of the original high-resolution method and ID-Align.
  • Figure 2: Flowchart of the Dynamic High-Resolution Method
  • Figure 3: The Long-term Decay Property of RoPE. We randomly sampled 100 text data points from Wikitext and randomly selected 10 pairs of q-k from each layer of the Vicuna-7B model for computation.
  • Figure 4: Attention distributions from the red region in the high-resolution image and the red text towards thumbnail tokens. Figure 4a shows the data example. Figures 4b and 4c depict the attention distribution from the red region, and Figures 4d and 4e show the attention distribution from the red text.
  • Figure 5: Simulation of RoPE's Long-term Properties under Different $\mathbf{\mu}_q$ and $\mathbf{\mu}_k$
  • ...and 8 more figures
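
The long-term decay that Figures 3 and 5 refer to can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes query and key are both all-ones vectors, a head dimension of 128, and the common RoPE base of 10000, and the function names are hypothetical. Under these assumptions each rotated 2-D pair contributes 2·cos(m·θ_i) to the attention logit at relative distance m, with θ_i = base^(-2i/d).

```python
import math


def rope_score(distance, dim=128, base=10000.0):
    """Attention logit between q = k = all-ones vectors under RoPE.

    Each 2-D pair contributes 2 * cos(distance * theta_i),
    where theta_i = base ** (-2 * i / dim).
    """
    return sum(
        2.0 * math.cos(distance * base ** (-2.0 * i / dim))
        for i in range(dim // 2)
    )


def mean_score(distances, dim=128):
    """Average logit over a range of relative distances (smooths oscillation)."""
    scores = [rope_score(m, dim) for m in distances]
    return sum(scores) / len(scores)


near = mean_score(range(1, 11))        # tokens 1 to 10 positions apart
far = mean_score(range(1000, 1011))    # tokens about 1000 positions apart
print(f"near: {near:.1f}  far: {far:.1f}")
```

The averaged logit is noticeably lower at distances around 1000 than at distances below 10, which is the effect that makes thousands of intervening image tokens costly for text–image and thumbnail–high-resolution interaction.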