OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models

Ruoxiang Huang, Xindian Ma, Rundong Kong, Zhen Yuan, Peng Zhang

TL;DR

OMEGA addresses the misalignment between textual continuity and visual spatial structure in vision-language models by introducing Modality-Specific Position Encoding (MSPE) and Global Adaptive Encoding Step Scaling (GAESS). MSPE preserves separate coordinate dimensions for text and visuals using placeholders, maintaining sequence integrity and 2D visual geometry, while GAESS aligns information density across modalities via embedding-entropy–based scaling of visual position steps. The approach yields consistent improvements across multiple backbones and VQA benchmarks in both zero-shot and fine-tuned settings, outperforming modality-unified position encoding strategies with minimal architectural changes. This work provides a low-cost, generalizable method for enhancing cross-modal alignment in VLMs and suggests extensions to video-language modeling and other modalities.

Abstract

Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties: sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA introduces Global Adaptive Encoding Step Scaling (GAESS), which adaptively adjusts the position encoding step size of visual tokens based on the embedding entropy of both modalities. Experimental results demonstrate that OMEGA consistently enhances VLM performance across diverse architectures and VQA benchmarks. On visual-intensive tasks, OMEGA achieves up to 3.43% improvement over baseline position encoding strategies on Qwen2.5-VL-3B, with consistent gains observed across larger models including Qwen2.5-VL-7B and LLaVA-v1.5-7B.
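To make the idea of modality-specific indices concrete, the following is a minimal sketch and not the authors' implementation. It assumes three coordinate axes (text, row, column), a placeholder value of 0 on the visual axes for text tokens, and that visual tokens hold the text axis at the current text position; none of these details are stated in the abstract. The function name `mspe_indices` and the input format are hypothetical. The visual step size `gamma` is taken as a given input, since the abstract says only that GAESS derives it from the embedding entropy of both modalities.

```python
from typing import List, Tuple, Union

# A segment is either an int (a run of that many text tokens)
# or a (rows, cols) tuple (an image laid out as a grid of visual tokens).
Segment = Union[int, Tuple[int, int]]
Index = Tuple[float, float, float]  # (text axis, row axis, column axis)

# Hypothetical placeholder that text tokens carry on the two visual axes.
VISUAL_PLACEHOLDER = 0.0


def mspe_indices(segments: List[Segment], gamma: float = 1.0) -> List[Index]:
    """Assign one 3-axis positional index per token (illustrative only)."""
    indices: List[Index] = []
    t = 0  # next text position; only text tokens advance it
    for seg in segments:
        if isinstance(seg, int):
            for _ in range(seg):
                indices.append((float(t), VISUAL_PLACEHOLDER, VISUAL_PLACEHOLDER))
                t += 1
        else:
            rows, cols = seg
            for r in range(rows):
                for c in range(cols):
                    # Assumption: visual tokens keep the text axis fixed and
                    # step through the 2D grid in units of gamma.
                    indices.append((float(t), r * gamma, c * gamma))
    return indices


# Two text tokens, a 2x2 image, then two more text tokens.
example = mspe_indices([2, (2, 2), 2], gamma=0.5)
text_axis_of_text_tokens = [idx[0] for idx in example[:2] + example[6:]]
```

In this sketch the text axis of the text tokens runs 0, 1, 2, 3 with no gap where the image sits, while the image tokens keep their 2D grid layout at a step of `gamma` instead of 1.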

Paper Structure

This paper contains 25 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Empirical analysis of sequential continuity and spatial structure disruption. Sequential continuity is disrupted by randomly inserting positional index gaps into the text sequence, with each gap sized to match the number of visual tokens per image. Spatial structure is disrupted by randomly shuffling the positional indices of a proportion of visual tokens during index derivation. Left: The relationship between the number of Visual Gaps and the accuracy of Qwen2.5-VL-3B on ScienceQA. Right: The relationship between the proportion of shuffled visual tokens and the accuracy of Qwen2.5-VL-3B on MMBench.
  • Figure 2: Illustration of Modality-Specific Position Encoding (MSPE) compared with Modality-Unified 1D-PE and 2D-PE.
  • Figure 3: Illustration of Global Adaptive Encoding Step Scaling (GAESS) under Modality-Specific Position Encoding.
  • Figure 4: Illustration of Modality-Independent Position Encoding, which disrupts the spatial relationships between cross-modal tokens.
  • Figure 5: Examples of computing information entropy and $\gamma$ across different datasets.
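The two disruptions described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' experimental code: the function names, the choice of random token boundaries for the gaps, and the use of a permutation over the selected visual tokens are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def insert_text_gaps(num_text_tokens: int, num_gaps: int, gap_size: int,
                     rng: random.Random) -> List[int]:
    """1D text positions with `num_gaps` jumps of `gap_size` at random boundaries."""
    # A boundary i means the jump happens just before token i.
    boundaries = set(rng.sample(range(1, num_text_tokens), num_gaps))
    positions: List[int] = []
    offset = 0
    for i in range(num_text_tokens):
        if i in boundaries:
            offset += gap_size  # skip as many indices as one image would occupy
        positions.append(i + offset)
    return positions


def shuffle_visual_indices(indices: Sequence[Tuple[int, int]], proportion: float,
                           rng: random.Random) -> List[Tuple[int, int]]:
    """Permute the (row, col) indices of a random `proportion` of visual tokens."""
    n = len(indices)
    k = int(proportion * n)
    chosen = rng.sample(range(n), k)  # tokens whose indices get reassigned
    permuted = chosen[:]
    rng.shuffle(permuted)
    out = list(indices)
    for src, dst in zip(chosen, permuted):
        out[dst] = indices[src]
    return out


rng = random.Random(0)
# Six text tokens, two gaps, each gap as large as a 4-token image.
gapped = insert_text_gaps(num_text_tokens=6, num_gaps=2, gap_size=4, rng=rng)

# A 2x2 image grid with half of the tokens' indices shuffled.
grid = [(r, c) for r in range(2) for c in range(2)]
shuffled = shuffle_visual_indices(grid, proportion=0.5, rng=rng)
```

The gapped sequence still increases monotonically but is no longer contiguous, and the shuffled grid contains the same set of indices assigned to different tokens, which matches the two kinds of disruption the caption describes.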