Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma

TL;DR

Omni-RGPT is a unified framework for region-level understanding in both images and videos. It introduces Token Mark, a fixed set of region tokens embedded into visual regions and text prompts to keep region references consistent across frames. A Temporal Region Guide Head further stabilizes region interpretation in videos without relying on tracklets. The authors build RegVID-300k, a large-scale region-level video instruction dataset generated with GPT-4o-assisted captioning and hallucination mitigation. Empirical results show state-of-the-art performance on image-based Visual Commonsense Reasoning and video-based Causal-VidQA, along with strong captioning and referring expression comprehension performance. Because the same tokens mark a region in both the visual features and the text prompt, a single model handles region-centric reasoning for images and videos.

Abstract

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
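
The Token Mark idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the mark vector is simply added to the visual features under a region mask, and that a placeholder such as `<region1>` in the text prompt is replaced by the same vector. The function names, the placeholder syntax, and the random initialization are all hypothetical, and the model's projection layers are omitted.

```python
import random


def make_token_marks(num_marks, dim, seed=0):
    """Hypothetical pool of mark vectors (learned parameters in a real model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_marks)]


def embed_marks(features, region_masks, marks):
    """Add each region's mark vector to every feature cell its mask covers.

    features: H x W x D nested lists.
    region_masks: {mark_index: H x W mask of 0/1 values} (assumed layout).
    Returns a new grid; the input grid is left unchanged.
    """
    out = [[cell[:] for cell in row] for row in features]
    for mark_idx, mask in region_masks.items():
        mark = marks[mark_idx]
        for y, mask_row in enumerate(mask):
            for x, inside in enumerate(mask_row):
                if inside:
                    out[y][x] = [f + m for f, m in zip(out[y][x], mark)]
    return out


def build_text_embeddings(prompt_tokens, marks, word_embedding):
    """Replace placeholders like '<region1>' with the matching mark vector."""
    embedded = []
    for tok in prompt_tokens:
        if tok.startswith("<region") and tok.endswith(">"):
            embedded.append(marks[int(tok[len("<region"):-1])])
        else:
            embedded.append(word_embedding(tok))
    return embedded


# Usage: mark the top-left cell of a 2 x 2 feature grid with mark 1,
# then refer to the same region from the text prompt.
marks = make_token_marks(num_marks=4, dim=3)
features = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
marked = embed_marks(features, {1: [[1, 0], [0, 0]]}, marks)
text = build_text_embeddings(
    ["What", "is", "<region1>", "doing?"], marks, lambda tok: [0.0] * 3
)
```

The point of the sketch is that the visual cell covered by the mask and the `<region1>` slot in the prompt carry the same vector, which is the direct connection between visual and text tokens that the abstract describes.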

Paper Structure (35 sections, 2 equations, 6 figures, 33 tables)

This paper contains 35 sections, 2 equations, 6 figures, 33 tables.

Figures (6)

  • Figure 1: Representative demo examples of Omni-RGPT. We introduce a unified multimodal large language model that integrates region-level understanding for both images and videos. Given user-defined localized region inputs (boxes or masks) accompanied by a corresponding text prompt, Omni-RGPT generates responses tailored to the visual context of each region for both images and videos.
  • Figure 2: Method comparison. (a) RoI-based methods generate visual region prompts using RoI-aligned visual features, potentially leading to temporal drift in the visual features of the target object in the video domain. (b) In contrast, our Token Mark is assigned to the corresponding region, preserving a consistent spatio-temporal target reference.
  • Figure 3: (a) Overview: Omni-RGPT enables region-level understanding across image and video inputs. Given region prompts (e.g., boxes or masks) in a single image or the initial frame of a video, we assign Token Mark (a set of vectors serving as spatio-temporal region indicators) to the region. These vectors are embedded into the spatial region localized by the region prompt and directly injected into both visual and text prompts to indicate the target. (b) Auxiliary Head: We further introduce a Temporal Region Guide Head to achieve robust region understanding in videos without relying on tracklet prompts. Building on Token Mark's consistent representation of target objects across frames, this auxiliary task classifies the target Token Mark for visual tokens in subsequent frames.
  • Figure 4: Overview of our instruction sample generation pipeline. From a video with region masklets and nouns, region-level captions containing contextual and temporal information about the regions are generated by GPT-4o (left). Then, hallucinations in the captions are mitigated (middle). Lastly, instruction samples that cover diverse aspects of the regions are generated (right).
  • Figure 5: Heatmap of Temporal Region Guide Head outputs.
  • ...and 1 more figure
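
The auxiliary task in Figure 3(b), which classifies the target Token Mark for each visual token in later frames, can be sketched as a per-token classifier trained with cross-entropy. This is a minimal illustration under stated assumptions, not the paper's head: the single linear layer, the extra class 0 for tokens outside every region, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guide_head_logits(token, weights, bias):
    """Linear classifier: one logit per class.

    Assumed class layout: 0 = no region, k = Token Mark k - 1.
    """
    return [
        sum(w * t for w, t in zip(row, token)) + b
        for row, b in zip(weights, bias)
    ]


def guide_head_loss(tokens, labels, weights, bias):
    """Mean cross-entropy over the visual tokens of the later frames."""
    total = 0.0
    for token, label in zip(tokens, labels):
        probs = softmax(guide_head_logits(token, weights, bias))
        total += -math.log(probs[label])
    return total / len(tokens)


def predict_mark(token, weights, bias):
    """Index of the highest-scoring class for one visual token."""
    logits = guide_head_logits(token, weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)
```

In this sketch the labels for later frames would come from region annotations available at training time, so that tracklet prompts are not needed at inference; how the paper derives those labels is not stated in the captions above.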