Artemis: Towards Referential Understanding in Complex Videos

Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian

TL;DR

The paper addresses the challenge of video-based referential understanding by introducing Artemis, a multimodal language model baseline that learns fine-grained, target-specific video representations through RoI tracking and selection. It constructs VideoRef45K, a 45K QA-paired benchmark, and trains Artemis via a three-stage pipeline that progressively aligns video features with language cues. The approach demonstrates strong quantitative performance and qualitative descriptiveness, outperforming several image-based and multi-frame baselines and serving as a building block for complex tasks like grounding and long-video summarization. This work advances fine-grained, interactive video understanding and offers a practical, scalable framework for integrating video reasoning with existing grounding and summarization tools.

Abstract

Videos carry rich visual information including object description, action, interaction, etc., but the existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target in the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient, three-stage training procedure. Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/qiujihao19/Artemis.

Paper Structure (16 sections, 14 figures, 6 tables)

This paper contains 16 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Artemis' ability in video-based dialogue. Notably, Artemis excels in video-based referring, outperforming existing MLLMs including Merlin [yu2023merlinempowering] and Video-LLaVA [lin2023videollava], which lack comprehensiveness, and Osprey [yuan2024osprey], which suffers from hallucination.
  • Figure 2: Left: the overall framework of Artemis, where an MLLM receives a text prompt together with spatial, temporal, and target-specific video features, and produces the answer. Right: the RoI tracking and selection mechanism to generate target-specific features. We use different IDs to show the clustering result. This figure is best viewed in color.
  • Figure 3: Artemis and Merlin for video-based referring. Note that Merlin needs the semantic class of <region> to be provided while Artemis does not. In each case, the orange rectangle indicates the input <region>, blue rectangles are the tracked RoIs, and yellow stars label the selected RoIs. Red and green texts indicate incorrect and correct answers, respectively. This figure is best viewed in color.
  • Figure 4: RoI manipulation increases the informativeness and diversity of RoIs. See Appendix D for details.
  • Figure 5: How RoI tracking and selection gradually improves the quality of video-based referring. In each example, the orange rectangle indicates the input <region>, blue rectangles are the tracked RoIs, and green and yellow stars label the uniformly sampled and K-means selected RoIs, respectively. Red and green texts highlight the incorrect and correct outputs. This figure is best viewed in color.
  • ...and 9 more figures
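
The captions for Figures 2 and 5 describe selecting a few representative RoIs from the tracked ones by clustering, in contrast to uniform sampling. The following is a minimal sketch of that idea, not the authors' implementation. The function names `select_rois_kmeans` and `select_rois_uniform`, the toy 2-D vectors standing in for per-frame RoI features, the deterministic farthest-point initialization, and the rule of keeping the RoI nearest to each centroid are all assumptions made for illustration.

```python
import numpy as np


def select_rois_uniform(num_rois, k):
    """Baseline: pick k RoI indices evenly spaced over the tracked sequence."""
    return [int(i) for i in np.linspace(0, num_rois - 1, num=k).round()]


def select_rois_kmeans(features, k, n_iters=20):
    """Cluster RoI features with K-means and keep the RoI nearest each centroid.

    features: (num_rois, dim) array-like, one feature vector per tracked RoI.
    Returns (sorted selected indices, cluster label for every RoI).
    Assumption: farthest-point initialization is used so the sketch is
    deterministic; the source does not specify an initialization.
    """
    feats = np.asarray(features, dtype=np.float64)
    n = feats.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of RoIs")

    # Farthest-point initialization: start at RoI 0, then repeatedly add
    # the RoI that is farthest from all centroids chosen so far.
    chosen = [0]
    min_d = np.linalg.norm(feats - feats[0], axis=1)
    while len(chosen) < k:
        nxt = int(min_d.argmax())
        chosen.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(feats - feats[nxt], axis=1))
    centroids = feats[chosen].copy()

    # Standard Lloyd iterations: assign each RoI, then recompute centroids.
    for _ in range(n_iters):
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members) > 0:  # leave an empty cluster's centroid as is
                centroids[c] = members.mean(axis=0)

    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # One representative per centroid; duplicates collapse, so fewer than k
    # indices can be returned when clusters degenerate.
    selected = sorted({int(dists[:, c].argmin()) for c in range(k)})
    return selected, labels


if __name__ == "__main__":
    # Nine tracked RoIs whose toy features form three appearance groups.
    toy = [[0, 0], [1, 0], [2, 0],
           [10, 10], [11, 10], [12, 10],
           [20, 0], [21, 0], [22, 0]]
    print(select_rois_kmeans(toy, k=3)[0])
    print(select_rois_uniform(len(toy), 3))
```

On this toy input the clustering-based selection returns the middle RoI of each appearance group, whereas uniform sampling only spaces indices evenly in time and does not look at the features at all. In practice the feature vectors would come from a visual encoder applied to the tracked boxes, which is outside the scope of this sketch.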