Object-Centric Framework for Video Moment Retrieval

Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

TL;DR

The paper addresses video moment retrieval for object-oriented queries by moving beyond frame-level representations to an object-centric paradigm. It constructs object and relationship tracklets via scene graphs and processes them with a relational tracklet transformer to capture fine-grained spatio-temporal dynamics. Empirically, the approach achieves state-of-the-art results on Charades-STA, QVHighlights, and TACoS, and ablations underscore the value of explicit object-level modeling and scene-graph quality. The work also discusses computational cost and notes that faster open-vocabulary scene graph generation could improve future systems.

Abstract

Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluate our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that it outperforms existing state-of-the-art methods on all three.

Paper Structure

This paper contains 14 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison between the previous frame-level approach and the proposed object-centric approach. The frame-level approach fails to capture subtle yet critical state changes—such as the opening or closing of a laptop—resulting in inaccurate temporal localization. In contrast, the proposed object-centric approach explicitly tracks the states and interactions of relevant objects over time, enabling more precise moment localization.
  • Figure 2: Overview of our object-centric moment retrieval framework. Given a video and a query, the framework first extracts query-relevant objects by parsing the query using a scene graph parser. It then constructs object-level feature sequences by embedding both visual and semantic information of these objects and their relationships into tracklets obtained from scene graph generation and object tracking. These feature sequences are concatenated with the query's textual feature and passed to a relational tracklet transformer, which models the spatio-temporal correlations among objects and relationships. The resulting representation enables accurate moment classification and localization by capturing query-relevant object state changes.
  • Figure 3: Comparison between three transformer block variants. The proposed variant (right) integrates a relational mask, derived from scene graph information, into the spatial self-attention mechanism, enabling the model to attend more effectively to semantically relevant object pairs.
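
The Figure 3 caption says that a relational mask derived from scene graph information is integrated into spatial self-attention, but it does not say how the mask is built. The sketch below is one plausible reading, not the authors' implementation. It assumes that each node may attend to itself and to any node it shares a scene-graph edge with, that edges are treated as symmetric, and that attention is single-head with identity query/key/value projections. The names `relational_mask` and `masked_self_attention` are hypothetical.

```python
import numpy as np


def relational_mask(num_nodes, edges):
    """Boolean mask where entry (i, j) is True if node i may attend to node j.

    Assumption: nodes attend to themselves and to scene-graph neighbours,
    and every relation is treated as symmetric.
    """
    mask = np.eye(num_nodes, dtype=bool)  # self-attention is always allowed
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True
    return mask


def masked_self_attention(x, mask):
    """Single-head self-attention over the nodes of one frame.

    x:    (num_nodes, dim) node features
    mask: (num_nodes, num_nodes) boolean relational mask
    Identity projections are used for brevity (an illustrative simplification).
    """
    dim = x.shape[-1]
    scores = x @ x.T / np.sqrt(dim)
    # Disallowed pairs get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    # The diagonal is always allowed, so each row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Toy frame: node 0 = person, node 1 = laptop, node 2 = table.
# The only scene-graph relation links person and laptop.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
mask = relational_mask(3, [(0, 1)])
output, weights = masked_self_attention(features, mask)
```

In this toy frame the table has no relation to the other nodes, so it attends only to itself and its feature passes through unchanged, while the person and laptop exchange information. A real model would add learned projections and multiple heads, and would apply the mask per frame before the temporal attention step.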