Moment Quantization for Video Temporal Grounding

Xiaolong Sun, Le Wang, Sanping Zhou, Liushuai Shi, Kun Xia, Mengnan Liu, Yabing Wang, Gang Hua

TL;DR

This work tackles video temporal grounding (VTG) by reframing moments as discrete vectors through Moment Quantization. MQVTG introduces a learnable moment codebook and two progressive implementations (clip quantization and moment quantization), along with a soft-quantization strategy to preserve visual diversity, and prior initialization plus joint projection to align codewords with temporal structure. The method is compatible with encoder-only and encoder-decoder VTG architectures and is trained with a composite loss including $L_{mr}$, $L_{hd}$, $L_{mq}$, and $L_{align}$, where $L_{mq} = L_{cb} + \lambda_{cmt} L_{cmt}$ and $L_{overall} = L_{mr} + \lambda_{hd} L_{hd} + \lambda_{mq} L_{mq} + \lambda_{align} L_{align}$. Extensive experiments on six benchmarks demonstrate state-of-the-art performance, with qualitative analyses showing improved foreground grouping and foreground-background separation. The approach is lightweight to integrate and exhibits strong generalizability across models and datasets, offering a practical plug-and-play solution to enhance VTG discrimination.
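The way the loss terms named above combine can be sketched in a few lines of Python. This is only an illustration of the weighted sum, not the authors' training code: the function names, the argument names, and the example values are hypothetical, and the weighting coefficients are written here as `lambda_*` scalars whose actual settings are not given in this summary.

```python
def mq_loss(l_cb, l_cmt, lambda_cmt):
    """Moment-quantization loss: L_mq = L_cb + lambda_cmt * L_cmt."""
    return l_cb + lambda_cmt * l_cmt


def overall_loss(l_mr, l_hd, l_mq, l_align, lambda_hd, lambda_mq, lambda_align):
    """Overall objective: L_mr plus the weighted L_hd, L_mq and L_align terms."""
    return l_mr + lambda_hd * l_hd + lambda_mq * l_mq + lambda_align * l_align


# Arbitrary illustrative numbers (assumptions, not values from the paper).
l_mq = mq_loss(l_cb=1.0, l_cmt=2.0, lambda_cmt=0.25)
total = overall_loss(l_mr=1.0, l_hd=2.0, l_mq=l_mq, l_align=4.0,
                     lambda_hd=0.5, lambda_mq=2.0, lambda_align=0.25)
print(l_mq, total)
```

Keeping the weights as explicit arguments makes it easy to switch individual terms off (weight 0) when ablating a component.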

Abstract

Video temporal grounding is a critical video understanding task that aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant moments from irrelevant ones. Previous methods, which focus on learning continuous features, exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information caused by direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
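The contrast between hard quantization and the clustering-style matching described in the abstract can be illustrated with a small pure-Python sketch. It rests on several assumptions that are not stated in the source: nearest-codeword matching by squared Euclidean distance, a mean squared distance as the clustering objective, and the helper names `sq_dist`, `nearest_codeword`, `hard_quantize`, and `soft_quantize`. It is not the authors' implementation, and it omits the prior-initialization and joint-projection strategies.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(feature, codebook):
    """Index of the codeword closest to `feature` (first index wins ties)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def hard_quantize(features, codebook):
    """Classic vector quantization: replace each feature by its nearest codeword."""
    return [list(codebook[nearest_codeword(f, codebook)]) for f in features]


def soft_quantize(features, codebook):
    """Clustering-style matching: keep the continuous features unchanged and
    return the codeword assignments plus a mean squared-distance loss."""
    assignments = [nearest_codeword(f, codebook) for f in features]
    loss = sum(sq_dist(f, codebook[k]) for f, k in zip(features, assignments))
    return features, assignments, loss / len(features)


# Toy data (hypothetical): two codewords, three 2-D moment features.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.0]]

print(hard_quantize(features, codebook))  # features collapse onto codewords
kept, assignments, loss = soft_quantize(features, codebook)
print(kept, assignments, loss)            # features survive; loss pulls clusters together
```

In the hard variant the first and third features become identical, which discards their differences; in the soft variant they stay distinct while the loss still encourages them to gather around the same codeword. In an autograd framework this distance would typically be split into a codebook term and a commitment term using stop-gradients, which plain Python cannot express.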

Paper Structure

This paper contains 40 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) Comparison of codebook vectors, foreground and background features between TR-DETR [sun2024tr] and our method for a given example. Compared to previous methods focused on learning continuous features, our method, aided by codebook vectors, achieves better foreground aggregation and foreground-background separation. (b) Visualization of moment quantization. The moment quantization discriminates foreground and background moments. The foregrounds are represented by red-related vectors, while the backgrounds are represented by other discrete vectors. The quantized features bring discriminative information to generate more accurate localization.
  • Figure 2: Comparison of three quantization methods including (a) the classic image quantization, (b) clip quantization that is a simple implementation of vector quantization for videos, and (c) the improved moment quantization for video temporal grounding. The design of the moment codebook used in moment quantization is shown in (d).
  • Figure 3: The architectures of MQVTG, including the encoder-only architecture and encoder-decoder (DETR) architecture.
  • Figure 4: Visualization of effective codebook vectors, foreground and background features in the latent space. (a), (b) and (c) are three successful cases, and (d) is a failure case. For better understanding, we provide both foreground and similar background frames for each example. With codebook assistance, our method achieves strong foreground aggregation and foreground-background separation across scenarios.
  • Figure 5: Evolution of effective codebook vectors during training.
  • ...and 1 more figure