BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

Pilhyeon Lee; Hyeran Byun

TL;DR

BAM-DETR addresses center misalignment in temporal sentence grounding by introducing a boundary-aligned moment representation $(p, d_s, d_e)$, enabling direct boundary prediction from a salient anchor. It employs a dual-pathway decoder that separately refines the anchor via global attention and the boundaries via boundary-focused attention, coupled with a localization-oriented, quality-based scoring and Hungarian matching for end-to-end training. The approach yields state-of-the-art results on QVHighlights, Charades-STA, and TACoS, with robust performance under anti-biased conditions and improved boundary alignment as shown by boundary-hit metrics. The method’s boundary-centric design and quality-based ranking offer practical improvements for precise moment localization in videos, with code available at the authors’ GitHub repository.
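To make the triplet concrete, the following minimal sketch assumes that $p$ is an anchor time inside the moment and that $d_s$ and $d_e$ are non-negative distances from the anchor to the start and end boundaries, so the moment spans $[p - d_s, p + d_e]$. The function names and the example timestamps are illustrative assumptions, not taken from the authors' code.

```python
def triplet_to_interval(p, d_s, d_e):
    """Decode an anchor-based moment (p, d_s, d_e) into (start, end).

    Assumes d_s and d_e are distances from the anchor p to each boundary.
    """
    return (p - d_s, p + d_e)


def center_length_to_interval(c, l):
    """Decode the conventional (center, length) moment into (start, end)."""
    return (c - l / 2.0, c + l / 2.0)


def interval_to_triplet(start, end, p):
    """Encode (start, end) relative to any anchor p inside the interval."""
    if not start <= p <= end:
        raise ValueError("anchor must lie within the moment")
    return (p, p - start, end - p)


# Hypothetical moment spanning 2.0 s to 6.0 s.
start, end = 2.0, 6.0

# Several different anchors decode to the same boundaries.
decoded = [triplet_to_interval(*interval_to_triplet(start, end, p))
           for p in (2.5, 4.0, 5.75)]

# The center-length form recovers the moment only from the exact center.
exact = center_length_to_interval(4.0, 4.0)
shifted = center_length_to_interval(4.5, 4.0)
```

In this sketch every anchor inside the interval yields the same decoded boundaries, whereas shifting the center by half a second moves both boundaries of the center-length prediction.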

Abstract

Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have achieved notable progress by predicting the center and length of a target moment. However, they suffer from center misalignment caused by the inherent ambiguity of moment centers, which leads to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center; it suffices to predict any anchor point within the interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization quality are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code is available at https://github.com/Pilhyeon/BAM-DETR.
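The idea of quality-based ranking can be illustrated with a small sketch. It assumes that localization quality is measured by temporal IoU between a proposal and the ground-truth moment; the exact quality definition used by the authors is not stated here, and at inference time the score would come from the model rather than from the ground truth. All names and values below are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rank_by_quality(proposals, quality_scores):
    """Return proposals sorted so that higher quality scores come first."""
    order = sorted(range(len(proposals)),
                   key=lambda i: quality_scores[i], reverse=True)
    return [proposals[i] for i in order]


# Hypothetical ground-truth moment and three candidate proposals.
gt = (2.0, 6.0)
proposals = [(2.0, 4.0), (2.0, 6.0), (3.0, 7.0)]

# Stand-in for predicted quality: the IoU of each proposal with the ground truth.
scores = [temporal_iou(p, gt) for p in proposals]
ranked = rank_by_quality(proposals, scores)
```

Under this assumption, the incomplete proposal `(2.0, 4.0)` covers only half of the moment and is ranked last, while the fully aligned proposal is ranked first.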

Paper Structure

This paper contains 28 sections, 13 equations, 10 figures, and 10 tables.

Introduction
Related Works
Temporal Sentence Grounding in Videos
Detection Transformers
Method
Motivation.
Overview
Feature Extraction
Multimodal Encoder
Dual-pathway Decoder
Anchor updating pathway.
Boundary updating pathway.
Moment prediction.
Quality-based Scoring
Matching
...and 13 more sections

Figures (10)

Figure 1: Comparison of moment modeling approaches under the scenario of an ambiguous center from QVHighlights. (a) The conventional method formulates a moment with a tuple of ($c$, $l$). (b) In contrast, we propose to model it with a triplet of ($p$, $d_s$, $d_e$).
Figure 2: (a) Overview of the proposed BAM-DETR. (b) Details of the proposed dual-pathway decoding layer. It consists of two parallel pathways respectively for anchor and boundary updates, which refine previous moment predictions in a sequential manner.
Figure 3: Boundary-focused attention layer for starting queries.
Figure 4: Boundary hit rate
Figure 5: Visualization results
...and 5 more figures
