Multi-granularity Correspondence Learning from Long-term Noisy Videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

TL;DR

This work tackles temporal correspondence learning for long-form instructional videos in which clips and captions are misaligned at both coarse (clip-caption) and fine (frame-word) granularity, a problem the authors call multi-granularity noisy correspondence (MNC). It introduces Norton, a unified optimal transport (OT) framework that learns video-paragraph and clip-caption relationships via Sinkhorn-based transport, combined with a soft-maximum fine-grained alignment and an alignable prompt bucket that filters out irrelevant clips and captions. Key innovations include a noise-robust OT objective for long-range temporal learning, a soft-maximum operator that identifies crucial words and key frames, and a faulty-negative exploitation strategy that makes better use of in-batch negatives. Experiments on video-paragraph retrieval, text-to-video retrieval, VideoQA, and action segmentation demonstrate Norton’s effectiveness and computational efficiency, with ablations validating each component’s contribution. The approach improves temporal understanding in long videos and offers a scalable, noise-robust paradigm for multi-modal learning, with potential extension to additional modalities.

Abstract

Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos. To address this issue, one feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potential faulty negative samples in clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, VideoQA, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
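
The last mechanism in the abstract, rectifying the clip-caption alignment target with an OT assignment, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform marginals, the regularisation strength `eps`, and the mixing weight `beta` are all assumptions chosen for illustration. The sketch runs Sinkhorn iterations on an in-batch clip-caption similarity matrix and blends the resulting transport plan with the usual one-hot target, so that a semantically similar "negative" receives a small positive weight instead of being treated as a pure negative.

```python
import math


def sinkhorn(sim, eps=0.5, n_iters=100):
    """Entropy-regularised OT with uniform marginals (an assumption).

    sim: list of rows of similarities, higher means a better match.
    Returns a transport plan Q whose rows sum to about 1/n and whose
    columns sum to 1/m.
    """
    n, m = len(sim), len(sim[0])
    # Gibbs kernel; subtracting the global max keeps exp() well-behaved.
    s_max = max(max(row) for row in sim)
    K = [[math.exp((s - s_max) / eps) for s in row] for row in sim]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to carry mass 1/n, then columns to carry mass 1/m.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def rectified_targets(sim, beta=0.3, eps=0.5):
    """Blend the one-hot (identity) target with the OT assignment.

    beta is a hypothetical mixing weight; n * Q rescales the plan so
    that each row sums to about 1 before mixing.
    """
    n = len(sim)
    Q = sinkhorn(sim, eps=eps)
    return [
        [(1.0 - beta) * (1.0 if i == j else 0.0) + beta * n * Q[i][j]
         for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Toy batch: clip 0 also matches caption 1 (a potential faulty negative).
    sim = [
        [0.9, 0.8, 0.1],
        [0.2, 0.9, 0.1],
        [0.1, 0.2, 0.9],
    ]
    for row in rectified_targets(sim):
        print([round(x, 3) for x in row])
```

In the toy batch, the entry for clip 0 and caption 1 ends up with more target mass than the clearly unrelated clip 0 and caption 2 pair, which is the behaviour the abstract describes for faulty negatives. A soft target of this kind would typically be plugged into a cross-entropy style contrastive loss in place of the identity matrix.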

Paper Structure (40 sections, 17 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Our observation on multi-granularity noisy correspondence (MNC) in video understanding. (Left) The green timeline denotes the alignable captions while the red timeline indicates the unalignable captions. The green text in $\mathbf{t}_5$ denotes partially correlated words w.r.t. $\mathbf{v}_5$. (Right) The dashed line represents the original alignment according to timestamps and the red block indicates the misaligned clip-caption pair. The green block denotes the ground-truth alignment. The solid line denotes the re-alignment by Dynamic Time Warping (Müller, 2007), which struggles to handle noisy correspondence well.
  • Figure 2: Overview of our multi-granularity correspondence learning. We perform video-paragraph contrastive learning to capture long-term temporal correlations from a fine-to-coarse perspective. Specifically, we first utilize the log-sum-exp operator on the frame-word similarity matrix to obtain fine-grained similarity between clip and caption. Additionally, we append an alignable prompt bucket on the clip-caption similarity matrix to filter out the irrelevant clips or captions. By applying Sinkhorn iterations on the clip-caption similarity matrix, we effectively tackle the asynchronous problem and obtain the optimal transport distance as the video-paragraph similarity.
  • Figure 3: Visualization of the re-alignment.
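
The first two steps described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the symmetric averaging of the two directions, the temperature `alpha`, the bucket value `p`, and both function names are assumptions. The first function applies a log-sum-exp (smooth maximum) to a frame-word similarity matrix to get a single clip-caption score; the second pads a clip-caption similarity matrix with one extra row and column of constant value, which is one way to realise an alignable prompt bucket that can absorb clips or captions with no good match.

```python
import math


def soft_max_similarity(frame_word, alpha=0.1):
    """Clip-caption similarity from a frame-word similarity matrix.

    For each frame, take a smooth maximum over words (log-sum-exp with
    temperature alpha), then average over frames. Do the same with the
    roles swapped and average the two directions (an assumed design).
    """
    def lse(values):
        # alpha * log(sum(exp(x / alpha))), computed stably.
        m = max(values)
        return m + alpha * math.log(sum(math.exp((x - m) / alpha) for x in values))

    rows = frame_word                  # one row per frame
    cols = list(zip(*frame_word))      # one column per word
    frame_to_word = sum(lse(r) for r in rows) / len(rows)
    word_to_frame = sum(lse(c) for c in cols) / len(cols)
    return 0.5 * (frame_to_word + word_to_frame)


def append_prompt_bucket(sim, p=0.3):
    """Pad a clip-caption matrix with one extra row and column filled with p.

    Entries whose similarities all fall below p can then be transported
    to the bucket rather than forced onto a real clip or caption.
    """
    padded = [list(row) + [p] for row in sim]
    padded.append([p] * (len(sim[0]) + 1))
    return padded


if __name__ == "__main__":
    # Two frames, two words: only frame 0 and word 0 match strongly.
    frame_word = [[0.9, 0.1],
                  [0.2, 0.1]]
    print(soft_max_similarity(frame_word))

    clip_caption = [[0.8, 0.1],
                    [0.1, 0.2]]
    for row in append_prompt_bucket(clip_caption):
        print(row)
```

As `alpha` shrinks, the log-sum-exp approaches a hard maximum, so the score is dominated by the best-matching word for each frame; larger values blend in weaker matches. The padded matrix is what the Sinkhorn iterations mentioned in the caption would then operate on.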