Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

TL;DR

This work tackles the mismatch between short, concise text and semantically rich videos in text-video retrieval by modeling text as a stochastic embedding, or text mass, rather than a single point. It introduces T-MASS, which combines a learnable similarity-aware radius with a regularization based on a support text embedding to broaden the semantic coverage of text in the joint embedding space. Training computes the loss on stochastic text embeddings sampled from the text mass; inference samples several such embeddings and keeps the one most similar to the video. T-MASS improves R@1 over the baseline by 3% to 6.3% and achieves state-of-the-art results on five benchmarks (MSRVTT, LSMDC, DiDeMo, VATEX, and Charades).

Abstract

The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may be less expressive to capture the video embedding and empower the retrieval. In this study, we propose a new stochastic text modeling method T-MASS, i.e., text is modeled as a stochastic embedding, to enrich text embedding with a flexible and resilient semantic range, yielding a text mass. To be specific, we introduce a similarity-aware radius module to adapt the scale of the text mass upon the given text-video pairs. Plus, we design and develop a support text regularization to further control the text mass during the training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over baseline (3% to 6.3% by R@1). Also, T-MASS achieves state-of-the-art performance on five benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.

Paper Structure (12 sections, 10 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: The text paired with a relevant video can hardly describe the video's redundant semantics in full. Correspondingly, a single text embedding may be less expressive when handling the video information in the joint space. We propose a new text embedding, the text mass, with a resilient semantic range to better capture rich video clues.
  • Figure 2: Illustration of the proposed text-video retrieval method T-MASS, which adopts dual-branch CLIP (Radford et al., 2021) ($\phi_v$ and $\phi_t$) to extract frame features $[\mathbf{f}_1,...,\mathbf{f}_{T'}]$ and the text embedding $\mathbf{t}$. A feature fusion module $\psi$ then produces the video embedding $\mathbf{v}$. We develop a similarity-aware radius module $\mathcal{R}$ to facilitate the reparameterization (Kingma and Welling, 2013) of the stochastic text embedding $\mathbf{t}_s$, yielding a text mass in the joint space. During training, we compute the loss on $\mathbf{v}$ and a randomly sampled $\mathbf{t}_s$. During evaluation, we collect a group of $\mathbf{t}_s$ and select the one exhibiting the highest similarity with $\mathbf{v}$. We visualize the learned radius $\mathcal{R}$ for relevant/irrelevant pairs. More details are in the T-MASS subsection.
  • Figure 3: Dynamics of $\mathcal{R}$. We plot $|\mathcal{R}|_1$ for a relevant $t$-$v$ pair (the 130th in MSRVTT-1K, video on the right) and for the same query text paired with $999$ irrelevant videos. T-MASS learns precise text semantics for the relevant pair (smallest $|\mathcal{R}|_1$). This is typically observed on correctly retrieved pairs. More examples are in the supplementary material.
  • Figure 4: Support text regularization. Besides computing the loss between the video embedding $\mathbf{v}$ and the stochastic text embedding $\mathbf{t}_s$, we identify a support text embedding that lies along the direction from $\mathbf{v}$ to $\mathbf{t}$ and sits on the surface of the text mass; it serves as a proxy that enables shifting and scaling of the text mass.
  • Figure 5: Analysis of stochastic text embedding $\mathbf{t}_s$, text embedding $\mathbf{t}$, and video embedding $\mathbf{v}$ in a joint space. Left: Cosine similarities of irrelevant text-video pairs in embedding space. Right: Cross entropy values of relevant text-video pairs in embedding space. The proposed stochastic text embedding allows a lower similarity for irrelevant pairs and enables lower cross entropy loss for relevant pairs.
  • ...and 2 more figures
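
The sampling and selection steps described in the captions of Figures 2 and 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the radius is passed in as a plain number or array rather than produced by the learned module $\mathcal{R}$, cosine similarity is assumed as the similarity function, the support embedding is assumed to be reached by stepping from $\mathbf{t}$ toward $\mathbf{v}$, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def sample_text_mass(t, radius, n_samples, rng):
    """Draw stochastic text embeddings t_s = t + radius * eps, eps ~ N(0, I).

    `radius` stands in for the learned radius; it may be a scalar or an
    array broadcastable to the shape of `t` (an assumption of this sketch).
    """
    eps = rng.standard_normal((n_samples, t.shape[-1]))
    return t + radius * eps


def select_best_sample(samples, v):
    """Inference-time selection: keep the sample most similar to the video.

    Cosine similarity is assumed here as the similarity measure.
    """
    sims = l2_normalize(samples) @ l2_normalize(v)
    best = int(np.argmax(sims))
    return samples[best], float(sims[best])


def support_text(t, v, radius):
    """Support text embedding on the line through t and v, one radius from t.

    Stepping from t toward v is an assumption of this sketch.
    """
    direction = l2_normalize(v - t)
    return t + direction * radius


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.standard_normal(512)   # text embedding (random stand-in)
    v = rng.standard_normal(512)   # video embedding (random stand-in)

    samples = sample_text_mass(t, radius=0.1, n_samples=20, rng=rng)
    t_best, sim_best = select_best_sample(samples, v)
    t_sup = support_text(t, v, radius=0.1)
    print(samples.shape, t_best.shape, t_sup.shape, sim_best)
```

During training, a single sample from `sample_text_mass` would feed the loss together with $\mathbf{v}$; during evaluation, `select_best_sample` mirrors the step of collecting a group of $\mathbf{t}_s$ and keeping the one with the highest similarity.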