Multi-Modal Video Dialog State Tracking in the Wild

Adnen Abdessaied, Lei Shi, Andreas Bulling

TL;DR

This work tackles the challenge of multi-modal video dialog state tracking in real-world settings by proposing MST-MIXER, a framework that first performs modality-specific tracking to identify salient constituents, then learns local latent graphs for each modality, and finally composes these into a global multimodal graph to refine the Vision-Language Model backbone. It introduces a two-stage divide-and-conquer graph learning approach with variational inference to estimate latent graphs, guided by an ELBO-based objective and a multi-modal conditioning mechanism. Across five benchmarks (AVSD DSTC7/8/10, SIMMC 2.0, and NExT-QA), MST-MIXER sets new state-of-the-art results and demonstrates robustness to real-world multimodal data, with ablations confirming the importance of local/global graph learning, MMC, and initialization bias. The method advances practical dialog agents for complex, real-world video understanding by enabling explicit, learnable cross-modal structure that enhances answer generation.

Abstract

We present MST-MIXER, a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: (1) they track only one modality (mostly the visual input), or (2) they target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses both limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
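
The first three steps of the abstract (per-modality tracking, local graphs, global graph) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `top_k`, `similarity_graph`, and `mix`, the saliency scores, and the shared feature dimension are all hypothetical, and thresholded cosine similarity stands in for the variational latent graph learning the paper describes. The final step, enhancing the VLM hidden states, is omitted.

```python
import math


def top_k(features, scores, k):
    """Keep the k highest-scoring constituents of one modality, in input order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in sorted(best)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(nodes, threshold=0.5):
    """Stand-in for latent graph learning (assumption): connect two nodes
    when their cosine similarity reaches the threshold. No self-loops."""
    n = len(nodes)
    return [
        [1 if i != j and cosine(nodes[i], nodes[j]) >= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


def mix(modalities, k=2):
    """modalities maps a name to (features, scores).

    Assumes all features were already projected to one shared dimension.
    Returns the selected nodes, one local graph per modality, and a global
    graph built over the concatenation of all selected nodes.
    """
    selected = {name: top_k(f, s, k) for name, (f, s) in modalities.items()}
    local_graphs = {name: similarity_graph(nodes) for name, nodes in selected.items()}
    all_nodes = [node for nodes in selected.values() for node in nodes]
    global_graph = similarity_graph(all_nodes)
    return selected, local_graphs, global_graph


# Hypothetical toy input: two modalities with 2-d features and saliency scores.
toy = {
    "video": ([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]], [0.9, 0.1, 0.8]),
    "text": ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.6]),
}
selected, local_graphs, global_graph = mix(toy, k=2)
```

In this toy input the global graph links the first text node to both selected video nodes, edges that no single-modality local graph can contain. That cross-modal structure is what the second stage is meant to capture.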

Paper Structure (48 sections, 14 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: $\mathbb{MST}_\mathbb{MIXER}$ achieves SOTA results on various video-language tasks.
  • Figure 2: $\mathbb{MST}_\mathbb{MIXER}$ takes a video, a dialog history, and a question as input and autoregressively generates an answer as output. It uses a BART backbone adapted to deal with multi-modal input features and enhanced via our graph-based mixing approach.
  • Figure 3: In Stage I, $\mathbb{MST}_\mathbb{MIXER}$ first gathers multi-modal features $\{X_i\}$ from the previous BART layer and computes their respective initial local structures $\{\tilde{A}_i\}$. Then, it simultaneously learns the local latent multi-modal graphs and refines the features using a two-stream framework, i.e., $\{A'_{i,j}, A''_{i,j}\}_j$ and $\{Z'_{i,j}, Z''_{i,j}\}_j$, respectively. Finally, it outputs the final multi-modal latent graph $A_i$ used to compute the local ELBO loss $\mathcal{L}_\textrm{ELBO}^\textrm{local} = \frac{1}{N}\sum_{i=1}^N \mathcal{L}_\textrm{ELBO}^{\textrm{local}, i}$.
  • Figure 4: Overview of mixing stage II.
  • Figure 5: a) Larger values of $K$ make the learning of the global latent graphs more challenging. b) The local ELBO loss $\mathbf{\mathcal{L}_\textrm{ELBO}^\mathrm{local}}$ facilitates the learning of the global latent graphs. c) The global ELBO loss $\mathbf{\mathcal{L}_\textrm{ELBO}^\mathrm{global}}$ facilitates the learning of the local latent graphs. All models use SAM and audio features.
  • ...and 4 more figures
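
The local loss in the Figure 3 caption is a plain average of the per-modality terms over the $N$ modalities. A minimal sketch, assuming the per-modality ELBO losses have already been computed; the function name and the numbers are hypothetical:

```python
def mean_local_elbo(per_modality_losses):
    """Average the per-modality local ELBO losses, as in the Figure 3 caption:
    L_local = (1/N) * sum_i L_local_i."""
    if not per_modality_losses:
        raise ValueError("need at least one modality")
    return sum(per_modality_losses) / len(per_modality_losses)


# Hypothetical loss values for N = 3 modalities.
local_loss = mean_local_elbo([0.5, 1.5, 1.0])
```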