Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

Mengshi Qi, Changsheng Lv, Huadong Ma

TL;DR

This work tackles physical audiovisual commonsense reasoning under modality incompleteness by introducing RDCL, a framework that disentangles video content into static and dynamic factors via a Disentangled Sequential Encoder and enhances reasoning with a Counterfactual Learning Module that captures cross-object physical knowledge through an affinity-based graph and counterfactual interventions. An Incomplete Multi-Modal Learning Module further enables recovery of missing modalities by leveraging shared semantics across modalities. The approach is evaluated on PACS, showing consistent improvements over strong baselines and robustness to missing data, with detailed ablations validating the contributions of disentanglement, causal reasoning, and modality-completion components. Collectively, RDCL provides a versatile, plug-in solution that improves multimodal physical commonsense reasoning and offers insights for robust embodied AI systems.

Abstract

In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer physical commonsense about objects from both video and audio input, with the main challenge being how to imitate human reasoning ability, even when modalities are missing. Most current methods fail to take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in models impedes progress in inferring implicit physical knowledge. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space using a disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. To alleviate the incomplete modality data issue, we introduce a robust multimodal learning method that recovers the missing data by decomposing features into shared and modality-specific features. Our proposed method is a plug-and-play module that can be incorporated into any baseline, including VLMs. In experiments, we show that our proposed method improves the reasoning accuracy and robustness of baseline methods and achieves state-of-the-art performance.
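
The abstract says the disentangled sequential encoder maximizes mutual information with a contrastive loss. As a rough illustration of that idea only, the sketch below implements a generic InfoNCE-style contrastive loss in NumPy. It is not the paper's exact objective: the function name `info_nce`, the temperature value, and the toy "static factor" arrays are all assumptions made for this example.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: row i of `anchors` should match row i of `positives`."""
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (N, N) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; all other entries act as negatives.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
static = rng.normal(size=(8, 16))                     # hypothetical static factors
augmented = static + 0.05 * rng.normal(size=(8, 16))  # hypothetical positive views

loss_matched = info_nce(static, augmented)
loss_shuffled = info_nce(static, np.roll(augmented, 1, axis=0))
print(loss_matched, loss_shuffled)
```

Minimizing a loss of this form pulls each latent factor toward its own positive view and away from the other samples in the batch, which is why it is commonly used as a tractable lower-bound surrogate for mutual information. Correctly paired inputs give a lower loss than mismatched ones.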

Paper Structure

This paper contains 28 sections, 31 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of our main tasks. Task (a) involves AVQA for physical commonsense reasoning, while task (b) addresses robust AVQA, which deals with missing modality data for physical commonsense reasoning encountered in real-world scenarios.
  • Figure 2: Illustration of our proposed DCL model. Part (a) presents the overall structure: the input videos and their accompanying audio are first encoded by the respective visual and audio encoders. The Disentangled Sequential Encoder in Part (b) then separates the video features into static and dynamic factors using an LSTM-based Variational Autoencoder (VAE). The Counterfactual Learning Module in Part (c) constructs the affinity matrix $A$, which acts as a confounder, and derives the prediction $\hat{Y}_{X, A_X}$ and the counterfactual outcome $\hat{Y}_{X, A^*_X}$. Finally, $\hat{Y}_{TIE}$ is computed by subtracting the counterfactual outcome from the prediction and is used to optimize the model.
  • Figure 3: Illustration of our proposed RDCL model. The upper part shows our proposed Incomplete Multi-Modal Learning Method (IMLM) within RDCL during the training stage, when the training data is modality-complete. IMLM comprises a unique encoder and a shared encoder, along with a Shared Feature Memory and a Unique Feature Memory. As a plug-in model, IMLM feeds the features it processes into the Counterfactual Learning Module (CLM). The lower part presents RDCL during the inference stage when audio data is missing; in that case, the average over the shared feature memory substitutes for the missing audio feature.
  • Figure 4: Qualitative results of the baseline with and without our proposed method, where 'Material' refers to the material of the object. Correct answers are shown in green and incorrect ones in red.
  • Figure 5: Performance comparison of various hyperparameters. Figures (a) and (b) show the performance of AudioCLIP with different frame lengths $T$ in DSE and DSE+ on the PACS and PACS-Material datasets. Figures (c) and (d) illustrate the performance of AudioCLIP with varying numbers of top-$K$ physical knowledge relationships on the same datasets.
  • ...and 2 more figures
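
The captions of Figures 2 and 3 describe two mechanisms that a small example can make concrete: computing $\hat{Y}_{TIE}$ as the difference between the factual and counterfactual predictions, and replacing a missing audio feature with the average of the shared feature memory. The NumPy sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The cosine-similarity top-$K$ affinity, the linear prediction head, the uniform counterfactual affinity `a_star`, and all function names are hypothetical choices made here.

```python
import numpy as np

def affinity_matrix(features, top_k):
    """Row-normalised affinity that keeps the top_k most similar rows (assumed form)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    idx = np.argsort(-sim, axis=1)[:, :top_k]
    mask = np.zeros_like(sim, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    weights = np.where(mask, np.exp(sim), 0.0)
    return weights / weights.sum(axis=1, keepdims=True)

def predict(features, affinity, weight):
    """Hypothetical linear head on features aggregated through the affinity graph."""
    return (features + affinity @ features) @ weight

def total_indirect_effect(features, affinity, counterfactual_affinity, weight):
    """Y_TIE: factual prediction minus the prediction under the intervened affinity."""
    factual = predict(features, affinity, weight)
    counterfactual = predict(features, counterfactual_affinity, weight)
    return factual - counterfactual

def fill_missing_audio(shared_memory):
    """Stand-in for a missing audio feature: mean over stored shared features."""
    return shared_memory.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # toy features for 4 samples
w = rng.normal(size=(8, 2))   # toy weights for a 2-way answer
a = affinity_matrix(x, top_k=2)
a_star = np.full_like(a, 1.0 / a.shape[1])  # assumed intervention: uniform affinity
tie = total_indirect_effect(x, a, a_star, w)

memory = rng.normal(size=(10, 8))  # toy shared feature memory
audio_stub = fill_missing_audio(memory)
print(tie.shape, audio_stub.shape)
```

In this sketch the effect is exactly zero when the counterfactual affinity equals the factual one, so any nonzero value is attributable to the affinity structure alone. The choice of intervention (uniform here) is the main design decision a real implementation would need to settle.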