Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

Mengshi Qi; Changsheng Lv; Huadong Ma

TL;DR

This work tackles physical audiovisual commonsense reasoning when some modalities are missing. It introduces RDCL, a framework built from three components. A Disentangled Sequential Encoder separates video content into static and dynamic factors. A Counterfactual Learning Module captures cross-object physical knowledge through an affinity-based graph and counterfactual interventions. An Incomplete Multi-Modal Learning Module recovers missing modalities by leveraging semantics shared across modalities. Evaluated on PACS, the approach consistently improves over strong baselines and remains robust to missing data, and detailed ablations validate the contributions of the disentanglement, causal reasoning, and modality-completion components. RDCL is a plug-in module that improves multimodal physical commonsense reasoning and offers insights for robust embodied AI systems.
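To make the idea of an affinity-based graph with a counterfactual intervention concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The cosine-similarity affinity, the softmax normalization, the uniform-affinity intervention, and all function names are assumptions chosen for illustration. The sketch compares a "factual" aggregation of object features, weighted by affinity, against a "counterfactual" one in which the affinities are replaced by uniform weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_matrix(features):
    """Row-normalized pairwise affinities between object features (assumed form)."""
    return [softmax([cosine(f, g) for g in features]) for f in features]


def aggregate(affinity, features):
    """Mix every object's feature with the others, weighted by affinity."""
    dim = len(features[0])
    return [
        [sum(w * g[d] for w, g in zip(row, features)) for d in range(dim)]
        for row in affinity
    ]


def counterfactual_effect(features):
    """Factual aggregation minus aggregation under a uniform-affinity intervention."""
    n = len(features)
    factual = aggregate(affinity_matrix(features), features)
    uniform = [[1.0 / n] * n for _ in range(n)]  # hypothetical intervention
    counterfactual = aggregate(uniform, features)
    return [
        [f - c for f, c in zip(f_row, c_row)]
        for f_row, c_row in zip(factual, counterfactual)
    ]


# Three toy object features: the first two are similar, the third differs.
objects = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
effect = counterfactual_effect(objects)
```

In this toy setup, the difference between the two aggregations isolates what the learned relationships contribute beyond an uninformative graph. When all objects are identical, the affinities are already uniform and the effect vanishes.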

Abstract

In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physical commonsense from both video and audio input; the main challenge is imitating human reasoning ability, even when modalities are missing. Most current methods fail to take full advantage of the different characteristics of multi-modal data, and their lack of causal reasoning ability impedes progress in inferring implicit physical knowledge. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space using a disentangled sequential encoder, which adopts a variational autoencoder (VAE) trained to maximize mutual information through a contrastive loss function. Furthermore, we introduce a counterfactual learning module that augments the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. To alleviate the problem of incomplete modality data, we introduce a robust multimodal learning method that recovers the missing data by decomposing features into shared and modality-specific components. Our proposed method is a plug-and-play module that can be incorporated into any baseline, including vision-language models (VLMs). In experiments, we show that our method improves the reasoning accuracy and robustness of baseline methods and achieves state-of-the-art performance.
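The abstract does not say which contrastive loss is used. As an illustration only, the sketch below uses InfoNCE, a common contrastive objective whose minimization tightens a lower bound on mutual information. The function name `info_nce`, the dot-product similarity, and the temperature value are assumptions, not details taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax score of the positive among all candidates."""
    # The positive pair's logit comes first, followed by one logit per negative.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Stable log-sum-exp over all candidate logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Toy latent codes: the positive matches the anchor, the negatives do not.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
mismatched = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is small when the anchor scores its positive higher than every negative, and it grows when a negative is more similar to the anchor than the positive is. How the paper forms positive and negative pairs for the static and dynamic factors is not stated in the abstract, so that choice is left open here.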
