CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Shoubin Yu; Jaehong Yoon; Mohit Bansal

TL;DR

CREMA tackles the efficiency and flexibility bottlenecks in multimodal video-language reasoning by building a modular fusion framework on top of a frozen vision-language backbone. It introduces a Multimodal Q-Former with modality-specific MMQA adapters and a self-gated fusion mechanism to fuse diverse inputs (video, audio, depth, flow, normals, touch, thermal) with minimal parameter updates. A modality-sequential training regime with adaptive early exit further improves training efficiency and balance across modalities. Across seven video reasoning benchmarks, CREMA achieves state-of-the-art or comparable performance while reducing trainable parameters by over 90%, demonstrating strong zero-shot and few-shot adaptability with broad modality support.

Abstract

Despite impressive advances, recent multimodal reasoning approaches remain limited in flexibility and efficiency: these models typically process only a few fixed modality inputs and require updates to numerous parameters. This paper tackles these challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate any new modality to enhance video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio, thermal heatmap, and touch map) from given videos without extra human annotation by leveraging sensors or existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules, one associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a novel progressive multimodal fusion design supported by a lightweight fusion module and a modality-sequential training strategy. It helps compress information across the various assisting modalities, maintaining computational efficiency in the LLM while improving performance. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including conventional VideoQA and Video-Audio/3D/Touch/Thermal QA, and achieve better or equivalent performance against strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA, while reducing trainable parameters by over 90%. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.
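The idea of compressing assisting-modality tokens before they reach the LLM can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sigmoid gate, the channel-wise concatenation, the function name self_gated_fusion, and the weight matrices w_proj and w_gate are all assumptions made for illustration. It only shows how a lightweight gated projection could keep the number of LLM input tokens fixed no matter how many assisting modalities are present.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def self_gated_fusion(video_q, assist_qs, w_proj, w_gate):
    """Hypothetical gated fusion of per-modality query tokens.

    video_q:   (n_query, d) query tokens for the main video modality.
    assist_qs: list of (n_query, d) query tokens, one per assisting modality.
    w_proj:    (d * len(assist_qs), d) projection weights (assumed).
    w_gate:    (d * len(assist_qs), d) gate weights (assumed).
    """
    # Concatenate assisting-modality queries along the channel axis,
    # so the token count stays n_query regardless of modality count.
    stacked = np.concatenate(assist_qs, axis=-1)
    # Project back to width d and modulate with a sigmoid gate
    # computed from the same stacked input ("self-gated").
    fused = sigmoid(stacked @ w_gate) * (stacked @ w_proj)
    # Video tokens pass through unchanged; fused tokens are appended.
    return np.concatenate([video_q, fused], axis=0)


rng = np.random.default_rng(0)
n_query, d, n_assist = 32, 8, 3
video_q = rng.normal(size=(n_query, d))
assist_qs = [rng.normal(size=(n_query, d)) for _ in range(n_assist)]
w_proj = rng.normal(size=(d * n_assist, d))
w_gate = rng.normal(size=(d * n_assist, d))
tokens = self_gated_fusion(video_q, assist_qs, w_proj, w_gate)
```

In this sketch the LLM would receive 2 * n_query tokens whether one or several assisting modalities are fused, which is the efficiency property the abstract describes.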

Paper Structure (34 sections, 3 equations, 5 figures, 23 tables)

Introduction
Related Works
Learning with Multiple Modalities.
Multimodal Large Language Model.
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Preliminaries: Q-Former
Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
Multimodal Encoders.
Multimodal Q-Former.
Self-gated Multimodal Query Fusion.
Modality-Sequential and Modular Training of CREMA
Experiments
Experimental Setup
Implementation Details.
Main Experimental Results
...and 19 more sections

Figures (5)

Figure 1: Overview of the CREMA architecture & training. Left: Multimodal encoders, Q-Former, and LLM are kept frozen in the process. For each modality input, we extract tokens using a corresponding modality-specific adaptation module. Then, we employ the fusion module to blend and compress the obtained multimodal tokens. In the end, the LLM uses modality-fusion tokens to generate responses. Right: We present a modality-sequential training and modality-adaptive early exit strategy, further boosting the training efficiency while allowing faster modality adaptation.
Figure 2: Modality-adaptive Early Exit of CREMA on SQA3D. CREMA stops updating the MMQA modules for a specific modality once the corresponding indicator value reaches $1.0$.
Figure 3: Qualitative examples for multimodal compositional video reasoning from SQA3D (Left) and MUSIC-AVQA (Right). The correct predictions are marked by green check marks.
Figure 4: Qualitative examples for multimodal compositional video reasoning from SQA3D (Left) and MUSIC-AVQA (Right). The correct predictions are marked by green checks.
Figure 5: Visualization of attention maps under different modality combinations. Top: with audio and video. Bottom: with audio, optical flow, and video. We omit audio for simplicity. Red boxes highlight attention regions that may affect the model's prediction.
