Expressive and Scalable Quantum Fusion for Multimodal Learning

Tuyen Nguyen, Trong Nghia Hoang, Phi Le Nguyen, Hai L. Vu, Truong Cong Thang

TL;DR

This paper introduces the Quantum Fusion Layer (QFL), a hybrid quantum–classical approach for multimodal fusion that uses quantum signal processing to encode high-order cross-modal interactions with linear parameter growth. It provides an expressivity theorem, a parameter-scaling analysis, and a separation result against low-rank tensor methods, complemented by empirical evidence showing QFL's advantages in high-modality tasks. The architecture combines multimodal state preparation, a parameterized quantum circuit, and randomized measurements to realize degree-$P$ polynomial interactions across modalities, and it is trained end-to-end with classical optimization. Although the results are proof-of-concept and obtained only in simulation, they indicate both theoretical and practical potential for scalable quantum fusion, and the paper outlines open challenges concerning hardware, data scale, and broader model comparisons in multimodal learning.

Abstract

The aim of this paper is to introduce a quantum fusion mechanism for multimodal learning and to establish its theoretical and empirical potential. The proposed method, called the Quantum Fusion Layer (QFL), replaces classical fusion schemes with a hybrid quantum-classical procedure that uses parameterized quantum circuits to learn entangled feature interactions without requiring exponential parameter growth. Supported by quantum signal processing principles, the quantum component efficiently represents high-order polynomial interactions across modalities with linear parameter scaling, and we provide a separation example between QFL and low-rank tensor-based methods that highlights potential quantum query advantages. In simulation, QFL consistently outperforms strong classical baselines on small but diverse multimodal tasks, with particularly marked improvements in high-modality regimes. These results suggest that QFL offers a fundamentally new and scalable approach to multimodal fusion that merits deeper exploration on larger systems.

Paper Structure

This paper contains 36 sections, 1 theorem, 31 equations, 3 figures, 2 tables.

Key Result

Corollary 1

Let $\mathbf{F}_P(\mathbf{x})$ be the quantum circuit defined in eqn: mqsvt. For any $P > 0$, the circuit can approximate any matrix-valued polynomial of degree at most $P$ on $\mathbb{T}^{Md}$, under the constraint $\det(\mathbf{F}_P(\mathbf{x})) = 1$. The required parameter complexity depends on t

Figures (3)

  • Figure 1: Hybrid Architecture of Quantum Fusion Layer (QFL) for Multimodal Learning. The QFL consists of three key components: (1) multimodal superposition state preparation $\mathbf{S}(\mathbf{x})$, (2) parameterized quantum circuits $\mathbf{U}(\boldsymbol{\theta})$, and (3) a measurement module. The sequence $\mathbf{U}(\boldsymbol{\theta}), \mathbf{S}(\mathbf{x})$ is repeated $P$ times to construct a degree-$P$ multivariate polynomial over the input modalities.
  • Figure 2: Outputs of Quantum Fusion Layer on $\mathbf{x}_1 = e^{-i\theta_1}, \mathbf{x}_2 = e^{-i\theta_2}$ for $\theta_1, \theta_2 \in [0, 2\pi]$ with increasing depth $P=1, 2, 6$.
  • Figure 3: Impact of the number of modalities on QFL. QFL performance improves as the number of input modalities increases.
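
The interleaving described in the Figure 1 caption can be illustrated with a small classical simulation. The sketch below is a toy and not the authors' circuit: it assumes two scalar modalities, one qubit per modality, an encoding $\mathbf{S}(\mathbf{x})$ built from $R_Z$ rotations, and a trainable block $\mathbf{U}(\boldsymbol{\theta})$ made of $R_Y$, $R_Z$, and a CNOT. The names `toy_fusion`, `encode`, `trainable`, and `max_frequency` are hypothetical. It shows two properties the captions refer to: the measured output is a trigonometric polynomial whose degree in each input is at most $P$, and the number of trainable angles in this toy grows linearly with $P$.

```python
import numpy as np

def rz(angle):
    # Single-qubit Z rotation: diag(e^{-i a/2}, e^{+i a/2}).
    return np.array([[np.exp(-0.5j * angle), 0.0],
                     [0.0, np.exp(0.5j * angle)]])

def ry(angle):
    # Single-qubit Y rotation.
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

# Entangling gate (control: qubit 0, target: qubit 1) and observable Z on qubit 0.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
Z0 = np.kron(np.diag([1.0, -1.0]), np.eye(2))

def encode(x):
    # Assumed S(x): one RZ rotation per modality, one qubit per modality.
    return np.kron(rz(x[0]), rz(x[1]))

def trainable(theta):
    # Assumed U(theta): theta has shape (2, 2), holding (RY, RZ) angles per qubit,
    # followed by a CNOT so that the two modalities can interact.
    local = np.kron(rz(theta[0, 1]) @ ry(theta[0, 0]),
                    rz(theta[1, 1]) @ ry(theta[1, 0]))
    return CNOT @ local

def toy_fusion(x, thetas):
    # thetas has shape (P + 1, 2, 2): an initial trainable block, then P
    # repetitions of (encoding, trainable block).
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0
    state = trainable(thetas[0]) @ state
    for layer in thetas[1:]:
        state = trainable(layer) @ (encode(x) @ state)
    # Expectation value of Z on qubit 0 (real by construction).
    return float(np.real(np.vdot(state, Z0 @ state)))

def max_frequency(thetas, x2=0.3, n=64, tol=1e-9):
    # Largest Fourier frequency in x1 present in the output, estimated by FFT
    # over one period with the second modality held fixed.
    grid = 2 * np.pi * np.arange(n) / n
    values = np.array([toy_fusion((x1, x2), thetas) for x1 in grid])
    coeffs = np.abs(np.fft.rfft(values)) / n
    nonzero = np.nonzero(coeffs > tol)[0]
    return int(nonzero.max()) if nonzero.size else 0
```

For example, drawing `thetas = np.random.default_rng(0).normal(size=(P + 1, 2, 2))` and calling `max_frequency(thetas)` for increasing `P` shows the highest frequency in $x_1$ staying at or below $P$, while `thetas.size` equals $4(P+1)$. Each additional repetition adds one unit of frequency per modality at a fixed parameter cost, which is the intuition behind depth controlling polynomial degree in this toy setting.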

Theorems & Definitions (2)

  • Corollary 1: Parameter Scaling
  • Proof