TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality

Yinsong Wang; Shahin Shahrampour

TL;DR

This work proposes The Attention Patch (TAP), a neural network plugin enabling cross-modal knowledge transfer from unlabeled, unpaired secondary modalities to improve the primary modality's performance. It derives a missing-information estimator based on Nadaraya-Watson kernel regression, which under linear latent transforms yields a kernelized cross-attention mechanism, and implements TAP as a trainable add-on using learnable projections $\mathbf{W}_q,\mathbf{W}_k,\mathbf{W}_v$. The approach is analyzed theoretically—showing NW regression convergence and an explicit cross-attention form—and validated empirically across four real-world datasets, demonstrating consistent generalization gains and compatibility with various backbones. This framework enables data-level knowledge transfer from readily available unlabeled cross-modal data, with practical batching strategies to manage memory and broad potential applications across domains.
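
The step from NW kernel regression to cross-attention can be illustrated with the textbook form of the estimator. The notation below is an assumption made for illustration and is not taken from the paper: $\mathbf{q}$ is a projected primary-modality representation, $\mathbf{k}_j$ and $\mathbf{v}_j$ are projections of the $j$-th reference point $\mathbf{z}_j$ from the secondary modality, and the exponential dot-product kernel is one particular choice that recovers the familiar softmax weights.

$$
\hat{m}(\mathbf{q}) = \sum_{j=1}^{n} \frac{k(\mathbf{q}, \mathbf{k}_j)}{\sum_{l=1}^{n} k(\mathbf{q}, \mathbf{k}_l)}\,\mathbf{v}_j,
\qquad
\mathbf{q} = \mathbf{W}_q^\top \mathbf{x},\quad
\mathbf{k}_j = \mathbf{W}_k^\top \mathbf{z}_j,\quad
\mathbf{v}_j = \mathbf{W}_v^\top \mathbf{z}_j .
$$

$$
k(\mathbf{q}, \mathbf{k}_j) = \exp\!\left(\mathbf{q}^\top \mathbf{k}_j\right)
\;\Longrightarrow\;
\hat{m}(\mathbf{q}) = \sum_{j=1}^{n} \operatorname{softmax}_j\!\left(\mathbf{q}^\top \mathbf{k}_j\right)\mathbf{v}_j .
$$

With other kernels the normalized weights are no longer a softmax, but the estimator keeps the same query, key, and value structure.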

Abstract

This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). This expression lays the foundation for introducing The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.
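
The equivalence between NW kernel regression and cross-attention mentioned above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the array shapes, and the choice of an exponential dot-product kernel are all assumptions made for illustration.

```python
import numpy as np

def nw_regression(queries, keys, values, kernel):
    """Nadaraya-Watson estimate: a kernel-weighted average of `values`."""
    weights = kernel(queries, keys)                        # (n_query, n_ref)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values                                # (n_query, d_v)

def exp_dot_kernel(q, k):
    """Exponential dot-product kernel (assumed here; it recovers softmax)."""
    scores = q @ k.T
    # Subtracting the row max only rescales each row, which cancels
    # when the weights are normalized; it avoids overflow in exp.
    return np.exp(scores - scores.max(axis=1, keepdims=True))

def softmax_cross_attention(x, z, W_q, W_k, W_v):
    """Unscaled softmax cross-attention of x over the reference set z."""
    scores = (x @ W_q) @ (z @ W_k).T
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ (z @ W_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))      # primary-modality representations (hypothetical sizes)
z = rng.normal(size=(10, 8))     # unlabeled secondary-modality reference set
W_q = rng.normal(size=(6, 5))    # stand-ins for the learnable projections
W_k = rng.normal(size=(8, 5))
W_v = rng.normal(size=(8, 3))

nw_out = nw_regression(x @ W_q, z @ W_k, z @ W_v, exp_dot_kernel)
attn_out = softmax_cross_attention(x, z, W_q, W_k, W_v)
print(np.allclose(nw_out, attn_out))
```

Swapping `exp_dot_kernel` for another positive kernel (for example a Gaussian) leaves `nw_regression` unchanged, which is the sense in which the module can be called a kernelized cross-attention.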

Paper Structure (24 sections, 4 theorems, 24 equations, 5 figures, 5 tables)

Introduction
Summary of Contributions
Related Literature
Cross-Modal Learning
Semi-Supervised Learning
Estimating the Missing Information
Cross-Modal NW Kernel Regression
Estimation Error Guarantee
The Attention Patch
Cross-Attention Module
Batch Training
Numerical Experiments
Performance Evaluation
Ablation Study
Conclusion and Discussion of Future Directions
...and 9 more sections

Key Result

Proposition 1

The missing information estimation formulation in eq:expectation can be approximated with kernel density estimators in eq:kde. When the kernel function $k_1(\cdot, \boldsymbol{\mu})$ in eq:kde is a density function for a distribution with mean $\boldsymbol{\mu}$, the approximation leads to a Nadaraya-Watson (NW) kernel regression estimator of the missing information.

Figures (5)

Figure 1:
Figure 2: The Attention Patch (TAP) neural network integration visualization: TAP takes the output of a layer to calculate the missing representation using reference data $\mathbf{Z}$, and the output of TAP will be concatenated with TAP input and fed to the next layer. The only modification to the original deep neural network (DNN) is increasing the input dimension of the integration layer (blue layer).
Figure 3: Simulation results on three real-world datasets. TAP integration shows a consistent performance advantage compared to other variants.
Figure 4: Reference batch size comparison on three real-world datasets. The generalization accuracy increases as the reference batch size becomes larger.
Figure 5: TAP integration with pre-trained feature extractors: The primary modality prediction model takes a meme image as input to predict the sentiment of the meme. A text set of batch size $100$ is used as the reference secondary modality in TAP. The text data goes through pre-trained distilled-RoBERTa before being used in TAP.
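
The integration described in the Figure 2 caption, combined with the reference batches mentioned in the Figure 4 and Figure 5 captions, can be sketched as follows. This is an illustrative NumPy forward pass under assumed names and sizes (`attention_patch`, `forward_with_patch`, and every dimension are hypothetical), using a softmax kernel as a stand-in; it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_patch(h, z_ref, W_q, W_k, W_v):
    """Cross-attention of hidden features h over a reference batch z_ref."""
    scores = (h @ W_q) @ (z_ref @ W_k).T
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ (z_ref @ W_v)

def forward_with_patch(x, z_pool, params, ref_batch_size):
    """One hidden layer, then the patch, then the widened integration layer."""
    h = np.maximum(x @ params["W1"], 0.0)                  # layer output (ReLU)
    # Sample a reference batch from the unlabeled secondary modality.
    idx = rng.choice(len(z_pool), size=ref_batch_size, replace=False)
    patch = attention_patch(h, z_pool[idx],
                            params["W_q"], params["W_k"], params["W_v"])
    h_aug = np.concatenate([h, patch], axis=1)             # patch output joined with patch input
    return h_aug @ params["W2"]                            # integration layer

# Hypothetical sizes chosen only for the example.
d_in, d_h, d_z, d_attn, d_patch, n_classes = 12, 16, 8, 5, 4, 3
params = {
    "W1": rng.normal(size=(d_in, d_h)),
    "W_q": rng.normal(size=(d_h, d_attn)),
    "W_k": rng.normal(size=(d_z, d_attn)),
    "W_v": rng.normal(size=(d_z, d_patch)),
    # The only change to the backbone: input width grows from d_h to d_h + d_patch.
    "W2": rng.normal(size=(d_h + d_patch, n_classes)),
}

x = rng.normal(size=(32, d_in))        # labeled primary-modality batch
z_pool = rng.normal(size=(500, d_z))   # unlabeled, unpaired secondary-modality pool
logits = forward_with_patch(x, z_pool, params, ref_batch_size=100)
print(logits.shape)
```

The reference batch size of 100 mirrors the value quoted in the Figure 5 caption; the attention score matrix has one column per reference point, so this size directly controls the memory cost of the patch.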

Theorems & Definitions (4)

Proposition 1
Theorem 1
Corollary 1
Lemma 1
