TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality
Yinsong Wang, Shahin Shahrampour
TL;DR
This work proposes The Attention Patch (TAP), a neural network plugin that transfers knowledge from an unlabeled, unpaired secondary modality to improve supervised performance in the primary modality. It derives a missing-information estimator based on Nadaraya-Watson (NW) kernel regression, which under linear latent transforms yields a kernelized cross-attention mechanism, and implements TAP as a trainable add-on using learnable projections $\mathbf{W}_q,\mathbf{W}_k,\mathbf{W}_v$. The approach is analyzed theoretically, showing NW regression convergence and an explicit cross-attention form, and validated empirically on four real-world datasets, demonstrating consistent generalization gains and compatibility with various backbones. The framework enables data-level knowledge transfer from readily available unlabeled cross-modal data, with practical batching strategies to manage memory and potential applications across domains.
Abstract
This paper studies a cross-modal learning framework in which the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach to missing-information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can in turn be expressed as a kernelized cross-attention module under a linear transformation. This expression lays the foundation for The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical experiments on real-world datasets to show that TAP can provide statistically significant improvements in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.
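The link between NW kernel regression and cross-attention can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the exponential dot-product kernel, the function name `nw_cross_attention`, the array shapes, and the random projections standing in for trained $\mathbf{W}_q,\mathbf{W}_k,\mathbf{W}_v$ are all assumptions made for illustration. Under that kernel choice, the NW weighted average over secondary-modality samples coincides with softmax cross-attention.

```python
import numpy as np

def nw_cross_attention(x_primary, x_secondary, W_q, W_k, W_v):
    """NW estimate with the (assumed) kernel exp(<q, k>), i.e. softmax cross-attention.

    x_primary:   (n, d_p) primary-modality samples, used as queries
    x_secondary: (m, d_s) unlabeled, unpaired secondary-modality samples
    """
    q = x_primary @ W_q      # (n, d) queries
    k = x_secondary @ W_k    # (m, d) keys
    v = x_secondary @ W_v    # (m, d_v) values
    scores = q @ k.T         # (n, m) kernel exponents
    # Subtracting the row maximum leaves the normalized weights unchanged
    # and avoids overflow in exp.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # NW weights, rows sum to 1
    return weights @ v, weights

# Hypothetical sizes; random matrices stand in for learned projections.
rng = np.random.default_rng(0)
x_primary = rng.normal(size=(4, 8))
x_secondary = rng.normal(size=(16, 5))
W_q = 0.1 * rng.normal(size=(8, 3))
W_k = 0.1 * rng.normal(size=(5, 3))
W_v = 0.1 * rng.normal(size=(5, 2))

out, weights = nw_cross_attention(x_primary, x_secondary, W_q, W_k, W_v)
```

Each row of `out` is a kernel-weighted average of projected secondary-modality samples, which is the role the estimated missing information plays for a primary-modality input. In a trainable add-on the three projections would be learned jointly with the backbone rather than fixed as here.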
