When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

Youqi Wu; Jingwei Zhang; Farzan Farnia

TL;DR

The paper fuses complementary embeddings by multiplying their kernel similarity functions, which yields a fused similarity that reflects the union of the clustering structures of the parent embeddings. It formalizes KrossFuse, a fusion method that realizes the product of the marginal kernels through the Kronecker product of their feature maps, and applies it to combining cross-modal embeddings with uni-modal ones. To address scalability, RP-KrossFuse approximates the high-dimensional Kronecker space via random projections, with theoretical guarantees and an extension to shift-invariant kernels. Empirically, RP-KrossFuse improves modality-specific performance while preserving cross-modal alignment, as demonstrated on image and text benchmarks across several embedding families (e.g., CLIP, DINOv2, S-RoBERTa, E5). The approach offers a training-free, scalable way to fuse diverse embeddings, improving unimodal representations without sacrificing cross-modal coherence.

Abstract

State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
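The link between kernel multiplication and the Kronecker product can be checked numerically. The sketch below is illustrative only and is not taken from the project code: it assumes linear (cosine) kernels, uses random matrices `A` and `B` as hypothetical stand-ins for the outputs of two embedding models, and approximates the Kronecker features with a generic Gaussian random projection formed as an elementwise product of two projected embeddings. The paper's exact RP-KrossFuse construction may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two embedding models applied to the same n inputs.
n, d1, d2 = 5, 4, 3
A = rng.normal(size=(n, d1))
B = rng.normal(size=(n, d2))

# L2-normalise rows so that the linear kernel equals cosine similarity.
A /= np.linalg.norm(A, axis=1, keepdims=True)
B /= np.linalg.norm(B, axis=1, keepdims=True)

# Product kernel: elementwise product of the two Gram matrices.
K_prod = (A @ A.T) * (B @ B.T)

# Kronecker feature map: row i equals np.kron(A[i], B[i]), dimension d1 * d2.
fused = np.einsum("ij,ik->ijk", A, B).reshape(n, d1 * d2)
K_kron = fused @ fused.T

# The two Gram matrices agree up to floating-point error.
print("exact match:", np.allclose(K_prod, K_kron))

# Random-projection approximation (an assumption of this sketch): project each
# embedding with its own Gaussian matrix, multiply elementwise, and rescale.
# In expectation, inner products of these m-dimensional features equal the
# product kernel, without ever forming the d1 * d2 Kronecker vector.
m = 20000
R1 = rng.normal(size=(d1, m))
R2 = rng.normal(size=(d2, m))
fused_rp = (A @ R1) * (B @ R2) / np.sqrt(m)
K_rp = fused_rp @ fused_rp.T

max_err = np.abs(K_rp - K_prod).max()
print("max abs error of the random-projection kernel:", max_err)
```

The exact Kronecker features grow as `d1 * d2`, which becomes impractical for embeddings with hundreds or thousands of dimensions each, whereas the projected features have a fixed size `m` chosen to trade accuracy against cost.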

Paper Structure

Figures (26)

Theorems & Definitions (7)