When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

Youqi Wu, Jingwei Zhang, Farzan Farnia

TL;DR

The paper tackles the problem of unifying complementary embeddings by exploiting kernel multiplication, which yields a fused similarity that reflects the union of parent clustering structures. It formalizes KrossFuse, a Kronecker-product-based fusion that links the product of marginal kernels to a Kronecker feature map, and extends it to cross-modal versus uni-modal combinations. To address scalability, RP-KrossFuse approximates the high-dimensional Kronecker space via random projections, with theoretical guarantees and extensions to shift-invariant kernels. Empirically, RP-KrossFuse enhances modality-specific performance while preserving cross-modal alignment, demonstrated on image and text benchmarks and across several embedding families (e.g., CLIP, DINOv2, S-RoBERTa, E5). The approach offers a training-free, scalable path to fuse diverse embeddings, enabling improved unimodal representations without sacrificing cross-modal coherence.
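The link between a product of kernels and a Kronecker feature map can be checked numerically. The sketch below is a minimal illustration, assuming linear kernels and random matrices as hypothetical stand-ins for the outputs of two embedding models; none of the names come from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two embedding models evaluated on the same 5 samples.
n, d1, d2 = 5, 4, 3
phi1 = rng.standard_normal((n, d1))
phi2 = rng.standard_normal((n, d2))

# Marginal linear kernel matrices of each embedding.
K1 = phi1 @ phi1.T
K2 = phi2 @ phi2.T

# Kronecker-fused features: row i equals np.kron(phi1[i], phi2[i]), of size d1 * d2.
fused = np.einsum("ia,ib->iab", phi1, phi2).reshape(n, d1 * d2)

# The kernel of the fused features should equal the elementwise product K1 * K2.
K_fused = fused @ fused.T
identity_holds = np.allclose(K_fused, K1 * K2)
print(identity_holds)
```

The fused dimension is `d1 * d2`, which is why the explicit Kronecker map becomes expensive for realistic embedding sizes.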

Abstract

State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
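To make the scalability point concrete, the following sketch approximates the product kernel without ever forming the Kronecker features. It assumes one simple scheme, the elementwise product of two independent Gaussian random projections, which gives an unbiased estimate of the product of linear kernels; this is an illustrative assumption and not necessarily the exact construction used in RP-KrossFuse. The dimensions, the sketch size `m`, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d1, d2 = 5, 256, 256   # explicit Kronecker features would have 256 * 256 = 65536 dims
m = 8192                  # hypothetical sketch dimension, much smaller than d1 * d2


def unit_rows(X):
    """Normalize each row to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


# Hypothetical stand-ins for two embedding models on the same samples.
phi1 = unit_rows(rng.standard_normal((n, d1)))
phi2 = unit_rows(rng.standard_normal((n, d2)))

# Exact product kernel, computed from the marginal linear kernels.
K_exact = (phi1 @ phi1.T) * (phi2 @ phi2.T)

# Assumed scheme: project each embedding with its own Gaussian matrix,
# then multiply the projections elementwise and rescale by 1 / sqrt(m).
R1 = rng.standard_normal((m, d1))
R2 = rng.standard_normal((m, d2))
sketch = (phi1 @ R1.T) * (phi2 @ R2.T) / np.sqrt(m)

# Inner products of the sketches approximate the product kernel.
K_approx = sketch @ sketch.T
max_abs_error = np.max(np.abs(K_approx - K_exact))
print(sketch.shape, max_abs_error)
```

Under this scheme the approximation error shrinks roughly as `1 / sqrt(m)`, so `m` trades accuracy against the size of the fused representation.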

Paper Structure

This paper contains 30 sections, 4 theorems, 34 equations, 26 figures, 11 tables, 2 algorithms.

Key Result

Proposition 1

Consider feature maps $\phi_1:\mathcal{Z}_1\rightarrow \mathbb{R}^{d_1},\, \phi_2:\mathcal{Z}_2\rightarrow \mathbb{R}^{d_2}$ and their corresponding kernel functions $k_1,\,k_2$. Then, given the kernel functions defined in Eq: Kernel Definition Two Embeddings, the product kernel function $k_{\gamma_

Figures (26)

  • Figure 1: Heatmaps of RBF kernel similarity matrices for an image dataset with four ground-truth clusters (two dog classes from ImageNet and two traffic-sign classes from GTSRB): (left) $K_1$ for CLIP, (middle) $K_2$ for DINOv2, (right) the elementwise product $K_1\odot K_2$, which is the kernel matrix of the Kronecker product of the CLIP and DINOv2 embeddings. Unlike CLIP or DINOv2 alone, the Kronecker product separates all four image classes.
  • Figure 2: Kernel similarity heatmaps for (text, image) data with 6 underlying clusters. While the kernel matrix of the concatenated CLIP text and DINOv2 image embeddings blurs cluster boundaries, the Hadamard product of the kernel matrices (corresponding to the Kronecker-fused embedding) separates all 6 groups.
  • Figure 3: The Kronecker-product fusion of embeddings in the proposed KrossFuse. RP-KrossFuse (the random-projection implementation) applied to CLIP and DINOv2 improves the average few-shot classification accuracy over CLIP on 9 benchmark image datasets.
  • Figure 4: Clustering results and kernel matrix heatmaps for CLIP, DINOv2, and KrossFuse on ImageNet dog breeds and the GTSRB dataset. While CLIP could not fully separate all the dog categories and DINOv2 struggled to cluster the traffic signs, the KrossFuse fusion captured all the clusters.
  • Figure 5: The cosine similarity distributions in MSCOCO.
  • ...and 21 more figures
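The clustering effect described in the captions of Figures 1 and 4 can be mimicked on synthetic data. The sketch below is a toy illustration only: the two "embeddings" are hypothetical 2-D point clouds constructed so that each one merges a different pair of classes, and the helper names `rbf_kernel` and `num_groups` are not from the paper.

```python
import numpy as np


def rbf_kernel(X, sigma=1.0):
    """RBF (Gaussian) kernel matrix for the rows of X."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def num_groups(K, threshold=0.5):
    """Count the distinct row patterns of the thresholded kernel matrix."""
    return len(np.unique((K > threshold).astype(int), axis=0))


rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 10)  # four ground-truth classes, 10 samples each

# Hypothetical embedding A merges classes 0 and 1; embedding B merges classes 2 and 3.
centers_a = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
centers_b = np.array([[5.0, 0.0], [0.0, 5.0], [0.0, 0.0], [0.0, 0.0]])
emb_a = centers_a[labels] + 0.05 * rng.standard_normal((40, 2))
emb_b = centers_b[labels] + 0.05 * rng.standard_normal((40, 2))

K_a = rbf_kernel(emb_a)
K_b = rbf_kernel(emb_b)
K_prod = K_a * K_b  # Hadamard product = kernel of the Kronecker-fused features

groups_a = num_groups(K_a)
groups_b = num_groups(K_b)
groups_prod = num_groups(K_prod)
print(groups_a, groups_b, groups_prod)
```

On this synthetic data each individual kernel shows only three blocks, while the product kernel shows four, because two samples stay similar under the product only if both embeddings consider them similar.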

Theorems & Definitions (7)

  • Proposition 1
  • Remark 1
  • Proposition 2
  • Remark 2
  • Theorem 1
  • Theorem 2
  • proof