AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings
Yilin Ye, Junchao Huang, Xingchen Zeng, Jiazhi Xia, Wei Zeng
TL;DR
AKRMap addresses the interpretability gap in cross-modal embeddings by learning a 2D projection guided by kernel regression of the metric landscape. It jointly optimizes a parametric projection and an adaptive generalized kernel under the loss $L = \lambda \, \mathrm{MSE}_r + \mathrm{KL}$, where $\mathrm{MSE}_r = w_1 \, \mathrm{MSE}_{vl} + w_2 \, \mathrm{MSE}_{tr}$, which enables accurate contour mapping of cross-modal metrics. A key innovation is the adaptive kernel $K(\mathbf{x}, \alpha, \beta) = (1 + \alpha \|\mathbf{x}\|^{2\beta})^{-1}$, whose parameters $\alpha$ and $\beta$ are learned during training to fit complex landscapes. The resulting AKRMap provides scatterplots and multi-scale contour maps with interactive zoom and overlay. Experiments on HPD and model comparisons show improved accuracy and trustworthiness over traditional DR methods, demonstrating practical value for evaluating text-to-image generation and for human-in-the-loop analysis.
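To make the two formulas above concrete, here is a minimal sketch in plain Python. It is not taken from the AKRMap code base: the function names, the default weights, and the treatment of $\mathrm{MSE}_{vl}$, $\mathrm{MSE}_{tr}$, and $\mathrm{KL}$ as precomputed scalars are assumptions made for illustration, since this summary does not define how those terms are computed.

```python
def adaptive_kernel(sq_dist, alpha, beta):
    """Evaluate K = 1 / (1 + alpha * ||x||^(2*beta)).

    sq_dist is the squared norm ||x||^2, so ||x||^(2*beta) = sq_dist ** beta.
    alpha and beta are plain floats here; in the paper they are learned.
    """
    return 1.0 / (1.0 + alpha * sq_dist ** beta)


def combined_loss(mse_vl, mse_tr, kl, lam=1.0, w1=0.5, w2=0.5):
    """Combine the terms as L = lam * (w1 * MSE_vl + w2 * MSE_tr) + KL.

    The inputs are assumed to be already-computed scalars; the default
    values of lam, w1 and w2 are placeholders, not values from the paper.
    """
    mse_r = w1 * mse_vl + w2 * mse_tr
    return lam * mse_r + kl


# With alpha = beta = 1 the kernel reduces to 1 / (1 + ||x||^2),
# the Student-t kernel used in the low-dimensional space of t-SNE.
print(adaptive_kernel(4.0, alpha=1.0, beta=1.0))  # 0.2
```

Larger values of $\alpha$ make the kernel narrower, while $\beta$ controls how quickly the tails decay, which is presumably what lets the kernel adapt to landscapes of different smoothness.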
Abstract
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
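As an illustration of what "kernel regression of the metric landscape in the projection space" can look like, the sketch below estimates a metric value at arbitrary 2D locations from already-projected points. It assumes the Nadaraya-Watson form of kernel regression, which the abstract does not specify, and uses fixed values for the kernel parameters; `kernel_regress` and `metric_grid` are hypothetical names, not functions from the AKRMap repository.

```python
def kernel_regress(query, points, values, alpha=1.0, beta=1.0):
    """Nadaraya-Watson estimate of the metric at a 2D query location.

    points: list of (x, y) projected coordinates.
    values: metric value (e.g. a CLIPScore) attached to each point.
    Weights use the kernel 1 / (1 + alpha * d^(2*beta)), d = distance.
    """
    num = 0.0
    den = 0.0
    for (px, py), v in zip(points, values):
        sq_dist = (query[0] - px) ** 2 + (query[1] - py) ** 2
        w = 1.0 / (1.0 + alpha * sq_dist ** beta)
        num += w * v
        den += w
    return num / den


def metric_grid(points, values, xs, ys, alpha=1.0, beta=1.0):
    """Evaluate the regression on a grid; rows follow ys, columns follow xs."""
    return [[kernel_regress((x, y), points, values, alpha, beta) for x in xs]
            for y in ys]


# Two projected points with metric values 0 and 1.
pts = [(0.0, 0.0), (2.0, 0.0)]
vals = [0.0, 1.0]
grid = metric_grid(pts, vals, xs=[0.0, 1.0, 2.0], ys=[0.0])
print(grid[0])  # estimates rise from the low-value point to the high-value one
```

A grid like this is the kind of input a contour plot needs. In a training setting, the gap between such estimates and the true metric values would serve as the regression error, with the kernel parameters updated together with the projection.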
