AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings

Yilin Ye, Junchao Huang, Xingchen Zeng, Jiazhi Xia, Wei Zeng

TL;DR

AKRMap addresses the interpretability gap in cross-modal embeddings by learning a 2D projection guided by kernel regression of the metric landscape. It jointly optimizes a parametric projection and an adaptive generalized kernel using the loss $L = \lambda\,\mathrm{MSE}_r + \mathrm{KL}$, where $\mathrm{MSE}_r = w_1\,\mathrm{MSE}_{vl} + w_2\,\mathrm{MSE}_{tr}$, which enables accurate contour mapping of cross-modal metrics. A key innovation is the adaptive kernel $K(\mathbf{x}, \alpha, \beta) = (1 + \alpha \|\mathbf{x}\|^{2\beta})^{-1}$, whose parameters $\alpha$ and $\beta$ are learned during training to fit complex landscapes. The resulting AKRMap provides scatterplots and multi-scale contour maps with interactive zoom and overlay. Experiments on the HPD dataset and model comparisons show improved accuracy and trustworthiness over traditional DR methods, demonstrating practical value for evaluating text-to-image generation and for human-in-the-loop analysis.
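To make the kernel and loss formulas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Nadaraya-Watson style estimator for the kernel regression (the summary does not name the estimator), treats $\mathrm{MSE}_{vl}$, $\mathrm{MSE}_{tr}$, and $\mathrm{KL}$ as precomputed scalars, and uses fixed values of $\alpha$ and $\beta$ rather than learned ones. The function names `adaptive_kernel`, `kernel_regression`, and `total_loss` are hypothetical.

```python
import numpy as np


def adaptive_kernel(diff, alpha, beta):
    """K(x, alpha, beta) = (1 + alpha * ||x||^(2*beta))^(-1), norm taken over the last axis."""
    sq_norm = np.sum(diff ** 2, axis=-1)           # ||x||^2
    return 1.0 / (1.0 + alpha * sq_norm ** beta)   # ||x||^(2*beta) == (||x||^2)^beta


def kernel_regression(queries, points, values, alpha, beta):
    """Kernel-weighted average of metric values at each query location.

    Assumption: a Nadaraya-Watson estimator. queries is (Q, 2), points is (N, 2),
    values is (N,); the result has shape (Q,).
    """
    diff = queries[:, None, :] - points[None, :, :]   # pairwise offsets, shape (Q, N, 2)
    weights = adaptive_kernel(diff, alpha, beta)      # kernel weights, shape (Q, N)
    return (weights @ values) / weights.sum(axis=1)


def total_loss(mse_vl, mse_tr, kl, lam, w1, w2):
    """L = lam * MSE_r + KL, with MSE_r = w1 * MSE_vl + w2 * MSE_tr (inputs are scalars)."""
    mse_r = w1 * mse_vl + w2 * mse_tr
    return lam * mse_r + kl


if __name__ == "__main__":
    # Toy 2D projection with one metric value per projected point (illustrative data only).
    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))
    values = np.sin(points[:, 0]) + 0.1 * rng.normal(size=200)

    # Evaluate the estimated metric landscape on a regular grid, as a contour map would.
    xs, ys = np.meshgrid(np.linspace(-2, 2, 50), np.linspace(-2, 2, 50))
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
    landscape = kernel_regression(grid, points, values, alpha=4.0, beta=1.5)
    print(landscape.reshape(xs.shape).shape)
```

With $\alpha = 1$ and $\beta = 1$ the kernel reduces to $(1 + \|\mathbf{x}\|^2)^{-1}$, the form of the Student-t kernel used in t-SNE. In this sketch, larger $\alpha$ shrinks the kernel's effective radius and larger $\beta$ makes its falloff sharper.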

Abstract

Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.

Paper Structure

This paper contains 27 sections, 9 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: CLIPScore distribution on the COCO dataset by t-SNE (a). The visualization shows dense neighboring points with significantly different metric values, causing overlapping and occlusion (b) and highly inaccurate contour mapping (c).
  • Figure 2: AKRMap is a neural network based DR method designed to learn adaptive kernel regression for visualizing cross-modal embeddings. The network integrates two key components to jointly learn data point projection and cross-modal metric estimation: 1) Kernel regression supervision, and 2) Adaptive generalized kernel. The resulting visualizations, including scatterplots and contour maps, provide a clearer and more accurate representation of the cross-modal metric distribution.
  • Figure 3: Comparison of scatterplots generated by t-SNE and AKRMap for the HPSv2 metric. Despite the visual clutter introduced by the large-scale dataset, AKRMap provides a clearer and more accurate representation of the HPSv2 metric distribution.
  • Figure 4: Contour map combined with zoom and overlay with point sampling for multiscale exploration of the HPD dataset.
  • Figure 5: Qualitative comparison of contour map visualizations of CLIPScore, HPSv2, PickScore, and Aesthetic Score distributions in the HPD dataset generated by our AKRMap and four baselines: PCA, UMAP, t-SNE, and Neuro-Visualizer.
  • ...and 10 more figures