AEye: A Visualization Tool for Image Datasets

Florian Grötschla, Luca A. Lanzendörfer, Marco Calzavara, Roger Wattenhofer

TL;DR

AEye addresses the challenge of understanding large image datasets by offering a scalable visualization that preserves semantic structure through CLIP-based embeddings. The approach combines a 2D projection via UMAP, a layered tiling scheme to display representative images, and semantic search with captions powered by CLIP and LLaVA, all backed by a vector database such as Milvus. Key contributions include the layered tiling method for scalable overview, the integration of text and image semantic search, and the ability to generate AI captions to contextualize images. The tool facilitates bias and imbalance detection and supports data-driven dataset curation, with open-source code and a demonstration site for deployment.

Abstract

Image datasets serve as the foundation for machine learning models in computer vision, significantly influencing model capabilities, performance, and biases alongside architectural considerations. Therefore, understanding the composition and distribution of these datasets has become increasingly crucial. To address the need for intuitive exploration of these datasets, we propose AEye, an extensible and scalable visualization tool tailored to image datasets. AEye utilizes a contrastively trained model to embed images into semantically meaningful high-dimensional representations, facilitating data clustering and organization. To visualize the high-dimensional representations, we project them onto a two-dimensional plane and arrange images in layers so users can seamlessly navigate and explore them interactively. AEye also supports semantic search with both text and image queries. We open-source the AEye codebase and provide a simple configuration for adding datasets.
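
To make the semantic search idea concrete, here is a minimal sketch of nearest-neighbor retrieval in an embedding space. It is an illustration under stated assumptions, not AEye's implementation: the three-dimensional vectors, the file names, and the `top_k` helper are hypothetical stand-ins for CLIP embeddings, and a brute-force cosine-similarity scan replaces the vector database.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, index, k=3):
    """Return the names of the k entries most similar to the query (brute force)."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest cosine similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy "embeddings"; a real system would store CLIP image vectors.
index = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "horse.jpg": [0.1, 0.9, 0.0],
    "dog_and_horse.jpg": [0.6, 0.6, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text query such as "a dog with a horse".
query = [0.5, 0.5, 0.0]
best = top_k(query, index, k=1)
print(best)
```

Because text and image queries are encoded into the same space, the same lookup serves both. A vector database performs this search at scale, typically with an approximate index rather than a full scan.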

Paper Structure (9 sections, 4 figures, 1 algorithm)


Figures (4)

  • Figure 1: Architecture overview of AEye. Images are embedded with CLIP, stored in a vector database, and projected to a two-dimensional space with a UMAP projection (\ref{sec:clip_and_umap}). The resulting positions are used for the visualization and a tiling and clustering module that computes representatives for each layer (\ref{sec:clustering_and_tiling}). The semantic search takes text or image queries and uses CLIP to encode them. The vector database is used to find the nearest neighbor in the embedding space (\ref{sec:search_and_captions}).
  • Figure 2: Visualization of the tiling hierarchy. Representatives for level 0 (blue dots) are obtained by clustering all points with k-means. Only the points closest to the centroids (crosses) are retained. In the next level, a k-means clustering is computed on every sub-tile, with the restriction of fixed centroids for the positions of representatives from the previous layers. Again, the closest points to the centroids are retained. This process finishes when all points can be kept for one level. Pseudocode can be found in \ref{alg:tiling}.
  • Figure 3: AEye view of the MNIST dataset. We observe that the digits are clearly separated by the projected CLIP embeddings, resulting in a meaningful clustering of the dataset. Similarly, the CelebA-HQ dataset shows a clear distinction between men and women.
  • Figure 4: View of the application when searching for "a dog with a horse." The nearest neighbors in the embedding space are presented below the search result. In addition to metadata provided by the dataset, an AI-generated caption of the image is shown.
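
The tiling hierarchy described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: it assumes 2D positions already scaled to the unit square, a quadtree-style split into 2^level × 2^level sub-tiles, and at most `k` representatives per tile. The names `tile_representatives` and `build_layers` and all parameters are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def tile_representatives(points, ids, k, fixed_ids, iters=10, seed=0):
    """k-means on one tile; centroids at earlier representatives stay fixed."""
    rng = random.Random(seed)
    fixed = [points[i] for i in fixed_ids]
    fixed_set = set(fixed_ids)
    candidates = [i for i in ids if i not in fixed_set]
    n_free = max(0, k - len(fixed))
    free = [points[i] for i in rng.sample(candidates, n_free)]
    for _ in range(iters):
        centroids = fixed + free
        buckets = [[] for _ in centroids]
        for i in ids:
            nearest = min(range(len(centroids)),
                          key=lambda c: dist2(points[i], centroids[c]))
            buckets[nearest].append(points[i])
        # Only the free centroids move to the mean of their assigned points.
        for j, bucket in enumerate(buckets[len(fixed):]):
            if bucket:
                free[j] = (sum(p[0] for p in bucket) / len(bucket),
                           sum(p[1] for p in bucket) / len(bucket))
    reps = set(fixed_ids)  # earlier representatives are always retained
    for c in free:
        reps.add(min(ids, key=lambda i: dist2(points[i], c)))
    return reps


def build_layers(points, k=4, max_levels=12):
    """Return one set of point indices per level; stops once a level keeps all."""
    layers, kept = [], set()
    for level in range(max_levels):
        n = 2 ** level
        tiles = {}
        for i, (x, y) in enumerate(points):
            key = (min(int(x * n), n - 1), min(int(y * n), n - 1))
            tiles.setdefault(key, []).append(i)
        layer, complete = set(), True
        for ids in tiles.values():
            if len(ids) <= k:
                layer.update(ids)  # small tile: every point can be shown
            else:
                complete = False
                fixed_ids = [i for i in ids if i in kept]
                layer |= tile_representatives(points, ids, k, fixed_ids)
        layers.append(layer)
        kept = layer
        if complete:
            break
    return layers


if __name__ == "__main__":
    rng = random.Random(42)
    demo = [(rng.random(), rng.random()) for _ in range(100)]
    print([len(layer) for layer in build_layers(demo)])
```

Keeping earlier representatives as fixed centroids means an image shown at a coarse zoom level remains visible at every finer level, so zooming in only adds images and never swaps them out.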