VLMine: Long-Tail Data Mining with Vision Language Models

Mao Ye, Gregory P. Meyer, Zaiwei Zhang, Dennis Park, Siva Karthik Mustikovela, Yuning Chai, Eric M Wolff

TL;DR

This work finds that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty, and proposes a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model.

Abstract

Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
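
The keyword-frequency idea in the abstract can be illustrated with a minimal sketch. It assumes the VLM output has already been reduced to a list of keywords per image; the function name `keyword_rarity_scores` and the scoring rule (mean negative log document frequency) are assumptions made for illustration, not the formula used in the paper.

```python
import math
from collections import Counter


def keyword_rarity_scores(keywords_per_image):
    """Score each image by how rare its keywords are across the corpus.

    Hypothetical helper: higher scores mean rarer keywords.
    """
    # Document frequency: count each keyword at most once per image.
    doc_freq = Counter()
    for keywords in keywords_per_image:
        doc_freq.update(set(keywords))
    n_images = len(keywords_per_image)

    scores = []
    for keywords in keywords_per_image:
        unique = set(keywords)
        if not unique:
            # No keywords: no evidence of rarity under this rule.
            scores.append(0.0)
            continue
        # Assumed rule: average of -log(fraction of images containing the keyword).
        total = sum(-math.log(doc_freq[k] / n_images) for k in unique)
        scores.append(total / len(unique))
    return scores


corpus = [["car", "road"], ["car", "road"], ["car", "horse"]]
scores = keyword_rarity_scores(corpus)
# The image mentioning "horse" receives the highest score in this toy corpus.
```

Images can then be ranked by score, with the top-ranked ones sent for labeling. Other aggregation rules (for example, taking the rarest keyword instead of the mean) would fit the same interface.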

Paper Structure

This paper contains 27 sections, 3 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: An overview of our proposed method. First, we prompt a VLM to describe the image; the descriptions are then summarized into a set of representative keywords using a rule-based heuristic or an LLM. The frequency of the keywords is used to score the novelty of each image. Afterwards, the score can be combined with other long-tail signals, and our proposed Pareto mining is used to select the long-tail data to be labeled.
  • Figure 2: Illustration of the Pareto examples identified by Pareto mining.
  • Figure 3: Data mining experiments on ImageNet-LT and Places-LT.
  • Figure 4: Distribution of the mined data sorted by rareness (bars further to the left represent rarer classes, while bars on the right correspond to more common classes). For readability, we only show classes that have fewer than 50 images in the original "labeled" pool. The rareness of each class is quantified by the number of images for the class in the original "labeled" pool. We plot the frequency of mined examples at each level of rareness.
  • Figure 5: Correlation of the novelty scores from different algorithms on ImageNet-LT. We plot the scores from three different algorithms for each example in the unlabeled pool and project the scores to show the correlation between each pair of algorithms. As we can see, the scores from predictive entropy and variational ratio are highly correlated, while VLMine provides an orthogonal signal.
  • ...and 7 more figures
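
The captions for Figures 1 and 2 refer to Pareto mining for combining several long-tail signals. The sketch below shows one plausible reading: an example is kept if no other example scores at least as high on every signal and strictly higher on one, and successive Pareto fronts are peeled off until a labeling budget is filled. The function names, the front-peeling loop, and the index-order truncation of the last front are all assumptions, not the paper's algorithm.

```python
def pareto_front(scores, candidates):
    """Return the candidates that no other candidate dominates.

    scores[i] is a tuple of novelty scores for example i (higher = rarer).
    Example j dominates i if j >= i on every score and j > i on at least one.
    """
    front = []
    for i in candidates:
        dominated = any(
            all(sj >= si for sj, si in zip(scores[j], scores[i]))
            and any(sj > si for sj, si in zip(scores[j], scores[i]))
            for j in candidates
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def pareto_mine(scores, budget):
    """Select up to `budget` examples by peeling successive Pareto fronts.

    Assumed selection loop; the last front is truncated in index order.
    """
    remaining = list(range(len(scores)))
    selected = []
    while remaining and len(selected) < budget:
        front = pareto_front(scores, remaining)
        selected.extend(front[: budget - len(selected)])
        front_set = set(front)
        remaining = [i for i in remaining if i not in front_set]
    return selected


# Each tuple: (keyword-based score, uncertainty-based score) for one example.
scores = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5), (0.2, 0.2), (0.05, 0.05)]
picked = pareto_mine(scores, budget=4)
```

This pairwise comparison is quadratic in the number of candidates, so it is only meant to show the selection logic on small inputs. Scores from different algorithms need not share a scale, because dominance compares each signal only with itself.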