SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation

Maying Shen, Nadine Chang, Sifei Liu, Jose M. Alvarez

TL;DR

The paper tackles data-scale challenges in autonomous-vehicle perception by proposing SSE, a framework that performs semantically driven data selection and enrichment without additional labeling. It leverages foundation-model-generated semantic captions to cluster data by high-level scene content, prune visually redundant samples, and identify semantically distant data from a large unlabeled pool for enrichment. Empirical results on multi-camera 3D object detection show that a semantically tuned subset at 70% of the original labeled data maintains near-parity with full data, and enriching this subset can surpass the original performance without increasing dataset size, with notable gains in rare object classes. The approach provides explainability by generating natural-language semantics for each data point and demonstrates that semantic diversity, not merely raw object counts, drives better downstream performance in industrial AV settings.
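The pruning step described above can be illustrated with a minimal sketch. It assumes each sample already has a caption embedding (a plain vector) and uses a greedy cosine-similarity filter in place of the paper's clustering procedure, which this summary does not specify. The function names, the `threshold` value, and the toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_redundant(embeddings, threshold=0.95):
    """Greedily keep a sample only if no already-kept sample is too similar.

    `embeddings` maps sample id -> caption embedding (assumed input format).
    Returns (kept_ids, pruned_ids), both in insertion order.
    """
    kept, pruned = [], []
    for sid, vec in embeddings.items():
        if any(cosine(vec, embeddings[k]) >= threshold for k in kept):
            pruned.append(sid)   # semantically redundant with a kept sample
        else:
            kept.append(sid)
    return kept, pruned

# Toy caption embeddings, made up purely for illustration.
toy = {
    "intersection_a": [1.0, 0.0, 0.0],
    "intersection_b": [0.99, 0.05, 0.0],  # near-duplicate semantics
    "highway": [0.0, 1.0, 0.0],
    "pedestrian_crossing": [0.0, 0.0, 1.0],
}
kept, pruned = prune_redundant(toy)
print("kept:", kept)
print("pruned:", pruned)
```

A greedy filter like this depends on the order of the samples; a cluster-based selection, as the summary describes, would avoid that dependence at the cost of choosing a clustering method and a per-cluster budget.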

Abstract

In recent years, the data collected for artificial intelligence has grown to an unmanageable amount. Particularly within industrial applications, such as autonomous vehicles, model training computation budgets are being exceeded while model performance is saturating -- and yet more data continues to pour in. To navigate the flood of data, we propose a framework to select the most semantically diverse and important dataset portion. Then, we further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool. Importantly, we can provide explainability by leveraging foundation models to generate semantics for every data point. We quantitatively show that our Semantic Selection and Enrichment framework (SSE) can a) successfully maintain model performance with a smaller training dataset and b) improve model performance by enriching the smaller dataset without exceeding the original dataset size. Consequently, we demonstrate that semantic diversity is imperative for optimal data selection and model performance.
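The enrichment idea in (b), adding unlabeled samples that are semantically distant from the selected set while staying within the original dataset size, might be sketched as a farthest-point selection. This is an assumption made for illustration, not the paper's stated algorithm; `enrich`, the Euclidean metric, and the toy embeddings are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def enrich(selected, pool, budget):
    """Pick up to `budget` pool samples that are farthest from the current set.

    At each step, add the pool sample whose nearest neighbour in the current
    set (selected samples plus earlier additions) is most distant.
    """
    current = list(selected.values())
    if not current:
        raise ValueError("selected set must not be empty")
    remaining = dict(pool)
    added = []
    while remaining and len(added) < budget:
        best = max(
            remaining,
            key=lambda sid: min(distance(remaining[sid], v) for v in current),
        )
        added.append(best)
        current.append(remaining.pop(best))
    return added

# Toy embeddings, made up for illustration only.
selected = {"s1": [0.0, 0.0], "s2": [1.0, 0.0]}
pool = {"u1": [0.1, 0.0], "u2": [5.0, 5.0], "u3": [0.0, 3.0], "u4": [5.1, 5.0]}

original_size = 4                        # size of the full labeled set (assumed)
budget = original_size - len(selected)   # enrich without exceeding that size
added = enrich(selected, pool, budget)
print(added)
```

Because distances are re-evaluated after every addition, `u2` is skipped once its near-duplicate `u4` has been added, so the budget is spent on distinct semantics rather than on several copies of the same novel scene.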

Paper Structure (15 sections, 14 figures, 3 tables, 1 algorithm)

Figures (14)

  • Figure 1: We introduce our semantic data selection and enrichment framework (SSE) for autonomous vehicles. The framework generates semantic captions for each data point using a foundation model, capturing semantics including scene understanding (e.g., "crowded urban intersection") and crucial object interactions (e.g., "person about to cross in front of car"). (a) To create a compact dataset, we select the most semantically important portions of a curated and labeled dataset, removing visually similar scenes. (b) To enrich the dataset, we identify new important data points, which are semantically distant from our labeled dataset, from a growing unlabeled data pool. (c) With this approach, we maintain downstream 3D object detection performance using only 70% of the labeled dataset, and we can enhance model performance without increasing the original training dataset size by enriching the selected dataset.
  • Figure 2: Examples of semantic selection and enrichment. The "Pruned" samples are similar to the "Selected" samples both visually and semantically, not merely visual duplicates. The "Added" samples contribute semantics that the existing data lacks.
  • Figure 3: Semantic description with MLLMs. The highlighted phrases capture the relevant semantics.
  • Figure 4: Number of unique driving video sessions in each cluster formed with different embeddings. Compared to clusters generated from visual embeddings, semantic clusters capture more semantically similar yet visually diverse scenes across sessions.
  • Figure 5: Visualization of samples in one of our semantic clusters. The scenes are visually different but semantically similar (Pedestrians/cyclists near the ego car and likely to cross the street in front).
  • ...and 9 more figures
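
The statistic in the Figure 4 caption, the number of unique driving sessions per cluster, can be computed along the following lines. This is a minimal sketch that assumes each frame carries a cluster id and a session id; the data layout, the function name, and the sample values are hypothetical.

```python
from collections import defaultdict

def sessions_per_cluster(assignments):
    """Count unique session ids per cluster.

    `assignments` is an iterable of (cluster_id, session_id) pairs,
    one pair per frame (assumed input format).
    """
    sessions = defaultdict(set)
    for cluster_id, session_id in assignments:
        sessions[cluster_id].add(session_id)
    return {cluster: len(ids) for cluster, ids in sessions.items()}

# Made-up frame assignments for illustration only.
frames = [
    (0, "drive_01"),
    (0, "drive_01"),  # same session again, counted once
    (0, "drive_02"),
    (1, "drive_03"),
    (0, "drive_04"),
]
counts = sessions_per_cluster(frames)
print(counts)
```

Under this measure, a cluster that draws frames from many sessions groups scenes recorded at different times and places, which is the property the caption attributes to semantic clusters.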