SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation
Maying Shen, Nadine Chang, Sifei Liu, Jose M. Alvarez
TL;DR
The paper tackles data-scale challenges in autonomous-vehicle (AV) perception by proposing SSE, a framework that performs semantically driven data selection and enrichment without additional labeling. It uses captions generated by foundation models to cluster data by high-level scene content, prune visually redundant samples, and identify semantically distant samples in a large unlabeled pool for enrichment. Experiments on multi-camera 3D object detection show that a semantically selected subset containing 70% of the original labeled data performs nearly on par with the full dataset, and that enriching this subset can surpass the original performance without exceeding the original dataset size, with notable gains on rare object classes. Because each data point comes with a natural-language description, the approach is explainable, and the results indicate that semantic diversity, not merely raw object counts, drives downstream performance in industrial AV settings.
Abstract
In recent years, the data collected for artificial intelligence has grown to an unmanageable volume. Particularly in industrial applications such as autonomous vehicles, model-training compute budgets are being exceeded while model performance is saturating, and yet more data continues to pour in. To navigate this flood of data, we propose a framework that selects the most semantically diverse and important portion of a dataset. We then semantically enrich that portion by discovering meaningful new data in a massive unlabeled data pool. Importantly, we provide explainability by leveraging foundation models to generate semantics for every data point. We quantitatively show that our Semantic Selection and Enrichment framework (SSE) can a) maintain model performance with a smaller training dataset and b) improve model performance by enriching the smaller dataset without exceeding the original dataset size. Consequently, we demonstrate that semantic diversity is imperative for optimal data selection and model performance.
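To make the select-then-enrich idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that caption embeddings and visual embeddings have already been computed for every sample; the summary does not say which models or distance measures SSE uses. Spherical k-means, the cosine-similarity threshold `max_sim`, and the function names `semantic_clusters`, `prune_redundant`, and `enrich` are all illustrative choices introduced here.

```python
import numpy as np

def unit(x):
    """L2-normalize rows so that dot products equal cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def semantic_clusters(cap, k, iters=10):
    """Group samples by caption embedding (assumed stand-in: spherical k-means)."""
    cap = unit(cap)
    idx = [0]  # deterministic farthest-point initialization
    while len(idx) < k:
        sim = (cap @ cap[idx].T).max(axis=1)
        idx.append(int(np.argmin(sim)))
    centers = cap[idx].copy()
    for _ in range(iters):
        labels = np.argmax(cap @ centers.T, axis=1)
        for j in range(k):
            members = cap[labels == j]
            if len(members):
                centers[j] = unit(members.mean(axis=0, keepdims=True))[0]
    return np.argmax(cap @ centers.T, axis=1)

def prune_redundant(vis, labels, max_sim=0.95):
    """Within each semantic cluster, drop samples that look like one already kept."""
    vis = unit(vis)
    keep = []
    for c in np.unique(labels):
        kept_c = []
        for i in np.flatnonzero(labels == c):
            if not kept_c or (vis[kept_c] @ vis[i]).max() < max_sim:
                kept_c.append(int(i))
        keep.extend(kept_c)
    return sorted(keep)

def enrich(cap_selected, cap_pool, budget):
    """Pick up to `budget` pool samples whose captions are least similar to the selection."""
    nearest = (unit(cap_pool) @ unit(cap_selected).T).max(axis=1)
    return np.argsort(nearest, kind="stable")[:budget].tolist()

# Toy data: four labeled samples, three unlabeled pool samples (all values made up).
cap = np.array([[1, 0], [0.99, 0.1], [0, 1], [0.1, 0.99]])        # caption embeddings
vis = np.array([[1, 0, 0], [1, 0.01, 0], [0, 1, 0], [0, 0, 1]])   # visual embeddings
labels = semantic_clusters(cap, k=2)
kept = prune_redundant(vis, labels)           # sample 1 is a near-duplicate of sample 0
pool = np.array([[1, 0.05], [-1, 0], [0.05, 1]])
added = enrich(cap[kept], pool, budget=len(cap) - len(kept))
print(labels.tolist(), kept, added)
```

In this toy run the enrichment budget is set to the number of pruned samples, so the enriched set never grows beyond the original dataset size, mirroring the constraint stated in the abstract.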
