Challenges of Multi-Modal Coreset Selection for Depth Prediction

Viktor Moskvoretskii, Narek Alvandian

TL;DR

The paper addresses the challenge of extending coreset selection to multimodal data for depth prediction, showing that unimodal coreset methods poorly capture inter-modal relationships. It adapts a state-of-the-art dataset-quantization coreset approach to multimodal inputs, using multimodal embeddings and a submodular gain to select a representative subset. Evaluations on CLEVR with RGB images and semantic masks using a MultiMAE backbone and DPT adapters reveal that coresets dramatically underperform relative to training on the full data, with only marginal gains from PCA-based reductions and none from UMAP. The results highlight a critical need for specialized multimodal coreset techniques that effectively model inter-modal relationships, and the work provides reproducible code to spur further research.

Abstract

Coreset selection methods are effective in accelerating training and reducing memory requirements but remain largely unexplored in applied multimodal settings. We adapt a state-of-the-art (SoTA) coreset selection technique for multimodal data, focusing on the depth prediction task. Our experiments with embedding aggregation and dimensionality reduction approaches reveal the challenges of extending unimodal algorithms to multimodal scenarios, highlighting the need for specialized methods to better capture inter-modal relationships.
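
The pipeline summarized above (aggregate per-modality embeddings, optionally reduce their dimensionality, then pick a subset by greedy submodular gain) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes concatenation as the aggregation step and swaps in a facility-location objective as a familiar submodular function, since the exact gain used by the dataset-quantization method is not stated in this section. All function names are hypothetical, and the random arrays stand in for real RGB and semantic-mask embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so no modality dominates by magnitude."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def aggregate_embeddings(rgb_emb, mask_emb):
    """Assumed aggregation: normalize each modality, then concatenate features."""
    return np.concatenate([l2_normalize(rgb_emb), l2_normalize(mask_emb)], axis=1)


def pca_reduce(x, n_components):
    """Project onto the top principal components via SVD of the centered data."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def greedy_facility_location(emb, budget):
    """Greedily maximize F(S) = sum_i max(0, max_{j in S} cos(x_i, x_j)).

    Stand-in submodular objective (an assumption, not the paper's gain).
    """
    unit = l2_normalize(emb)
    sim = unit @ unit.T                      # pairwise cosine similarity
    best = np.zeros(len(emb))                # best coverage of each sample so far
    chosen = np.zeros(len(emb), dtype=bool)  # marks already-selected samples
    selected = []
    for _ in range(budget):
        # Marginal gain of candidate j: total improvement in coverage over all i.
        gains = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
        gains[chosen] = -np.inf              # never pick the same sample twice
        pick = int(np.argmax(gains))
        selected.append(pick)
        chosen[pick] = True
        best = np.maximum(best, sim[:, pick])
    return selected


# Placeholder data: 200 samples with 32-d "RGB" and 16-d "mask" embeddings.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(200, 32))
mask = rng.normal(size=(200, 16))

joint = pca_reduce(aggregate_embeddings(rgb, mask), n_components=8)
coreset = greedy_facility_location(joint, budget=20)
```

Because the joint embedding is built by simple concatenation, the selection step only sees each modality's features side by side. Nothing in this sketch models how the modalities relate to each other, which is the limitation the abstract points to.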

Paper Structure

This paper contains 6 sections, 2 equations, 1 table.