Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding

Kohei Torimi, Ryosuke Yamada, Daichi Otsuka, Kensho Hara, Yuki M. Asano, Hirokatsu Kataoka, Yoshimitsu Aoki

TL;DR

TeGA is shown to bridge the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision applications.

Abstract

Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential of using generated data as training data. This naturally raises the question: can synthetic 3D data generated by generative models be used to expand limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves SoTA in zero-shot 3D classification, and uses a generative text-to-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples whose semantics and geometric shapes do not match the text. In an experiment that doubles the original dataset size using TeGA, our approach improves over the baselines, achieving zero-shot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision applications.
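
The expansion procedure described in the abstract (generate text-guided synthetic samples, discard those that fail a consistency check, and stop once the dataset has doubled) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `expand_dataset`, `generate`, `is_consistent`, and `target_ratio` are hypothetical names, and the toy generator and filter at the bottom merely stand in for a text-to-3D model and the consistency filter.

```python
from typing import Callable, Iterable, List, Tuple

# A sample pairs a text prompt with some 3D representation (kept opaque here).
Sample = Tuple[str, object]


def expand_dataset(
    real: List[Sample],
    prompts: Iterable[str],
    generate: Callable[[str], object],
    is_consistent: Callable[[str, object], bool],
    target_ratio: float = 1.0,
) -> List[Sample]:
    """Return the real samples plus filtered synthetic samples.

    target_ratio=1.0 (an assumed parameter) adds as many synthetic samples
    as there are real ones, i.e. it doubles the dataset size.
    """
    target = int(len(real) * target_ratio)
    synthetic: List[Sample] = []
    for prompt in prompts:
        if len(synthetic) >= target:
            break
        shape = generate(prompt)          # stand-in for a text-to-3D model
        if is_consistent(prompt, shape):  # stand-in for consistency filtering
            synthetic.append((prompt, shape))
    return real + synthetic


# Toy stand-ins, purely for illustration.
def toy_generate(prompt: str) -> str:
    # Pretend the generator fails on one prompt and returns noise.
    return "noise" if prompt == "a sofa" else prompt.split()[-1]


def toy_filter(prompt: str, shape: object) -> bool:
    return str(shape) in prompt


real_data = [("a chair", "real_chair"), ("a desk", "real_desk")]
expanded = expand_dataset(
    real_data, ["a lamp", "a sofa", "a mug", "a bed"], toy_generate, toy_filter
)
print(len(expanded))
```

In this toy run the "a sofa" sample is rejected by the filter, so the loop continues to the next prompt until two synthetic samples have been accepted.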

Paper Structure (14 sections, 7 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Our proposed TeGA (Text-guided Geometric Augmentation) combines text guidance with a generative text-to-3D model for highly efficient dataset expansion that substantially augments limited real data. Although we employ existing methods (e.g., Point-E) and simple text-prompting tricks, the results are noteworthy: a model trained on a 3D dataset augmented with TeGA synthetic data outperforms a ShapeNet-trained model on, e.g., Objaverse-LVIS, ModelNet40, and ScanObjectNN in the zero-shot 3D classification setting.
  • Figure 2: A visualization of synthetic 3D data generated from real 3D data and Point-E. The synthetic 3D data shows that fine geometric detail is harder to generate than in real data.
  • Figure 3: A visualization of consistency filtering. The top shows samples that passed the filter; the bottom shows samples that were filtered out. Our filtering can detect errors that arise during the generation process.
  • Figure 4: An overview of the consistency filtering process. The purpose of this process is to remove misaligned data that may cause model collapse during training. Specifically, rendered multi-view images are input into BLIP to generate captions. The captions are then summarized into a single caption by GPT-4. Finally, the quality of the generated data is evaluated by comparing the text used for generation with the generated caption through two matching methods: word-level matching and concept-level matching.
  • Figure 5: Confusion matrices when varying PE/SP. As the proportion of real data decreases, the model's predictions increasingly skew toward desk and display, leading to an overall deterioration in prediction accuracy.
  • ...and 3 more figures
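
The final step of the consistency filtering described in the Figure 4 caption, comparing the generation text with the summarized caption via word-level and concept-level matching, might look roughly like the sketch below. Several things here are assumptions rather than details from the paper: the stopword list, the use of cosine similarity over an embedding for concept-level matching, the `threshold` value, and the choice to accept a sample when either match succeeds. The captioning and summarization steps (BLIP and GPT-4 in the caption) are not called; the function simply receives the summarized caption as a string, and `toy_embed` is a hypothetical stand-in for a real text encoder.

```python
import math
import re
from typing import Callable, Sequence, Set

# Assumed stopword list; the source does not specify one.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and"}


def content_words(text: str) -> Set[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def word_level_match(prompt: str, caption: str) -> bool:
    """True if every content word of the prompt appears in the caption."""
    words = content_words(prompt)
    return bool(words) and words <= content_words(caption)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def concept_level_match(
    prompt: str,
    caption: str,
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.8,  # assumed value
) -> bool:
    """True if the embeddings of prompt and caption are similar enough."""
    return cosine(embed(prompt), embed(caption)) >= threshold


def passes_filter(
    prompt: str,
    caption: str,
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.8,
) -> bool:
    # Assumption: a sample is kept if either matching method succeeds.
    return word_level_match(prompt, caption) or concept_level_match(
        prompt, caption, embed, threshold
    )


def toy_embed(text: str) -> Sequence[float]:
    """Hypothetical 2-D 'embedding': counts of furniture and vehicle words."""
    words = content_words(text)
    furniture = {"chair", "stool", "seat", "desk"}
    vehicle = {"car", "truck", "bus"}
    return [float(len(words & furniture)), float(len(words & vehicle))]


print(passes_filter("a wooden chair", "a wooden chair with four legs", toy_embed))
print(passes_filter("a chair", "a small stool", toy_embed))
print(passes_filter("a chair", "a red car", toy_embed))
```

With the toy embedding, the first pair passes on word-level matching, the second passes only on concept-level matching (chair and stool fall in the same toy category), and the third is rejected by both.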