Formula-Supervised Visual-Geometric Pre-training

Ryosuke Yamada, Kensho Hara, Hirokatsu Kataoka, Koshi Makihara, Nakamasa Inoue, Rio Yokota, Yutaka Satoh

TL;DR

The paper addresses the gap between visual and geometric representation learning by proposing Formula-Supervised Visual-Geometric Pre-training (FSVGP), which pre-trains a unified transformer on synthetic, formula-driven data. It introduces VG-FractalDB, pairing fractal images and fractal point clouds with formula-supervised consistency labels to enable cross-modal supervision without real data or annotations. The approach shows competitive improvements across six tasks in image and 3D object recognition, often surpassing prior FDSL methods and offering robust cross-modal capabilities, though it may not always exceed state-of-the-art SL/SSL baselines. By leveraging fractal geometry and synthetic supervision, FSVGP demonstrates the potential of synthetic pre-training to reduce data- and ethics-related concerns while enabling joint visual-geometric understanding for downstream tasks.

Abstract

Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP's superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning. Our project website is available at https://ryosuke-yamada.github.io/fdsl-fsvgp/.

Paper Structure (27 sections, 1 equation, 7 figures, 19 tables)

Figures (7)

  • Figure 1: FSVGP enables pre-training visual and geometric modalities on a unified transformer model by constructing VG-FractalDB from a mathematical formula. VG-FractalDB consists of fractal images, fractal point clouds, and cross-modal supervision called formula-supervised consistency labels. FSVGP simultaneously inputs a fractal image and a fractal point cloud and pre-trains in classification (CLS) tasks based on a formula-supervised consistency label. We show that FSVGP improves six tasks of image and 3D object CLS, detection (DET), and segmentation (SEG).
  • Figure 2: Overview of the fractal generation process and VG-FractalDB. The fractal generation process creates paired fractal data and formula-supervised consistency labels. Initially, fractal point clouds are generated using the 3D Iterated Function System (3D-IFS). The fractal point clouds are then projected onto 2D planes to form fractal images. Simultaneously, formula-supervised consistency labels are automatically generated based on the variance of 3D coordinates, serving as cross-modality supervision. We construct the VG-FractalDB by repeating these generations.
  • Figure 3: VG-FractalDB pre-training. Left: We train a unified transformer model on VG-FractalDB. After pre-training, the same unified transformer model can be fine-tuned for image and 3D object recognition. Right: FSVGP learns visual and geometric modalities through supervised pre-training based on a formula-supervised consistency label. FSVGP can therefore train different modalities within a common label space on a unified transformer model.
  • Figure A: Examples of paired image and point cloud data in ShapeNet.
  • Figure B: Examples of paired image and point cloud data in the Visual-Geometric Perlin Noise dataset.
  • ...and 2 more figures
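
The generation process described in the Figure 2 caption (3D-IFS point cloud, 2D projection, variance-based labeling) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the number of affine maps, the coefficient range, the image size, and the variance threshold are all hypothetical. The variance test is used here as a simple acceptance check, and the shared label is just the seed that produced the IFS parameters; this is only one plausible reading of how the labels are "based on the variance of 3D coordinates".

```python
import random


def random_affine(rng, scale=0.3):
    """Hypothetical random 3D affine map (A, b).

    With |a_ij| <= 0.3 every row sum of |A| is at most 0.9, so the map is
    contractive in the max norm and the generated points stay bounded.
    """
    A = [[rng.uniform(-scale, scale) for _ in range(3)] for _ in range(3)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    return A, b


def apply_affine(A, b, p):
    """Compute A @ p + b for a 3D point p."""
    return tuple(sum(A[i][j] * p[j] for j in range(3)) + b[i] for i in range(3))


def generate_fractal_points(maps, n_points, rng, burn_in=20):
    """Chaos game: repeatedly apply a randomly chosen map of the IFS."""
    p = (0.0, 0.0, 0.0)
    points = []
    for step in range(n_points + burn_in):
        A, b = rng.choice(maps)
        p = apply_affine(A, b, p)
        if step >= burn_in:  # discard the first few transient points
            points.append(p)
    return points


def coordinate_variances(points):
    """Population variance of the x, y, and z coordinates."""
    n = len(points)
    means = [sum(p[k] for p in points) / n for k in range(3)]
    return [sum((p[k] - means[k]) ** 2 for p in points) / n for k in range(3)]


def project_to_image(points, size=64):
    """Orthographic projection onto the xy-plane, rasterized as a binary grid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    span_x = (max(xs) - x_min) or 1.0  # avoid division by zero
    span_y = (max(ys) - y_min) or 1.0
    image = [[0] * size for _ in range(size)]
    for x, y, _ in points:
        col = min(size - 1, int((x - x_min) / span_x * size))
        row = min(size - 1, int((y - y_min) / span_y * size))
        image[row][col] = 1
    return image


def make_pair(seed, n_maps=4, n_points=2048, min_variance=0.05):
    """Build one (image, point cloud, label) triple, or None if rejected.

    Assumption: a parameter set is rejected when any coordinate variance
    falls below min_variance; accepted sets share one label across modalities.
    """
    rng = random.Random(seed)
    maps = [random_affine(rng) for _ in range(n_maps)]
    points = generate_fractal_points(maps, n_points, rng)
    if min(coordinate_variances(points)) < min_variance:
        return None
    return {"label": seed, "points": points, "image": project_to_image(points)}
```

Calling `make_pair(seed)` over a range of seeds and keeping the non-`None` results yields a small paired dataset in which each image and its point cloud carry the same label, which is the property the caption attributes to VG-FractalDB.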