Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi

TL;DR

The paper proposes Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. It achieves performance comparable to encoder-based 3D LMMs while requiring significantly less computation and fewer parameters.

Abstract

Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
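The core idea of the tokenizer, serializing an unordered point set along a space-filling curve and then mixing tokens with an FFT instead of self-attention, can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the Morton (Z-order) curve stands in for whichever space-filling curve the paper uses, the mixing step follows the FNet-style recipe (2D FFT, keep the real part), and the names `morton_codes`, `serialize`, and `fourier_mix` are hypothetical.

```python
import numpy as np

def morton_codes(points, bits=10):
    """Quantize xyz to a 2^bits grid and interleave the bits (Z-order curve).

    Assumption: a Morton curve is used here only as a simple, well-known
    space-filling curve; the paper may use a different one.
    """
    pts = np.asarray(points, dtype=np.float64)
    mins = pts.min(axis=0)
    span = np.maximum(pts.max(axis=0) - mins, 1e-12)  # avoid division by zero
    grid = ((pts - mins) / span * (2 ** bits)).astype(np.int64)
    grid = np.minimum(grid, 2 ** bits - 1)            # clamp the max corner
    codes = np.zeros(len(pts), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((grid[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def serialize(points, feats, bits=10):
    """Sort per-point features along the curve; returns (sorted_feats, order)."""
    order = np.argsort(morton_codes(points, bits), kind="stable")
    return np.asarray(feats)[order], order

def fourier_mix(tokens):
    """FNet-style token mixing: 2D FFT over (sequence, channel), real part only.

    Costs O(N log N) in the sequence length N, versus O(N^2) for dense
    self-attention, and has no learnable parameters.
    """
    return np.fft.fft2(tokens).real

# Usage: the serialized sequence does not depend on the input point order
# (as long as no two points share a grid cell).
rng = np.random.default_rng(0)
xyz = rng.random((64, 3))
feats = rng.standard_normal((64, 8))
perm = rng.permutation(64)
a, _ = serialize(xyz, feats)
b, _ = serialize(xyz[perm], feats[perm])
mixed = fourier_mix(a)
```

Sorting by curve index gives the unordered points a canonical order, which is what makes a sequence operator such as the FFT applicable; the FFT then lets every token interact with every other token in a single pass.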

Paper Structure

This paper contains 37 sections, 20 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Overview of Fase3D's contributions. Mainstream 3D LMMs rely on computationally heavy scene encoders to extract geometric features before alignment with the LLM. In contrast, our method (Fase3D) employs a lightweight Fourier-based tokenizer to process raw point clouds directly and introduces Fourier-augmented LoRA adapters, which infuse global frequency-aware context into the LLM at negligible computational overhead.
  • Figure 2: The Fase3D pipeline. A lightweight tokenizer produces $M$ superpoint tokens, which are refined by an FFT-based context enhancer. A graph is then constructed, and a token-merging block compresses the tokens into $T$ compact 3D tokens ($T < M$). Finally, an LLM with an FFT-based global filter processes these tokens together with textual and user prompts.
  • Figure 3: Qualitative results and comparisons between Fase3D, PerLA [mei2025perla], and LL3DA [chen2024ll3da] on the ScanQA [azuma2022scanqa] dataset.
  • Figure 4: Qualitative comparison between Fase3D, PerLA [mei2025perla], and LL3DA [chen2024ll3da] on the ScanRefer [chen2020scanrefer] dataset.
  • Figure 5: Visualization of our SFC-based $k$NN graph construction via window voting. We show two representative examples of curve-guided neighbor selection.
  • ...and 1 more figure
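The curve-guided neighbor selection mentioned in the Figure 5 caption can be approximated with a short sketch. The caption does not spell out how "window voting" works, so the code below deliberately substitutes a simpler scheme and should be read as an assumption: tokens are taken to be already sorted along the space-filling curve, each token's candidate neighbors are restricted to a window of positions around it on the curve, and the `k` closest candidates in Euclidean distance are kept. The function name `curve_window_knn` and its parameters are hypothetical.

```python
import numpy as np

def curve_window_knn(sorted_xyz, k=4, window=8):
    """Approximate kNN graph for tokens already ordered along a curve.

    Assumption: this is plain window-restricted kNN, not the paper's
    window-voting procedure. Row i holds the indices of up to k nearest
    tokens among curve positions [i - window, i + window]; unused slots
    are filled with -1.
    """
    xyz = np.asarray(sorted_xyz, dtype=np.float64)
    n = len(xyz)
    neighbors = np.full((n, k), -1, dtype=np.int64)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        cand = np.arange(lo, hi)
        cand = cand[cand != i]                       # no self-loops
        dist = np.linalg.norm(xyz[cand] - xyz[i], axis=1)
        nearest = cand[np.argsort(dist, kind="stable")[:k]]
        neighbors[i, :len(nearest)] = nearest
    return neighbors

# Usage on five tokens lying on a line, already in curve order.
line = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0], [10, 0, 0]], dtype=float)
graph = curve_window_knn(line, k=2, window=2)
```

Restricting candidates to a curve window keeps graph construction at O(N · window) distance evaluations rather than the O(N^2) of exhaustive search, at the price of missing neighbors that are close in space but far apart on the curve.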