CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds

Keonwoo Kim, Yeongjae Cho, Taebaek Hwang, Minsoo Jo, Sangdo Han

TL;DR

CL3DOR advances 3D large multimodal modeling by addressing data quality bottlenecks through high-resolution point clouds and plausible hard negatives. It introduces a three-stage training paradigm that aligns 3D objects and scenes before applying spatial contrastive instruction tuning, augmented with an odds-ratio based auxiliary loss. Empirical results on multiple 3D scene understanding and reasoning benchmarks demonstrate state-of-the-art performance and show the critical importance of data quality, hard negatives, and the OR objective. The work offers practical guidance for building more reliable spatial reasoning in 3D LMMs and contributes dataset refinements and prompts to support future research.

Abstract

Recent research has demonstrated that Large Language Models (LLMs) are not limited to text-only tasks but can also function as multimodal models across various modalities, including audio, images, and videos. In particular, research on 3D Large Multimodal Models (3D LMMs) is making notable strides, driven by the potential of processing higher-dimensional data like point clouds. However, upon closer examination, we find that the visual and textual content within each sample of existing training datasets lacks both high informational granularity and clarity, which serves as a bottleneck for precise cross-modal understanding. To address these issues, we propose CL3DOR, Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds, designed to ensure greater specificity and clarity in both visual and textual content. Specifically, we increase the density of point clouds per object and construct informative hard negative responses in the training dataset to penalize unwanted responses. To leverage hard negative responses, we incorporate the odds ratio as an auxiliary term for contrastive learning into the conventional language modeling loss. CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks. Additionally, we demonstrate the effectiveness of CL3DOR's key components through extensive experiments.
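
The abstract describes adding an odds-ratio term to the usual language modeling loss so that hard negative responses are penalized. The sketch below shows one way such an objective could look, assuming an ORPO-style formulation: the odds of a response are computed from its length-normalized likelihood, and the auxiliary term is the negative log-sigmoid of the log odds ratio between the positive and negative response, scaled by a weight $\lambda$. The function names and the exact normalization are illustrative assumptions, not the paper's implementation.

```python
import math

def avg_log_prob(token_log_probs):
    """Length-normalized log-likelihood: mean of per-token log-probabilities."""
    return sum(token_log_probs) / len(token_log_probs)

def log_odds(avg_logp):
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0 so p < 1."""
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def or_augmented_loss(pos_token_logps, neg_token_logps, lam):
    """Return (total, nll, or_term) for one positive/negative response pair.

    Assumed form: total = NLL(positive) + lam * (-log sigmoid(log odds ratio)).
    """
    pos_avg = avg_log_prob(pos_token_logps)
    neg_avg = avg_log_prob(neg_token_logps)
    nll = -pos_avg                                   # standard language modeling loss
    log_or = log_odds(pos_avg) - log_odds(neg_avg)   # log odds ratio, positive vs negative
    or_term = -log_sigmoid(log_or)                   # small when the positive is favored
    return nll + lam * or_term, nll, or_term

# Hypothetical per-token log-probabilities for a correct answer and a hard negative.
pos = [math.log(0.9), math.log(0.8)]
neg = [math.log(0.2), math.log(0.1)]
total, nll, or_term = or_augmented_loss(pos, neg, lam=0.3)
print(f"NLL={nll:.4f}  OR term={or_term:.4f}  total={total:.4f}")
```

With `lam=0` the objective reduces to supervised fine-tuning on positive responses only, which corresponds to the $\lambda = 0$ baseline mentioned in the Figure 5 caption.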

Paper Structure (45 sections, 5 equations, 6 figures, 18 tables)

This paper contains 45 sections, 5 equations, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Comparison of training methods for 3D LMMs. The upper side shows existing 3D LMMs trained with sparse (low-resolution) point clouds and positive labels, aiming to maximize logits for correct responses. The lower side illustrates CL3DOR, which uses dense (high-resolution) point cloud input and incorporates both positive and negative labels, employing a contrastive learning approach to explicitly leverage negative responses.
  • Figure 2: (a) Examples of the three-stage training datasets used in CL3DOR. Notably, only Stage 3 includes two types of responses following an instruction for contrastive learning. (b) The process of generating hard negative responses for the 3D question answering task. We use GPT-4o to create plausible hard negatives, referencing a top-view image and scene objects.
  • Figure 3: Illustration of the proposed CL3DOR. The figure visualizes spatial contrastive instruction tuning. The objective function incorporates an odds ratio loss as an auxiliary term alongside the commonly used NLL loss for language modeling.
  • Figure 4: Resolution in point clouds: high-resolution (left) with 8,192 points per object, and low-resolution (right) with 1,024 points per object.
  • Figure 5: Impact of $\lambda$ on performance. Plots (a) and (b) illustrate the CIDEr and EM@1-refined scores for ScanQA, while plots (c) and (d) depict the CIDEr and sentence similarity (Sim) scores for Scan2Cap across varying $\lambda$ values. These results emphasize the crucial role of the OR term, weighted by $\lambda$, in enhancing performance, particularly when contrasted with the baseline of supervised fine-tuning (SFT) using only positive responses ($\lambda = 0$). The dashed red line represents the performance of CL3DOR when utilizing the probability ratio of sequence likelihood as an auxiliary term ($\lambda = 0.3$) instead of the odds ratio.
  • ...and 1 more figure
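
The Figure 4 caption contrasts 8,192 and 1,024 points per object. As a rough illustration of what that resolution difference means in practice, the sketch below draws a fixed number of points from a single object's point cloud. Uniform random sampling is an assumption made here for simplicity (farthest point sampling is another common choice), and `sample_points` is a hypothetical helper, not a function from the paper's code.

```python
import random

def sample_points(points, n_points, seed=0):
    """Return exactly n_points points from one object's point cloud.

    Samples without replacement when enough points exist; otherwise keeps all
    points and pads by resampling existing points with replacement.
    """
    rng = random.Random(seed)
    if len(points) >= n_points:
        return rng.sample(points, n_points)
    return list(points) + rng.choices(points, k=n_points - len(points))

# Hypothetical object made of 20,000 random xyz points.
gen = random.Random(42)
obj = [(gen.random(), gen.random(), gen.random()) for _ in range(20000)]

high_res = sample_points(obj, 8192)  # dense input, as in the left panel
low_res = sample_points(obj, 1024)   # sparse input, as in the right panel
print(len(high_res), len(low_res))
```

The high-resolution sample keeps eight times as many points per object, which is the extra geometric detail the caption refers to.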