OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance

Chaoyi Wang, Baoqing Li, Xinhan Di

TL;DR

This work tackles occluded-object understanding in multimodal large language models by introducing OCC-MLLM-CoT-Alpha, a two-stage framework that fuses 3D-aware supervision with multi-modal Chain-of-Thought (CoT) guidance. Stage 1 jointly pre-trains a vision-language model and a 3D reconstruction expert, while Stage 2 trains CoT reasoning through supervised description, self-reflection, and final decisions, augmented by a Mixed Preference Optimization objective. A large-scale CoT-style dataset for reasoning about occluded objects held in hand (104,671 samples) supports learning and evaluation. Results show consistent performance gains across multiple backbones and data scales, with notable improvements in Description, Reflection, and Decision scores, highlighting the value of progressive reasoning and 3D-aware cues for occluded object understanding in multimodal systems.

Abstract

Comprehension of occluded objects is not well studied in existing large-scale visual-language multi-modal models. Current state-of-the-art multi-modal large models struggle to provide satisfactory results in understanding occluded objects through universal visual encoders and supervised learning strategies. Therefore, we propose OCC-MLLM-CoT-Alpha, a multi-modal large vision-language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. In particular, (1) we build a multi-modal large vision-language model framework that consists of a large multi-modal vision-language model and a 3D reconstruction expert model. (2) The corresponding multi-modal Chain-of-Thoughts is learned through a combination of supervised and reinforcement training strategies, allowing the multi-modal vision-language model to enhance its recognition ability with learned multi-modal chain-of-thoughts guidance. (3) A large-scale multi-modal chain-of-thoughts reasoning dataset, consisting of $110k$ samples of occluded objects held in hand, is built. In the evaluation, the proposed methods demonstrate decision score improvements of 15.75%, 15.30%, 16.98%, 14.62% and 4.42%, 3.63%, 6.94%, 10.70% for two settings across a variety of state-of-the-art models.

Paper Structure

This paper contains 13 sections, 8 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Step-by-Step Occlusion Reasoning Framework Using Multi-modal LLM with Stepwise Chain-of-Thoughts Guidance for Enhanced Object Recognition.
  • Figure 2: Step-by-Step Occlusion Reasoning Examples: Showcasing the Internal Chain-of-Thought Process.