OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

Wenmo Qiu; Xinhan Di

TL;DR

This work addresses the challenge of describing occluded objects with multimodal large language models by introducing OCC-MLLM, an occlusion-aware MLLM that fuses a CLIP-based RGB visual encoder with a 3D-SDF occlusion encoder. The model is generative and autoregressive: it takes input tokens $x^v$ and $x^p$, computes next-token probabilities via $p(x_t \mid x_{<t}) = \mathrm{SoftMax}\left(H(h_t)\right)$, and generates descriptions with beam search decoding. A key contribution is the dual-encoder design, in which the visual input is the weighted combination $x^v=\alpha x^{v1}+(1-\alpha)x^{v2}$ of the two encoders' outputs. The second encoder is an SDF-based occlusion encoder that reconstructs 3D geometry and yields a CLIP-based visual token $x^{v2}$. The authors also release OCC-HO, a 600k-image dataset with occlusion-focused annotations and five question-answer pairs per image for instruction tuning. They report that baseline GPT-4V lags on occluded-object tasks, while fine-tuned MiniGPT-4-V2 and the SDF encoder improve results, underscoring the value of explicit occlusion modeling for VLMs.
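The two formulas in the summary above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy embedding values, the choice of $\alpha = 0.25$, and the treatment of $H$ as a simple linear vocabulary head are all assumptions made for illustration.

```python
import math

def fuse_visual_tokens(x_v1, x_v2, alpha):
    """Blend two visual embeddings: x_v = alpha * x_v1 + (1 - alpha) * x_v2."""
    if len(x_v1) != len(x_v2):
        raise ValueError("embeddings must have the same length")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(x_v1, x_v2)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(h_t, head_weights):
    """p(x_t | x_<t) = SoftMax(H(h_t)); H is assumed here to be a linear head."""
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in head_weights]
    return softmax(logits)

# Toy values (hypothetical): 2-dimensional embeddings from the two encoders.
x_v = fuse_visual_tokens([1.0, 0.0], [0.0, 1.0], alpha=0.25)

# Toy hidden state and a 3-word vocabulary head (hypothetical weights).
h_t = [0.5, -0.5]
probs = next_token_probs(h_t, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

With $\alpha$ close to 1 the fused token is dominated by the RGB encoder output $x^{v1}$; with $\alpha$ close to 0 it is dominated by the occlusion-aware token $x^{v2}$. In a real model the hidden state `h_t` would come from the language model's forward pass over the fused visual tokens and the remaining input tokens, which this sketch does not attempt to reproduce.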

Abstract

Existing large-scale visual-language multimodal models have a gap in their understanding of occluded objects. Current state-of-the-art multimodal models, which rely on universal visual encoders, fail to describe occluded objects satisfactorily. Another challenge is the limited number of datasets containing image-text pairs with a large number of occluded objects. Therefore, we introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images. We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models to understand occluded objects. We begin our experiments with comparisons against state-of-the-art models.

Paper Structure (16 sections, 5 equations, 3 figures, 2 tables)

This paper contains 16 sections, 5 equations, 3 figures, 2 tables.

Introduction
Method
Formulation of OCC-MLLM Generation
Input Formulation
Model Forward
Decoding
Dual Visual Encoder Module
Visual Embedding For Occluded Objects
Dataset
Dataset Overview
Dataset Annotation
Experiments and Results
Experiments on GPT4v [openai2023gpt]
Experiments on MiniGPT4-V2 [chen2023minigpt]
Experiments on the Proposed SDF Encoder [chen2023gsdf]
...and 1 more sections

Figures (3)

Figure 1: Overview of the Proposed Multi-Modal Vision-Language Model for the Occluded Objects.
Figure 2: Overview of the proposed second visual encoder reconstruction model $f_{3D}$. This method reconstructs a mesh of realistic subjects and occluded objects from a single RGB image.
Figure 3: Custom dataset example. The object is occluded. There are five instructions and five corresponding descriptions.
