OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects
Wenmo Qiu, Xinhan Di
TL;DR
This work addresses the challenge of describing occluded objects with multimodal large language models by introducing OCC-MLLM, an occlusion-aware MLLM that fuses a CLIP-based RGB visual encoder with a 3D-SDF occlusion encoder. The model is generative and autoregressive: it takes input tokens $x^v$ and $x^p$, computes next-token probabilities via $p(x_t \mid x_{<t}) = \mathrm{SoftMax}\left( H(h_t) \right)$, and generates descriptions with beam search decoding. A key contribution is the dual-encoder design, in which $x^v=\alpha x^{v1}+(1-\alpha)x^{v2}$; the second term comes from an SDF-based occlusion embedding that reconstructs 3D geometry and yields a CLIP-based visual token $x^{v2}$. The authors also release OCC-HO, a 600k-image dataset with occlusion-focused annotations and five question-answer pairs per image for instruction tuning. They report that baseline GPT-4V lags on occlusion tasks, while fine-tuned MiniGPT-4-V2 and the SDF encoder provide meaningful improvements, underscoring the value of explicit occlusion modeling for VLMs.
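The two formulas in the TL;DR above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, and the restriction of $\alpha$ to $[0, 1]$ are assumptions made for illustration, and the logits stand in for the output of $H(h_t)$.

```python
import math


def fuse_visual_tokens(x_v1, x_v2, alpha):
    """Element-wise fusion x^v = alpha * x^{v1} + (1 - alpha) * x^{v2}.

    Assumption: alpha lies in [0, 1], so the fusion is a convex combination.
    """
    if len(x_v1) != len(x_v2):
        raise ValueError("both embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(x_v1, x_v2)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for the RGB token and the occlusion token.
x_v1 = [1.0, 0.0, 2.0]
x_v2 = [0.0, 4.0, 2.0]
fused = fuse_visual_tokens(x_v1, x_v2, alpha=0.5)

# Hypothetical logits standing in for H(h_t) over a three-word vocabulary.
probs = softmax([2.0, 1.0, 0.0])
```

With `alpha=0.5` both encoders contribute equally; moving `alpha` toward 1 weights the RGB encoder more heavily, and moving it toward 0 weights the occlusion encoder more heavily.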
Abstract
Existing large-scale visual-language multimodal models show a gap in understanding occluded objects. Current state-of-the-art multimodal models that rely on universal visual encoders fail to describe occluded objects satisfactorily. Another challenge is the limited number of datasets containing image-text pairs with a large number of occluded objects. We therefore introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images. We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models to understand occluded objects. We begin our experiments with a comparison against state-of-the-art models.
