
Integrating Multimodal Large Language Model Knowledge into Amodal Completion

Heecheol Yun, Eunho Yang

Abstract

With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both (1) the extent and (2) the content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements over all existing methods, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
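The control flow described in the abstract (and detailed in Figure 3 below) can be summarized in pseudocode. The following is a minimal sketch of that pipeline in Python, not the paper's released implementation: the helper callables (estimate_occlusion, query_mllm, inpaint, score_completion), the occlusion threshold, and the score-and-resample refinement loop are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in types; the actual pipeline operates on images and masks.
Image = object
Mask = object
BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


@dataclass
class MLLMGuidance:
    """Guidance an MLLM produces for a heavily occluded target object."""
    full_bboxes: List[BBox]  # geometric guidance: multi-scale full-object extents
    description: str         # semantic guidance: text describing the occluded content


def amodal_complete(
    image: Image,
    modal_mask: Mask,
    estimate_occlusion: Callable[[Image, Mask], float],
    query_mllm: Callable[[Image, Mask], MLLMGuidance],
    inpaint: Callable[[Image, Mask, Optional[MLLMGuidance]], Image],
    score_completion: Callable[[Image], float],
    occlusion_threshold: float = 0.5,
    max_refine_steps: int = 3,
) -> Image:
    # Guidance Decision Module: invoke the MLLM only when the object is heavily occluded.
    guidance = None
    if estimate_occlusion(image, modal_mask) >= occlusion_threshold:
        # Geometric + Semantic Guidance Modules: extent and content of the missing region.
        guidance = query_mllm(image, modal_mask)

    # Inpainting Module: a visual generative model integrates the guidance.
    best = inpaint(image, modal_mask, guidance)
    best_score = score_completion(best)

    # Iterative refinement: re-generate and keep the best-scoring completion,
    # since inaccurate MLLM guidance can produce imperfect results.
    for _ in range(max_refine_steps):
        candidate = inpaint(image, modal_mask, guidance)
        score = score_completion(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```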

Paper Structure

This paper contains 26 sections, 2 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Our method selectively leverages MLLMs to reason about the extent and content of occluded parts. Incorporating them into amodal completion effectively enhances performance.
  • Figure 2: Completion results when meaningful parts of the target object are occluded. Stable Diffusion (SD) inpainting (rombach2022high) often generates objects other than the target object. Existing amodal completion methods (ozguroglu2024pix2gestalt; xu2024amodal; ao2025open) lack an understanding of what should be generated for the missing parts. In contrast, our method provides explicit guidance on what should be reconstructed.
  • Figure 3: Overview of AmodalCG. Our framework first determines which samples would benefit from MLLM guidance (Guidance Decision Module). For those requiring guidance, the MLLM generates two key types of information about the occluded part of the target object: (1) the bounding box size of the full target object (Geometric Guidance Module) and (2) textual descriptions of the occluded region (Semantic Guidance Module). These are then incorporated into the completion process through a multi-scale expansion strategy, which selects the appropriate bounding box scale among the MLLM’s predictions (Inpainting Module). A sketch of this scale-selection step appears after this figure list.
  • Figure 4: Amodal completion results based on inpainting mask size. Unwanted objects are generated when the inpainting mask is substantially larger than the actual occluded region.
  • Figure 5: Example of the Geometric Guidance Module predicting multi-scale bounding boxes for the full target object.
  • ...and 14 more figures
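The multi-scale expansion strategy in Figure 3, together with Figure 4's observation that oversized inpainting masks invite unwanted objects, suggests a simple selection rule. The sketch below is one plausible rendering of it: try the MLLM's predicted bounding boxes from smallest to largest and keep the first scale whose completion passes a validity check. The helper names (inpaint_at_scale, completion_is_valid) and the smallest-first criterion are assumptions, not the paper's exact procedure.

```python
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def contains(outer: BBox, inner: BBox) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box: BBox) -> int:
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def multiscale_expand_and_select(
    image,                                   # input RGB image
    modal_bbox: BBox,                        # bbox of the visible (modal) region
    predicted_bboxes: List[BBox],            # MLLM's multi-scale full-object predictions (Figure 5)
    inpaint_at_scale: Callable[[object, BBox], object],   # hypothetical inpainting call
    completion_is_valid: Callable[[object, BBox], bool],  # hypothetical validity check
):
    # Keep only predictions that actually enclose the visible part of the object,
    # and try them from smallest to largest: oversized masks tend to introduce
    # unwanted objects (Figure 4).
    candidates = sorted(
        (b for b in predicted_bboxes if contains(b, modal_bbox)), key=area
    )
    for bbox in candidates:
        completion = inpaint_at_scale(image, bbox)
        if completion_is_valid(completion, bbox):
            return completion, bbox

    # Fall back to the largest candidate (or the modal bbox) if none pass the check.
    bbox = candidates[-1] if candidates else modal_bbox
    return inpaint_at_scale(image, bbox), bbox
```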