OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Shuxin Yang; Xinhan Di

TL;DR

We introduce a multi-modal large language framework and a corresponding self-supervised learning strategy supported by 3D generation. Initial results show a 16.92% improvement over state-of-the-art VLM models.

Abstract

Existing large-scale vision-language multi-modal models have a gap in their understanding of occluded objects. Current state-of-the-art multi-modal models, which rely on universal visual encoders and supervised learning strategies, fail to describe occluded objects satisfactorily. We therefore introduce a multi-modal large language framework and a corresponding self-supervised learning strategy supported by 3D generation. We begin our experiments by comparing against state-of-the-art models on the large-scale dataset SOMVideo [18]. Initial results show a 16.92% improvement over state-of-the-art VLM models.

Paper Structure (17 sections, 6 equations, 3 figures, 2 tables)

Introduction
Method
Formulation of OCC-MLLM-Alpha Generation
Input Formulation
Model Forward
Decoding
Dual Visual Encoder Module
Visual Embedding For Occluded Objects
Test-Time Adaptation Based on Self-Supervised Learning
Multi-stage Learning Strategy
Dataset
Dataset Overview
Experiments and Results
Experiments on GPT4o
Experiments on Mini-Gemini
...and 2 more sections

Figures (3)

Figure 1: Overview of the Proposed Multi-Modal Vision-Language Model for the Occluded Objects with Self-Supervised Test-Time Learning.
Figure 2: Overview of the proposed second 3D reconstruction module $f_{3D}$. This method reconstructs a mesh of occluded objects from a single RGB image.
Figure 3: Dataset example. The object is occluded. There are five instructions and five corresponding descriptions.
