From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Grounded Open-vocabulary Situation Recognition

Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew

TL;DR

This paper proposes Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations.

Abstract

Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we explore transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
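
The abstract does not give the loss functions, so the following is only a minimal sketch of how a negative-guided alignment step and a distillation step could be written, not the authors' implementation. It assumes an InfoNCE-style contrastive loss for aligning a prompt embedding with a positive rationale against negative rationales, and a cosine-distance loss for distilling the aligned prompt into a student feature. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_guided_alignment_loss(prompt, positive, negatives, temperature=0.1):
    """Assumed InfoNCE-style loss (hypothetical stand-in for NMPA).

    Pulls the prompt embedding toward the positive rationale embedding
    and pushes it away from the negative rationale embeddings.
    """
    pos_logit = cosine(prompt, positive) / temperature
    logits = [pos_logit] + [cosine(prompt, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


def distillation_loss(student_feature, teacher_prompt):
    """Assumed cosine-distance loss between student and teacher embeddings."""
    return 1.0 - cosine(student_feature, teacher_prompt)


if __name__ == "__main__":
    positive = [1.0, 0.0]      # toy embedding of a positive rationale
    negatives = [[0.0, 1.0]]   # toy embedding of a negative rationale
    aligned = negative_guided_alignment_loss([1.0, 0.0], positive, negatives)
    misaligned = negative_guided_alignment_loss([0.0, 1.0], positive, negatives)
    print(f"aligned prompt loss:    {aligned:.5f}")
    print(f"misaligned prompt loss: {misaligned:.5f}")
    print(f"distillation loss:      {distillation_loss([2.0, 0.0], [1.0, 0.0]):.5f}")
```

In this toy setup, a prompt that points toward the positive rationale yields a much smaller alignment loss than one that points toward the negative rationale, which is the behaviour a negative-guided objective is meant to encourage.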

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of conventional closed-set Grounded Situation Recognition (GSR) and the proposed Open-vocabulary GSR (Ov-GSR). (a) Closed-set GSR methods fail to predict unseen activities (e.g., Hugging), resulting in incorrect semantic role recognition. (b) Ov-GSR gains the ability to identify unseen situations through the proposed Multimodal Interactive Prompt Distillation (MIPD) framework. For example, it correctly predicts the salient activity “Hugging,” along with its semantic roles and detects the entities: AgentPart: Paw, Hugged: Dog, Place: Outdoors, and Agent: Cat.
  • Figure 2: Comparison of inference resource requirements against existing larger models. Our model uses less memory at deployment and achieves higher frames per second (FPS).
  • Figure 3: Overview of our framework: We first leverage an MLLM guided with (a) instructions to generate pseudo glimpse and gaze rationales for scene and entity understanding. This is followed by the (b) Judgmental Rationales Generator (JRG), which employs an LLM-judge to evaluate and iteratively refine these rationales through multi-round reasoning, resulting in high-quality positive and negative rationales. These rationales are then aligned with scene-aware and instance-perception prompts to encapsulate visual and semantic information from the teacher MLLM through the Negative-Guided Multimodal Prompting Alignment (NMPA) module. Finally, our proposed (c) Multimodal Interactive Prompt Distillation (MIPD) framework distills the aligned multimodal knowledge into the student model, enabling more accurate and generalizable Ov-GSR.
  • Figure 4: Examples of unseen situations (top) and rare situations (bottom). Green indicates correct predictions, red indicates incorrect ones, and bold colored text highlights our correct predictions with grounding.