Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Jiaming Lei, Lin Li, Chunping Wang, Jun Xiao, Long Chen

TL;DR

This work tackles zero-shot grounded situation recognition (GSR) by introducing LEX, which injects language-model explainers at each stage of the GSR pipeline to overcome the limits of traditional class-based prompts. LEX employs three explainers—verb explainer, grounding explainer, and noun explainer—to generate richer, context-aware cues: multi-perspective verb descriptions, rephrased grounding templates for precise role localization, and scene-specific noun descriptions for context-consistent noun predictions. A discriminability-based weighting and a global noun refinement strategy enable training-free, plug-and-play integration with vision-language models, and extensive SWiG experiments show significant gains over strong baselines. The approach enhances zero-shot scene understanding with interpretable prompts, improving generalization to unseen actions and complex scenes in real-world settings.

Abstract

Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
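To make the verb-recognition step concrete, the following is a minimal sketch of description-based zero-shot scoring: each verb class is represented by several description embeddings rather than a single class prompt, and the image is assigned to the verb whose descriptions match it best under per-description weights. The function names (`cosine`, `score_verbs`), the toy 3-d vectors, and the weights are all illustrative assumptions. In particular, the weights merely stand in for a discriminability score; the formula LEX actually uses is not given in this excerpt.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def score_verbs(image_emb, verb_descriptions):
    """Score each verb as a weighted mean of image-description similarities.

    verb_descriptions maps verb -> list of (description_embedding, weight).
    The weights are hypothetical placeholders for a discriminability score.
    """
    scores = {}
    for verb, descs in verb_descriptions.items():
        total_w = sum(w for _, w in descs)
        scores[verb] = sum(w * cosine(image_emb, e) for e, w in descs) / total_w
    return scores


# Toy 3-d embeddings standing in for VLM (e.g., CLIP) features.
image_emb = [1.0, 0.0, 0.0]
verb_descriptions = {
    "studying": [([1.0, 0.0, 0.0], 2.0), ([0.0, 1.0, 0.0], 1.0)],
    "coloring": [([0.0, 1.0, 0.0], 1.0), ([0.0, 0.0, 1.0], 1.0)],
}

scores = score_verbs(image_emb, verb_descriptions)
predicted_verb = max(scores, key=scores.get)
```

In a real setting the embeddings would come from a VLM's image and text encoders, and the descriptions from a language model. The sketch only shows how multiple weighted descriptions per class can replace a single class-name prompt.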

Paper Structure (29 sections, 11 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Illustration of the straightforward pipeline for Zero-Shot GSR. 1) Verb Recognition: utilizing VLMs to identify verbs via verb class-based prompts. 2) Semantic Role Grounding: employing grounding models to localize semantic roles based on the verb-centric template. 3) Noun Recognition: applying VLMs for identifying entities within localized semantic roles through noun class-based comparison.
  • Figure 2: The limitations of class-based prompts for zero-shot GSR. (a) Ambiguous Action Concepts: the verb "studying" is mistakenly identified as "coloring" due to an unclear verb meaning. (b) Constrained Role Grounding: rigid templates misguide the grounding of "TOOL" in a complex scene. (c) Context-agnostic Noun Prediction: the noun "woman" is incorrectly classified as "hand" without considering semantic role context.
  • Figure 3: The framework of LEX. 1) Verb Recognition via Verb Explainer: generate general verb-centric descriptions to recognize verbs. 2) Role Localization via Grounding Explainer: generate a rephrased verb-centric template to localize semantic roles. 3) Noun Recognition via Noun Explainer: generate scene-specific noun descriptions to predict nouns.
  • Figure 4: An example of using the scene text of "biting" as the "image" to calculate the discriminability score.
  • Figure 5: The architecture of noun recognition.
  • ...and 1 more figure