Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing

Fan Yuan, Xiaoyuan Fang, Rong Quan, Jing Li, Wei Bi, Xiaogang Xu, Piji Li

TL;DR

This work tackles Visual Commonsense Reasoning by addressing the lack of grounded scene-graph guidance in existing models. It introduces G2, a generative visual commonsense answering framework that constructs location-free scene graphs from image patches using LLMs, then jointly generates answers and explanations while automatically selecting informative graph triplets. On VCR, the paper reports strong results under both automatic metrics and human evaluation, and the approach also transfers to Visual Genome-based SGG, VQA-X, and e-SNLI-VE. The results suggest that explicit object relations and robust triplet selection are valuable for grounded, explainable multimodal reasoning.

Abstract

Visual Commonsense Reasoning, regarded as a challenging task in the pursuit of advanced visual scene comprehension, has been used to diagnose the reasoning ability of AI systems. However, reliable reasoning requires a good grasp of the scene's details. Existing work fails to effectively exploit the real-world object relationship information present within the scene and instead relies too heavily on knowledge memorized during training. Based on these observations, we propose a novel scene-graph-enhanced visual commonsense reasoning generation method named G2, which first uses image patches and LLMs to construct a location-free scene graph and then answers and explains based on the scene graph's information. We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training. Extensive experiments are conducted on the tasks and datasets of scene graph construction and of visual commonsense answering and explaining. Experimental results and ablation analysis demonstrate the effectiveness of our proposed framework.
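
To make the terms "location-free scene graph" and "triplet selection" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a scene graph can be represented as a list of (subject, predicate, object) triplets with no bounding boxes, and it swaps in a simple word-overlap heuristic for the learned selection strategy described above. The names `Triplet`, `select_triplets`, and `build_prompt` are hypothetical and do not come from the paper.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    """One scene-graph edge without bounding boxes (location-free)."""
    subject: str
    predicate: str
    obj: str


def select_triplets(question: str, graph: List[Triplet], top_k: int = 2) -> List[Triplet]:
    """Keep the top_k triplets sharing the most words with the question.

    ASSUMPTION: word overlap is an illustrative stand-in; the paper's
    selection is learned automatically during training.
    """
    q_words = set(question.lower().replace("?", "").split())

    def overlap(t: Triplet) -> int:
        t_words = set(f"{t.subject} {t.predicate} {t.obj}".lower().split())
        return len(q_words & t_words)

    # sorted() is stable, so ties keep their original graph order.
    return sorted(graph, key=overlap, reverse=True)[:top_k]


def build_prompt(question: str, triplets: List[Triplet]) -> str:
    """Serialize the selected triplets and the question into one text prompt."""
    graph_text = "; ".join(f"({t.subject}, {t.predicate}, {t.obj})" for t in triplets)
    return f"Scene graph: {graph_text}\nQuestion: {question}"


# Hypothetical scene, invented for illustration only.
graph = [
    Triplet("man", "holding", "cup"),
    Triplet("woman", "sitting on", "chair"),
    Triplet("cup", "on", "table"),
]
question = "Why is the man holding a cup?"
selected = select_triplets(question, graph)
prompt = build_prompt(question, selected)
```

In this toy setup the prompt would be passed to an answer-and-explanation generator together with the image; the sketch stops at prompt construction because the generator itself is model-specific.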

Paper Structure (23 sections, 7 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: An example of visual commonsense reasoning. (a) is the image of the scene. (b) is a scene graph generated from (a). In the table of (c), the first row is the Question, the second row is the Answer and Explanation (AE) without scene graph, the third row is the Answer and Explanation (AE) with scene graph, and the fourth row is the Ground Truth.
  • Figure 2: An overview of our proposed G2. It first generates a scene graph based on the patch sequence and object prompt of the image. Then, combined with the question and image, it automatically selects the scene graph during training, and then generates answers and explanations that are consistent with commonsense.
  • Figure 3: Human evaluation of the ground-truth explanations for the VCR dataset. "Filtered" refers to the evaluation with correct answers. "Unified" refers to the assessment of both answers and explanations.
  • Figure 4: Case study of G2 on the VCR dataset. "Q", "SG", "A+R", and "GT" denote the question, generated scene graph, predicted answer and rationale, and ground-truth answer and explanation, respectively.
  • Figure 5: Representative visualization cases of the proposed G2. "G2 w/o SG", "G2", and "GT" denote the answer and explanation of G2 without scene graph, G2, and ground truth respectively.
  • ...and 3 more figures