From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, Yiannis Aloimonos

TL;DR

The paper presents Scene Description Graphs (SDGs) as an intermediate, knowledge-grounded representation that combines deep visual detections with a commonsense knowledge base to generate descriptive sentences. A Bayesian network and a text-derived knowledge base are used to infer events, entities, and abstract concepts, enabling reasoning beyond flat captions. Amazon Mechanical Turk (AMT) evaluations on Flickr8k, Flickr30k and MS-COCO indicate that, in most cases, SDG-based sentences are more relevant and thorough than those of a recent state-of-the-art captioning approach, and image-sentence alignment results are comparable to those of recent state-of-the-art methods. The approach offers a scalable framework for vision-language grounding, explanation, and question-answering over visual scenes.

Abstract

In this paper we propose the construction of linguistic descriptions of images. This is achieved through the extraction of scene description graphs (SDGs) from visual scenes using an automatically constructed knowledge base. SDGs are constructed using both vision and reasoning. Specifically, commonsense reasoning is applied to (a) detections obtained from existing perception methods on given images, (b) a "commonsense" knowledge base constructed using natural language processing of image annotations, and (c) lexical ontological knowledge from resources such as WordNet. Amazon Mechanical Turk (AMT)-based evaluations on the Flickr8k, Flickr30k and MS-COCO datasets show that, in most cases, sentences auto-constructed from SDGs obtained by our method give a more relevant and thorough description of an image than a recent state-of-the-art image-caption-based approach. Our Image-Sentence Alignment Evaluation results are also comparable to those of recent state-of-the-art approaches.

Paper Structure

This paper contains 12 sections, 1 equation, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples from karpathy2014deep: (a) positive example annotation: "construction worker in orange safety vest is working on road"; (b) negative example annotation: "a bunch of bananas are hanging from a ceiling". Such annotations may be infrequent, but it is hard to logically justify such contrasting outputs.
  • Figure 2: Example image and a possible corresponding SDG. The SDG should also contain a similar event wear2 for person2; it is omitted because of space constraints. Spatial information, such as (person1,left,person2), is easy to add to the graph.
  • Figure 3: SDG and Sentence Generation through Reasoning using Knowledge Base and a Bayesian Network $\mathcal{B}_{n}$
  • Figure 4: (a) Constructing Knowledge Base From Annotations. (b) A snapshot of the $\mathcal{K}_b$. In this figure, Person and bench are entities, lay is the connecting event. The entity Person can have trait climber. The sub-graph essentially captures the knowledge of the activity person laying on a bench. The figure on the left shows the edge-labels.
  • Figure 5: A subgraph reflecting the dependencies captured in the Learnt Bayesian Network $\mathcal{B}_{n}$
  • ...and 1 more figures
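
The captions for Figures 2 and 4 describe SDGs and the knowledge base as graphs whose nodes are entities and events connected by labeled edges. A minimal sketch of one way to hold such a graph is shown below, as a list of (source, label, target) triples. The class name SceneDescriptionGraph, the edge labels "agent", "recipient" and "trait", and the orientation of the event edges are assumptions made for illustration; only the node names (person1, person2, wear1, Person, bench, lay, climber) and the spatial triple (person1,left,person2) come from the captions. This is not the authors' implementation.

```python
from collections import defaultdict


class SceneDescriptionGraph:
    """Toy graph stored as labeled (source, label, target) triples.

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self):
        self.triples = []
        self._outgoing = defaultdict(list)

    def add(self, source, label, target):
        """Add one labeled, directed edge."""
        self.triples.append((source, label, target))
        self._outgoing[source].append((label, target))

    def neighbors(self, node, label=None):
        """Return targets reachable from node, optionally filtered by edge label."""
        return [t for (l, t) in self._outgoing.get(node, []) if label is None or l == label]


# SDG fragment loosely following the Figure 2 caption.
sdg = SceneDescriptionGraph()
sdg.add("wear1", "agent", "person1")       # "agent" is an assumed edge label
sdg.add("person1", "left", "person2")      # spatial triple quoted in the caption

# Knowledge-base fragment loosely following the Figure 4 caption.
kb = SceneDescriptionGraph()
kb.add("lay", "agent", "Person")           # assumed label and direction
kb.add("lay", "recipient", "bench")        # assumed label and direction
kb.add("Person", "trait", "climber")       # assumed label for the trait edge

print(sdg.neighbors("person1", "left"))
print(kb.neighbors("lay"))
```

Storing edges as triples keeps the sketch close to the (person1,left,person2) notation in the caption, so adding spatial relations is the same operation as adding event edges.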