LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation

Kibum Kim, Kanghoon Yoon, Jaehyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park

TL;DR

This work tackles weakly supervised scene graph generation by addressing two bottlenecks in caption-based triplet formation: semantic over-simplification and low-density supervision. It introduces LLM4SGG, which uses two Chain-of-Thought prompted LLM chains with in-context few-shot learning to (i) extract richer triplets from both original and paraphrased captions and (ii) align subjects, objects, and predicates to target lexicons, producing high-quality unlocalized triplets. These triplets are then grounded with state-of-the-art methods (SGNLS or $\text{VS}^3$) to generate pseudo-labels for training SGG models, yielding substantial gains on Visual Genome and GQA, notably improving mean Recall@K ($mR@K$) and demonstrating data-efficient training. The approach shows superior performance over strong baselines, reduces zero-frequency predicates, and provides a practical preprocessing pipeline; it also highlights limitations related to reliance on a proprietary LLM and suggests avenues for grounding with LLMs and exploring smaller models.

Abstract

Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.
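
To illustrate why the abstract reports both Recall@K and mean Recall@K, the following is a minimal sketch of the two metrics under simplifying assumptions: a single image, triplets matched on labels only (no bounding-box overlap check), and hypothetical function names. It is not the paper's evaluation code. The toy data shows how a model that collapses a fine-grained predicate into a coarse one can keep a high Recall@K while its mean Recall@K drops.

```python
from collections import defaultdict

def recall_at_k(gt_triplets, ranked_predictions, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)

def mean_recall_at_k(gt_triplets, ranked_predictions, k):
    """Average of per-predicate recalls, so rare predicates weigh as much as frequent ones."""
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in gt_triplets:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    per_predicate = [hits[p] / totals[p] for p in totals]
    return sum(per_predicate) / len(per_predicate)

# Toy ground truth: three coarse "on" triplets and one fine-grained "riding" triplet.
gt = [
    ("man", "on", "street"),
    ("dog", "on", "grass"),
    ("cup", "on", "table"),
    ("man", "riding", "horse"),
]
# Toy predictions that simplify "riding" to "on".
preds = [
    ("man", "on", "street"),
    ("dog", "on", "grass"),
    ("cup", "on", "table"),
    ("man", "on", "horse"),
]

print(recall_at_k(gt, preds, 4))       # 0.75
print(mean_recall_at_k(gt, preds, 4))  # 0.5
```

Here the missed "riding" triplet costs only a quarter of Recall@4 but half of mean Recall@4, which is why a long-tailed predicate distribution hurts the mean metric most.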

Paper Structure (43 sections, 1 equation, 15 figures, 13 tables)

Figures (15)

  • Figure 1: (a) The pipeline of weakly-supervised SGG. (b) The predicate distribution of unlocalized triplets (Parser+KB vs. Ours). In Parser+KB, the distribution becomes heavily long-tailed, and 12 out of 50 predicates are non-existent. (c) Semantic over-simplification caused by a rule-based parser in Step 2. (d) Low-density scene graph caused by the static structure of KB in Step 3.
  • Figure 2: The pipeline of LLM4SGG. Given an image with its caption, we use an LLM to extract triplets from the original caption (Step 2-1) and the paraphrased caption (Step 2-2). Then, we align the entity/predicate classes within the extracted triplets with semantically similar lexeme in the target data via an LLM (Step 3), obtaining the unlocalized triplets. Lastly, we ground the unlocalized triplets over image regions (Step 4) followed by the training of an SGG model.
  • Figure 3: Per class performance (Bar: number of predicate instances, Line: Recall@100).
  • Figure 4: Performance over various numbers of images used for training $\text{VS}^3$+LLM4SGG.
  • Figure 5: Framework for addressing large predefined lexicons when aligning classes with those of interest.
  • ...and 10 more figures
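
The steps in the Figure 2 caption (triplet extraction from the original and paraphrased captions, then alignment to the target lexicon) can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the prompt wording, the `subject | predicate | object` reply format, and the canned `toy_llm` stand-in are hypothetical, and the sketch does not reproduce the paper's Chain-of-Thought prompts or in-context examples. Grounding (Step 4) is left out.

```python
from typing import Callable, List, Optional, Set, Tuple

Triplet = Tuple[str, str, str]
LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

def extract_triplets(caption: str, llm: LLM) -> List[Triplet]:
    """Steps 2-1 / 2-2: parse one 'subject | predicate | object' triplet per reply line."""
    reply = llm(f"Extract triplets from: {caption}")
    triplets = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets

def align(term: str, lexicon: Set[str], llm: LLM) -> Optional[str]:
    """Step 3: map a term to a class in the target lexicon, or None if no match."""
    if term in lexicon:
        return term
    reply = llm(f"Align '{term}' to one of: {', '.join(sorted(lexicon))}").strip()
    return reply if reply in lexicon else None

def unlocalized_triplets(caption: str, llm: LLM,
                         entities: Set[str], predicates: Set[str]) -> List[Triplet]:
    """Run extraction on the caption and its paraphrase, then align and deduplicate."""
    paraphrase = llm(f"Paraphrase: {caption}")
    candidates = extract_triplets(caption, llm) + extract_triplets(paraphrase, llm)
    result: List[Triplet] = []
    for s, p, o in candidates:
        aligned = (align(s, entities, llm), align(p, predicates, llm), align(o, entities, llm))
        if None not in aligned and aligned not in result:
            result.append(aligned)
    return result

def toy_llm(prompt: str) -> str:
    """Canned replies standing in for a real LLM call (illustration only)."""
    canned = {
        "Paraphrase: A guy is lying on a sofa": "A man rests on a couch",
        "Extract triplets from: A guy is lying on a sofa": "guy | lying on | sofa",
        "Extract triplets from: A man rests on a couch": "man | rests on | couch",
        "Align 'guy'": "man",
        "Align 'sofa'": "couch",
        "Align 'rests on'": "lying on",
    }
    for key, reply in canned.items():
        if prompt.startswith(key):
            return reply
    return "none"

triplets = unlocalized_triplets(
    "A guy is lying on a sofa", toy_llm,
    entities={"man", "couch"}, predicates={"lying on", "on"},
)
print(triplets)  # [('man', 'lying on', 'couch')]
```

In this toy run, "guy" and "sofa" are not in the target lexicon, so a lookup against a fixed vocabulary would discard the triplet; the alignment step maps them to "man" and "couch" and keeps the fine-grained predicate "lying on".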