Compile Scene Graphs with Reinforcement Learning

Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen

TL;DR

The paper addresses end-to-end scene graph generation with multimodal large language models by proposing R1-SGG, a two-stage framework that combines supervised fine-tuning with reinforcement learning. It introduces graph-centric rewards (Hard Recall, Hard Recall+Relax, Soft Recall, and a format-consistency reward) and optimizes with Group Relative Policy Optimization to align outputs with SGDET metrics. Experiments on VG150 and PSG show a substantially lower failure rate and higher Recall and mean Recall than both traditional SGG models and existing multimodal LLMs. The work demonstrates the value of RL-driven structured output for multimodal reasoning and releases its code to advance structured visual understanding with M-LLMs.
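To make the optimization step more concrete, the sketch below shows the group-relative advantage that gives Group Relative Policy Optimization its name: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    Assumption: population standard deviation plus a small eps for
    numerical stability; implementations differ on both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 4 scene graphs sampled for one image.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to provide a baseline.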

Abstract

Next-token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. Although LLMs are an effective way to model language, image, video, and other modalities, their use for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. This task requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) that is first trained via supervised fine-tuning (SFT) on a scene graph dataset and then refined with reinforcement learning to improve its ability to generate scene graphs end to end. SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. We design a set of graph-centric rewards, including three recall-based variants (Hard Recall, Hard Recall+Relax, and Soft Recall), which evaluate semantic and spatial alignment between predictions and ground truth at the object and relation levels. A format consistency reward further ensures that outputs follow the expected structural schema. Extensive experiments on the VG150 and PSG benchmarks show that R1-SGG substantially reduces failure rates and achieves strong performance in Recall and mean Recall, surpassing traditional SGG models and existing multimodal language models. Our code is available at https://github.com/gpt4vision/R1-SGG
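As a rough illustration of what a recall-based graph reward and a format check can look like, the sketch below scores a predicted scene graph against ground truth. It is not the paper's reward definition: the JSON schema (`objects` and `relationships` keys), the tuple layout of a triplet, the IoU threshold of 0.5, and all function names are assumptions chosen for the example. Only a strict, exact-label variant is shown; the relaxed and soft variants are not sketched.

```python
import json


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def format_reward(text):
    """Return 1.0 if text is a JSON object with list-valued 'objects'
    and 'relationships' keys (an assumed schema), else 0.0."""
    try:
        graph = json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    ok = (
        isinstance(graph, dict)
        and isinstance(graph.get("objects"), list)
        and isinstance(graph.get("relationships"), list)
    )
    return 1.0 if ok else 0.0


def hard_triplet_recall(pred, gt, thr=0.5):
    """Fraction of ground-truth triplets matched by some prediction.

    A triplet is (subj_label, subj_box, predicate, obj_label, obj_box).
    A match needs identical labels and predicate, and IoU >= thr for
    both the subject box and the object box.
    """
    if not gt:
        return 0.0
    hits = 0
    for gs, gsb, gp, go, gob in gt:
        if any(
            ps == gs and pp == gp and po == go
            and iou(psb, gsb) >= thr and iou(pob, gob) >= thr
            for ps, psb, pp, po, pob in pred
        ):
            hits += 1
    return hits / len(gt)


# Hypothetical example: one of two ground-truth triplets is recovered.
gt = [
    ("person", [0, 0, 10, 10], "riding", "horse", [5, 5, 20, 20]),
    ("horse", [5, 5, 20, 20], "on", "grass", [0, 15, 30, 30]),
]
pred = [("person", [1, 1, 10, 10], "riding", "horse", [5, 5, 19, 19])]
recall = hard_triplet_recall(pred, gt)
```

A strict criterion like this yields a sparse signal, since a triplet earns nothing unless both labels, the predicate, and both boxes are right, which is one plausible motivation for also offering relaxed and soft variants.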

Paper Structure

This paper contains 29 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of multimodal LLMs (M-LLMs) fine-tuned via Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for Scene Graph Generation (SGG).
  • Figure 2: Comparison of R1-SGG-Zero and R1-SGG models against SFT baselines (Qwen2-VL-2B/7B-Instruct) across training steps on the VG150 validation set in terms of Failure Rate (%), AP@50, and Recall (%).
  • Figure 3: Comparison of predicate frequency and predicate-wise recall on the VG150 validation set. Subfigures (b) and (c) report the recall performance of R1-SGG compared to four models on the top-24 and tail-25 predicates, respectively (the VG150 validation set contains only 49 predicates; the predicate "flying in" is missing). Here, Baseline refers to Qwen2-VL-7B-Instruct.
  • Figure 4: Comparison of predicate frequency and predicate-wise recall on the PSG test set. Subfigures (b) and (c) report the recall performance of R1-SGG compared to four models on the top-28 and tail-28 predicates, respectively. Here, Baseline refers to Qwen2-VL-7B-Instruct.
  • Figure 5: Performance comparison of R1-SGG (2B) across training steps on the VG150 validation set. Each row evaluates a different setting: (Top) KL divergence regularization ($\beta{=}0.04$ vs. $\beta{=}0$), (Middle) sampling length, and (Bottom) group size. Metrics reported include Failure Rate (%), AP@50, and Recall (%).
  • ...and 2 more figures