SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

Jiaye Feng; Qixiang Yin; Yuankun Liu; Tong Mo; Weiping Li

TL;DR

A novel dual-granularity reward is proposed which integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering.

Abstract

Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered both by a lack of task-specific structured reasoning and by sparse, long-tailed relation distributions, resulting in incomplete scene graphs with low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), and that reasons in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM and is refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
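
To make the idea of frequency-based adaptive weighting concrete, the following is a minimal sketch of a fine-grained relation reward. It is not the paper's formula: the inverse-log-frequency weighting, the mean normalization, and the names `predicate_weights` and `fine_grained_reward` are all assumptions chosen only to illustrate how rare predicates can contribute more to the reward than frequent ones.

```python
import math
from collections import Counter


def predicate_weights(train_predicates, smoothing=1.0):
    """Hypothetical weighting: rarer predicates get larger weights.

    Uses inverse log-frequency (an assumption, not the paper's formula)
    and rescales so that the mean weight is 1.
    """
    counts = Counter(train_predicates)
    raw = {p: 1.0 / math.log(c + 1.0 + smoothing) for p, c in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {p: w / mean for p, w in raw.items()}


def fine_grained_reward(predicted, ground_truth, weights):
    """Weighted recall over exactly matched (subject, predicate, object) triplets."""
    gt = set(ground_truth)
    pred = set(predicted)
    total = sum(weights.get(p, 1.0) for _, p, _ in gt)
    if total == 0:
        return 0.0
    hit = sum(weights.get(p, 1.0) for s, p, o in gt if (s, p, o) in pred)
    return hit / total


# Toy long-tailed predicate distribution: "on" is common, "parked on" is rare.
train = ["on"] * 100 + ["parked on"] * 2
weights = predicate_weights(train)
gt = [("car", "parked on", "street"), ("cup", "on", "table")]

rare_only = fine_grained_reward([gt[0]], gt, weights)
common_only = fine_grained_reward([gt[1]], gt, weights)
```

Under this toy weighting, recovering only the rare triplet earns a larger reward than recovering only the common one, which is the behaviour a long-tail-aware reward is meant to encourage.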

Paper Structure (34 sections, 14 equations, 10 figures, 9 tables)

Introduction
Related Works
VLM for Scene Graph Generation
Chain of Thought in Visual Reasoning
Methodology
Problem Formulation
Three-stage Structured Reasoning
Type-aware Relation Augmentation
Reward Modeling
Format Reward
Object Category Detection Reward
Object Instance Grounding Reward
Dual-granularity Reward
Reinforcement Learning with Group Sequence Policy Optimization
Experiments
...and 19 more sections

Figures (10)

Figure 1: Pipeline comparison of scene graph generation between the traditional two-stage classification framework and the end-to-end generative MLLM method.
Figure 2: Overview of the SGG-R$^{\rm 3}$ framework. The "R$^{\rm 3}$" denotes three key contributions: Relation augmentation, structured Reasoning, and Reward alignment. First, candidate relations generated by Qwen2.5-VL-32B with a CoT prompt are filtered via Sentence-BERT embedding similarity against the original data. The model is then supervised fine-tuned on the CoT-formatted augmented data with the CoT prompt, followed by reward-driven reinforcement learning aligned with the original dataset.
Figure 3: Coarse-grained semantic clustering reward. A triplet is matched if its semantic embedding aligns with any ground-truth cluster centroid beyond a threshold, relaxing the strict matching requirement.
Figure 4: Quantitative analysis of the average number of relations generated per image on the VG150 and PSG test sets: SFT, SFT + RL; with/without relation augmentation (RA).
Figure 5: Comparison between original and CoT data. The original data are transformed into our three-stage format through a structured curation process.
...and 5 more figures
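
The matching rule described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the use of cosine similarity, the threshold value of 0.8, the averaging over predicted triplets, and the names `cosine`, `coarse_match`, and `coarse_reward` are assumptions, and the toy 2-D vectors stand in for embeddings that a sentence encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def coarse_match(triplet_embedding, centroids, threshold=0.8):
    """True if the embedding is close enough to ANY ground-truth cluster centroid."""
    return any(cosine(triplet_embedding, c) >= threshold for c in centroids)


def coarse_reward(pred_embeddings, centroids, threshold=0.8):
    """Fraction of predicted triplets that match some centroid (assumed aggregation)."""
    if not pred_embeddings:
        return 0.0
    matched = sum(coarse_match(e, centroids, threshold) for e in pred_embeddings)
    return matched / len(pred_embeddings)


# Toy centroids for two ground-truth relation clusters.
centroids = [[1.0, 0.0], [0.0, 1.0]]
near = [0.9, 0.1]   # close to the first centroid
far = [1.0, 1.0]    # equidistant from both, below the threshold
reward = coarse_reward([near, far], centroids)
```

Because a prediction only needs to land near a cluster centroid rather than reproduce the exact ground-truth predicate, semantically close relations can still be credited, which is how the coarse-grained reward relaxes strict matching.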
