DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li
TL;DR
DiscoSG introduces a discourse-level text scene graph parsing task and the DiscoSG-DS dataset to capture cross-sentence phenomena in multi-sentence captions. It proposes DiscoSG-Refiner, a lightweight iterative framework that starts from a seed graph and refines it via a dedicated Programmer and a deterministic Interpreter, achieving about 30% higher SPICE than sentence-merging baselines and inference up to 86× faster than GPT-4o. The approach generalizes from simple to dense graphs and improves downstream Vision-Language tasks, including discourse-level caption evaluation and hallucination detection, where it outperforms alternative open-source parsers. The work also introduces D-FOIL, a benchmark for discourse-level hallucination detection, and reports that graph-based metrics reflect semantic correctness better than traditional n-gram or embedding-based metrics. Overall, DiscoSG offers cost-effective, open-source discourse-level graph parsing that benefits VLM evaluation and downstream tasks.
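The seed-and-refine idea summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triple representation, the "insert"/"delete" edit vocabulary, the function names, and the rule-based `toy_programmer`, which merely stands in for a learned graph-editing model. It is not the DiscoSG-Refiner implementation.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object); hypothetical format
Edit = Tuple[str, Triple]       # ("insert" | "delete", triple); hypothetical format


def interpret(graph: Set[Triple], edits: List[Edit]) -> Set[Triple]:
    """Deterministically apply a list of edit operations to a graph."""
    result = set(graph)
    for op, triple in edits:
        if op == "insert":
            result.add(triple)
        elif op == "delete":
            result.discard(triple)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return result


def refine(
    caption: str,
    seed: Set[Triple],
    programmer: Callable[[str, Set[Triple]], List[Edit]],
    max_iters: int = 3,
) -> Set[Triple]:
    """Start from a seed graph and apply proposed edits until nothing changes."""
    graph = set(seed)
    for _ in range(max_iters):
        edits = programmer(caption, graph)
        if not edits:
            break
        new_graph = interpret(graph, edits)
        if new_graph == graph:
            break
        graph = new_graph
    return graph


def toy_programmer(caption: str, graph: Set[Triple]) -> List[Edit]:
    """Hand-written stand-in for a learned editor: resolves one pronoun."""
    stale = ("it", "on", "sofa")
    if stale in graph:
        return [("delete", stale), ("insert", ("cat", "on", "sofa"))]
    return []


caption = "A black cat is in the room. It sits on the sofa."
seed = {("cat", "is", "black"), ("it", "on", "sofa")}
refined = refine(caption, seed, toy_programmer)
```

Keeping the interpreter deterministic means the only learned component is the one that proposes edits, so each refinement step can be inspected as an explicit list of insertions and deletions.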
Abstract
Vision-Language Models (VLMs) generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers built for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. We introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), and release DiscoSG-DS, a dataset of 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than graphs in existing datasets. Fine-tuning GPT-4o on DiscoSG-DS yields a SPICE score over 40% higher than that of the best sentence-merging baseline. However, its high inference cost and licensing restrict open-source use. Smaller fine-tuned open-source models (e.g., Flan-T5) perform well on simpler graphs yet degrade on denser, more complex graphs. To bridge this gap, we introduce DiscoSG-Refiner, a lightweight open-source parser that drafts a seed graph and iteratively refines it with a novel learned graph-editing model, achieving 30% higher SPICE than the baseline while delivering 86 times faster inference than GPT-4o. It generalises from simple to dense graphs, thereby consistently improving downstream VLM tasks, including discourse-level caption evaluation and hallucination detection, outperforming alternative open-source parsers. Code and data are available at https://github.com/ShaoqLin/DiscoSG.
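To make the fragmentation problem concrete, the sketch below merges two hand-written sentence-level graphs and scores the result with an exact-match triple F1. The example triples, the function names, and the scoring function are assumptions made for illustration; the F1 here is a simplified stand-in and not the SPICE metric itself, which also accounts for synonym matching.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); hypothetical format


def merge_sentence_graphs(graphs: Iterable[Set[Triple]]) -> Set[Triple]:
    """Sentence-merging baseline: take the union of per-sentence graphs."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def triple_f1(pred: Set[Triple], gold: Set[Triple]) -> float:
    """Exact-match F1 over triples (simplified; not the full SPICE metric)."""
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Caption: "A black cat is in the room. It sits on the sofa."
sentence_graphs = [
    {("cat", "is", "black")},   # parsed from sentence 1
    {("it", "on", "sofa")},     # parsed from sentence 2; pronoun left unresolved
]
gold = {("cat", "is", "black"), ("cat", "on", "sofa")}

merged = merge_sentence_graphs(sentence_graphs)
merged_score = triple_f1(merged, gold)   # the "it" node is disconnected from "cat"
resolved_score = triple_f1(gold, gold)   # graph with the coreference resolved
```

In this toy case the merged graph keeps "it" as a separate node, so the second triple never matches the reference and the score is halved, which is the kind of loss a discourse-level parser is meant to avoid.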
