SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordonez

TL;DR

This work introduces SCoRD, a subject-conditioned relation-detection task: given a subject in an image, enumerate all relevant relation–object pairs and their locations. The authors propose SCoRDNet, a transformer-based auto-regressive model that decodes relation–object sequences and the corresponding bounding boxes, using a two-step decoding process to maintain diversity while ensuring grounding. A new Open Images-based benchmark, OIv6-SCoRD, uses distribution-shifted train/test splits to stress generalization beyond dataset priors. Training additionally leverages ungrounded text-augmented data from image-caption sources to improve performance on unseen triplets. Results show substantial gains from text augmentation, including on unseen or underrepresented pairs, and competitive grounding relative to established scene-graph methods, highlighting the value of integrating image-text data for open-vocabulary relation grounding.

Abstract

We propose Subject-Conditional Relation Detection (SCoRD), where, conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark in which the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that, given a subject, predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. In particular, we obtain a recall@3 of 83.8% for our relation-object predictions, compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging, during training, relation-object pairs that are obtained automatically from textual captions and for which no object-box annotations are available. In particular, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we obtain a recall@3 of 33.80% for relation-object pairs and 26.75% for their box locations.
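To make the idea of "casting this output as a sequence of tokens" concrete, the following is a minimal sketch of how one grounded relation-object pair could be serialized into a decoding target. It is an illustration under stated assumptions, not the authors' implementation: the coordinate binning, the bin count `NUM_BINS`, the token names `<bin_i>` and `<eos>`, and the helpers `quantize_box` and `build_target` are all hypothetical.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max), each normalized to [0, 1]
Box = Tuple[float, float, float, float]

NUM_BINS = 100  # hypothetical number of discrete coordinate bins


def quantize_box(box: Box, num_bins: int = NUM_BINS) -> List[str]:
    """Map each normalized coordinate to a discrete location token (assumed scheme)."""
    tokens = []
    for value in box:
        value = min(max(value, 0.0), 1.0)               # clamp to [0, 1]
        index = min(int(value * num_bins), num_bins - 1)  # keep 1.0 in the last bin
        tokens.append(f"<bin_{index}>")
    return tokens


def build_target(relation: str, obj: str, obj_box: Box) -> List[str]:
    """Serialize one relation-object pair and its box as a single token sequence."""
    return relation.split() + obj.split() + quantize_box(obj_box) + ["<eos>"]


# Example target for the subject "man" (the subject and its box are model inputs).
target = build_target("holds", "guitar", (0.25, 0.5, 0.75, 1.0))
print(target)
```

With a target in this form, an auto-regressive decoder can be trained with ordinary next-token prediction, and several relation-object pairs for the same subject can be obtained by decoding multiple sequences.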

Paper Structure (20 sections, 5 equations, 5 figures, 8 tables, 1 algorithm)


Figures (5)

  • Figure 1: We cast subject-conditional relation detection (SCoRD) as a sequence decoding task where given an input subject in an image, we predict relation-object predicates and their locations. We also demonstrate how to leverage ungrounded training samples extracted from parsing textual captions. These samples are easier to obtain than fully grounded samples and have a potentially wider coverage of relation-object pairs.
  • Figure 2: Overview of SCoRDNet. On the left, we show how we handle fully grounded training samples, each consisting of a $\langle$subject, relation, object$\rangle$ triplet for which a box location is available for both the subject and the object. On the right, we show how we handle a training sample for which we only have a $\langle$subject, relation, object$\rangle$ triplet and no object location. In the second case, we do not backpropagate gradients through the prediction heads corresponding to object location coordinates, although the model often makes reasonable predictions thanks to parameter sharing with similar samples that are fully annotated.
  • Figure 3: Qualitative results from the models trained with and without text augmentation from COCO and CC. "Base" indicates results generated by the model trained with 50% of the training samples for the relation-object pairs in Rel-Obj Set A and no samples for the relation-object pairs in Rel-Obj Set B. "Text-aug" indicates results generated by the model trained with the Text-Augmented Training Split.
  • Figure 4: Qualitative results from the model trained with fully grounded data. Subjects and predicted relation-object pairs are shown under the images, and bounding boxes for subjects and objects are drawn in matching colors. At inference time, all subjects and their bounding boxes are provided as inputs, and we show the three predicted relation-object pairs with the highest scores.
  • Figure 5: Qualitative results from the models trained with and without text augmentation from COCO and CC. "Base" indicates results generated by the model trained with 50% of the training samples for the relation-object pairs in Rel-Obj Set A and no samples for the relation-object pairs in Rel-Obj Set B. "Text-aug" indicates results generated by the model trained with additional ungrounded data from COCO and CC for the relation-object pairs in both Rel-Obj Set A and Rel-Obj Set B.
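
The Figure 2 caption describes training on ungrounded samples without backpropagating through the object-location heads. One simple way to get that effect is to drop the box-related loss terms for such samples. The sketch below illustrates this with a per-step cross-entropy over toy probability vectors. It is an assumption-laden illustration, not the paper's loss: the function names, the `is_box_step` flags, and the use of a single averaged cross-entropy are hypothetical, and the actual model may use separate heads or a different box loss.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class under one predicted distribution."""
    return -math.log(probs[target])


def masked_sequence_loss(
    step_probs: List[Sequence[float]],  # predicted distribution at each decoding step
    targets: List[int],                 # target index at each step
    is_box_step: List[bool],            # True where the step predicts a box coordinate
    grounded: bool,                     # False when the sample has no object box
) -> float:
    """Average loss over steps, skipping box steps for ungrounded samples."""
    total, count = 0.0, 0
    for probs, target, is_box in zip(step_probs, targets, is_box_step):
        if is_box and not grounded:
            continue  # no box annotation: this step contributes no gradient
        total += cross_entropy(probs, target)
        count += 1
    return total / count if count else 0.0


# Toy example: one text step followed by one box-coordinate step.
probs = [[0.5, 0.5], [0.25, 0.75]]
loss_grounded = masked_sequence_loss(probs, [0, 0], [False, True], grounded=True)
loss_ungrounded = masked_sequence_loss(probs, [0, 0], [False, True], grounded=False)
```

Because the skipped steps contribute nothing to the loss, an ungrounded caption-derived sample still supervises the relation and object tokens, while the box predictions for it rely on what the shared parameters learned from fully annotated samples.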