SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordonez
TL;DR
This work introduces SCoRD (Subject-Conditional Relation Detection), a task that, given a subject in an image, enumerates all relevant relation-object pairs and their locations. The authors propose SCoRDNet, a transformer-based auto-regressive model that decodes relation-object sequences and their corresponding bounding boxes, using a two-step decoding process to maintain diversity while ensuring grounding. A new Open Images-based benchmark, OIv6-SCoRD, uses distribution-shifted train/test splits to stress generalization beyond dataset priors, and ungrounded text-augmented data from image-caption sources is used to improve performance on unseen triplets. Results show substantial gains from text augmentation, including on unseen or underrepresented pairs, and a more exhaustive enumeration of relation-object pairs than established scene-graph methods, highlighting the value of integrating image-text data for open-vocabulary relation grounding.
Abstract
We propose Subject-Conditional Relation Detection (SCoRD), where, conditioned on an input subject, the goal is to predict all of its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark in which the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that, given a subject, predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. In particular, we obtain a recall@3 of 83.8% for our relation-object predictions, compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging, during training, relation-object pairs that are obtained automatically from textual captions and have no object-box annotations. In particular, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we obtain a recall@3 of 33.80% for relation-object pairs and 26.75% for their box locations.
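
To make the recall@3 figures above concrete, the following is a minimal sketch of one common way to compute recall@k over relation-object pairs. It assumes that each test case has a single ground-truth pair and a ranked list of predicted pairs, and that a case counts as a hit when the ground-truth pair appears among the top k predictions. The function name `recall_at_k`, the `Pair` type, and the toy data are hypothetical and used only for illustration; the exact evaluation protocol of the paper may differ.

```python
from typing import List, Tuple

# Hypothetical representation: a (relation, object) pair for a given subject.
Pair = Tuple[str, str]


def recall_at_k(
    ranked_predictions: List[List[Pair]],
    ground_truth: List[Pair],
    k: int = 3,
) -> float:
    """Fraction of test cases whose ground-truth pair is in the top-k predictions.

    Assumption: one ground-truth pair per test case, exact string match.
    """
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for preds, gt in zip(ranked_predictions, ground_truth)
        if gt in preds[:k]  # only the k highest-ranked predictions count
    )
    return hits / len(ground_truth)


# Toy data (invented for illustration, not taken from the benchmark).
preds = [
    [("holds", "cup"), ("wears", "hat"), ("sits on", "chair")],
    [("rides", "horse"), ("wears", "helmet"), ("holds", "rope"), ("next to", "fence")],
]
truth = [("wears", "hat"), ("next to", "fence")]

score = recall_at_k(preds, truth, k=3)
print(f"recall@3 = {score:.2f}")
```

In this toy example the first ground-truth pair is ranked second and counts as a hit, while the second is ranked fourth and falls outside the top 3. A box-level variant would additionally require the predicted box to overlap the ground-truth box sufficiently, which this sketch does not model.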
