SemCo: Toward Semantic Coherent Visual Relationship Forecasting

Yangjun Ou, Yao Liu, Li Mi, Zhenzhong Chen

TL;DR

The paper tackles visual relationship forecasting (VRF) by addressing semantic coherence in object interactions. It introduces SemCoBench, a two-dataset benchmark with cleaner relation annotations and tightly coupled relation dynamics, to enable learning coherent relation transitions within short action windows. The proposed SemCoFormer integrates a Relationship Augmented Module (RAM) to fuse visual and semantic cues and a Coherence Reasoning Module (CRM) to model cross-frame dynamics via sparse coding, achieving superior accuracy and mAP over strong baselines. The work demonstrates that explicitly modeling semantic coherence and cross-modal cues substantially improves fine-grained, diverse VRF, offering a more robust pathway to reasoning about future video scenes.

Abstract

Visual Relationship Forecasting (VRF) aims to anticipate relations among objects without observing future visual content. The task relies on capturing and modeling the semantic coherence in object interactions, as it underpins the evolution of events and scenes in videos. However, existing VRF datasets offer limited support for learning such coherence due to noisy annotations and weak correlations between different actions and relationship transitions in subject-object pairs. Furthermore, existing methods struggle to distinguish similar relationships and overfit to unchanging relationships in consecutive frames. To address these challenges, we present SemCoBench, a benchmark that emphasizes semantic coherence for visual relationship forecasting. Based on action labels and short-term subject-object pairs, SemCoBench decomposes relationship categories and dynamics by cleaning and reorganizing video datasets to ensure the semantic coherence of the object interactions to be predicted. In addition, we present the Semantic Coherent Transformer (SemCoFormer), which models semantic coherence with a Relationship Augmented Module (RAM) and a Coherence Reasoning Module (CRM). RAM is designed to distinguish similar relationships, and CRM helps the model focus on the dynamics in relationships. The experimental results on SemCoBench demonstrate that modeling semantic coherence is a key step toward reasonable, fine-grained, and diverse visual relationship forecasting, contributing to a more comprehensive understanding of video scenes.

Paper Structure

This paper contains 31 sections, 17 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Illustration of the action-level forecasting task (a), the object-level forecasting task (c), and the relation-level forecasting task (b). Unlike action-level and object-level forecasting, relation-level forecasting predicts future interactions between objects on a fine-grained time scale, relying on the semantic coherence of the interactions that make up an action or event.
  • Figure 2: Comparison of (a) the previous dataset (i.e., the Action Genome dataset) and (b) the proposed dataset for the VRF task. (a) The Action Genome dataset requires predicting relations that are associated with different action labels (e.g., 'Opening refrigerator', 'Closing refrigerator', and 'Holding') or subject-object pairs (e.g., person-refrigerator, person-cup), or relations that are not linked to certain actions (e.g., in front of). (b) The proposed SemCoBench. Each sample is associated with an action (e.g., 'Opening refrigerator') or a subject-object pair (e.g., person-refrigerator) in a short period of video to ensure the semantic coherence of relation dynamics.
  • Figure 3: Object and relation statistics per category with colors indicating different types in SemCo-VidOR and SemCo-AG datasets.
  • Figure 4: The number of keyframes per video in SemCo-VidOR and SemCo-AG datasets. The horizontal axis represents the number of videos, and the vertical axis represents the number of keyframes per video.
  • Figure 5: Examples from the SemCo-AG dataset. Selected video frames of 'Picking up something' and 'Lying, Standing, Sitting' are shown.
  • ...and 7 more figures