CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos

Trong-Thuan Nguyen, Pha Nguyen, Xin Li, Jackson Cothren, Alper Yilmaz, Khoa Luu

TL;DR

The Cyclic Graph Transformer (CYCLO) captures both direct and long-range temporal dependencies by continuously updating the history of interactions in a circular manner. This lets it model periodic and overlapping relationships while minimizing information loss.

Abstract

Video scene graph generation (VidSGG) has emerged as a transformative approach to capturing and interpreting the intricate relationships among objects and their temporal dynamics in video sequences. In this paper, we introduce the new AeroEye dataset, which focuses on multi-object relationship modeling in aerial videos. AeroEye features a variety of drone scenes and includes a visually comprehensive and precise collection of predicates that capture the intricate relationships and spatial arrangements among objects. To model these relationships, we propose the novel Cyclic Graph Transformer (CYCLO) approach, which captures both direct and long-range temporal dependencies by continuously updating the history of interactions in a circular manner. The approach also handles sequences with inherent cyclical patterns and processes object relationships in the correct sequential order. As a result, it can effectively capture periodic and overlapping relationships while minimizing information loss. Extensive experiments on the AeroEye dataset demonstrate the effectiveness of the proposed CYCLO model and its potential for scene understanding in drone videos. Finally, CYCLO consistently achieves state-of-the-art (SOTA) results on two in-the-wild scene graph generation benchmarks, PVSG and ASPIRe.
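
As a rough illustration of what "updating the history of interactions in a circular manner" could look like, the toy sketch below lets every frame attend to itself and to a few preceding frames, with frame indices wrapping around the end of the clip. This is not the authors' CYCLO implementation: the function names (`cyclic_neighbors`, `cyclic_attention`), the `window` parameter, and the use of plain dot-product attention without learned projections are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cyclic_neighbors(t, num_frames, window):
    """Indices of the `window` frames preceding frame t, wrapping around the clip.

    Hypothetical helper: the modulo makes frame 0 look back at the last frames.
    """
    return [(t - k) % num_frames for k in range(1, window + 1)]


def cyclic_attention(frames, window):
    """Toy cyclic temporal attention (an assumption, not the paper's model).

    frames: list of per-frame relationship feature vectors (lists of floats).
    Each output vector is an attention-weighted average of the frame itself
    and its `window` cyclic predecessors.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    outputs = []
    for t, query in enumerate(frames):
        indices = [t] + cyclic_neighbors(t, num_frames, window)
        # Scaled dot-product score between the current frame and each neighbor.
        scores = [
            sum(q * k for q, k in zip(query, frames[j])) / math.sqrt(dim)
            for j in indices
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * frames[j][d] for w, j in zip(weights, indices)) for d in range(dim)]
        )
    return outputs
```

In this sketch the first frame can draw on the end of the clip just as a middle frame draws on its immediate past, which is the wrap-around property the abstract attributes to circular connectivity. The actual model may realize it quite differently.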

Paper Structure (29 sections, 10 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Multi-object relationship modeling in aerial videos analyzes a drone-captured video to detect and refine object relationships over time. The CYCLO model first identifies relationships between objects in individual frames and then incorporates temporal information about object positions and interactions to refine the understanding of those relationships across the video sequence. (Best viewed in color.)
  • Figure 2: Comparison of CYCLO with existing relationship modeling approaches: (a) Progression [yang2022panoptic, shang2021video]: frame-wise fusion and classification; (b) Batch-progression [cong2021spatial, khandelwal2022iterative, nag2023unbiased]: temporal transformer; (c) Hierarchy [nguyen2024hig]: spatiotemporal graph; (d) Our CYCLO approach: circular connectivity for capturing temporal dependencies.
  • Figure 3: Example annotation in our dataset. In Fig. \ref{fig:examples}b, straight arrows denote relationships between objects, while curved arrows indicate the positions of the objects. Nodes of the same color represent the same object, and the labels on the edges specify the predicate of each relationship. (Best viewed in color.)
  • Figure 4: Relationship word cloud on AeroEye dataset.
  • Figure 5: Statistics for each scene on the AeroEye dataset.
  • ...and 6 more figures