RALACs: Action Recognition in Autonomous Vehicles using Interaction Encoding and Optical Flow

Eddy Zhou; Alex Zhuang; Alikasim Budhwani; Owen Leather; Rowan Dempster; Quanquan Li; Mohammad Al-Sharman; Derek Rayside; William Melek

TL;DR

This work tackles action recognition in autonomous vehicle settings by introducing RALACs, a two-stage, online framework that localizes active road agents and classifies their actions. It combines RGB and optical-flow cues, uses Dynamic ROI Alignment (DROI) to extract features along the tracks of moving agents, and extends higher-order interaction encoding to multi-class road agents for robust classification. Key contributions include active-agent detection via flow-RGB fusion, online tube linking with OC-SORT, DROI, and the adaptation of HR^2O-like interactions to road scenarios, all validated on the ICCV2021 ROAD Challenge dataset and demonstrated on a real vehicle. Results show that RALACs outperforms the baseline 3D-RetinaNet in frame- and video-level metrics, and deployment experiments give preliminary evidence of its benefits for perception integration and environment-model decision making.

Abstract

When applied to autonomous vehicle (AV) settings, action recognition can enhance an environment model's situational awareness. This is especially valuable in scenarios where the traditional geometric descriptions and heuristics used in AVs are insufficient. However, action recognition has traditionally been studied for humans, and its limited adaptability to noisy, unclipped, uncurated raw RGB data has restricted its application in other fields. To push for the advancement and adoption of action recognition in AVs, this work proposes a novel two-stage action recognition system, termed RALACs. RALACs formulates the problem of action recognition for road scenes and bridges the gap between it and the established field of human action recognition. This work shows how attention layers can be used to encode the relations across agents, and stresses that such a scheme can be class-agnostic. Furthermore, to address the dynamic nature of agents on the road, RALACs introduces a novel approach to adapting Region of Interest (ROI) Alignment to agent tracks for downstream action classification. Finally, our scheme also considers the problem of active agent detection, and uses a novel application of optical flow map fusion to discern relevant agents in a road scene. We show that our proposed scheme can outperform the baseline on the ICCV2021 ROAD Challenge dataset. By deploying it on a real vehicle platform, we also provide preliminary insight into the usefulness of action recognition in decision making.
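To make the idea of attention-based, class-agnostic relation encoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes each agent is already summarized by a single feature vector, uses identity query/key/value projections instead of learned ones, and adds a residual connection; the function names `softmax` and `encode_interactions` are hypothetical. Note that no agent class label enters the computation, which is what makes the scheme class-agnostic.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_interactions(agent_features):
    """Scaled dot-product attention across agents (illustrative only).

    agent_features: list of N equal-length feature vectors, one per agent.
    Returns N vectors: each agent's own feature plus an attention-weighted
    sum of all agents' features (assumed residual connection).
    """
    dim = len(agent_features[0])
    scale = math.sqrt(dim)
    encoded = []
    for query in agent_features:
        # Similarity of this agent to every agent in the scene, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in agent_features
        ]
        weights = softmax(scores)
        # Context vector: attention-weighted sum of all agent features.
        context = [
            sum(w * feat[d] for w, feat in zip(weights, agent_features))
            for d in range(dim)
        ]
        encoded.append([q + c for q, c in zip(query, context)])
    return encoded


# Two agents with identical features attend to each other equally.
example = encode_interactions([[1.0, 0.0], [1.0, 0.0]])
```

A trained model would replace the identity projections with learned linear layers and would typically operate on spatial feature maps rather than single vectors.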

Paper Structure (22 sections, 9 equations, 9 figures, 3 tables, 3 algorithms)

Introduction
Related Works
Human Action Recognition
Human Action Localization
Optical Flow Estimation
Action Recognition for Autonomous Vehicles
Methodology
Action Localization with Optical Flow
Optical-Flow-based Active Agent Detection
Proposal Filtering and Linking
Action Classification with Interaction Encoding and Tube Priors
Experimentation
ICCV2021 ROAD Challenge
Action Track Post-processing
Implementation Details
...and 7 more sections

Figures (9)

Figure 1: Action Classification pipeline from raw RGB data. Agent tracks are appended with an action attribute which may be useful for downstream decision making. In (c), for example, we observe a pedestrian waiting to cross the street. Because of this attribute, our ego vehicle can have greater confidence in crossing the intersection, knowing that the pedestrian is properly obeying traffic rules. In this work, we focus on making Action Classification more effective and realizable on autonomous vehicles, and we show how encoding interactions between agents and fusing optical flow into detection contribute significantly to this goal.
Figure 2: RALACs system architecture. Given a set of frames in a clip, their optical flow is estimated using RAFT [teed2020raft]. The RGB and optical flow clips are sent frame-by-frame into the object detector, where each RGB frame and its corresponding optical flow map are encoded and summed at multiple feature scales. We found that applying a pretrained RGB backbone to 3-channel optical flow improved model predictions. Following detection, an online object tracker is used to link detections into tubelets. Tubelets present in the key frame of the clip are then fed into our action classifier, which encodes agent tubes through a novel Dynamic ROI-Alignment procedure that takes advantage of the inherent structure of the feature encodings output by SlowFast. Following encoding, we adopt the higher-order relation reasoning of [acar2020] to compute the attention between each encoded agent tubelet and the other agents present in the clip. The result is the action predictions of agents present in the key frame.
Figure 3: Action Localization comparison between RGB-only and Flow-fused detectors. The use of optical flow gives the network sufficient information to classify an agent as active or inactive.
Figure 4: Analysis of Action Classification improvements by action class. We see that the use of Interaction Encoding improves the classification of actions that naturally require more interactions with other agents (see braking and crossing (X) actions). Furthermore, DROI alignment is also shown to be an effective method for extracting the information present in agent tracks, especially during actions of large movements.
Figure 5: Action Classification comparison at various intersection scenes. The top row (a-c) shows actions classified without the use of Interaction Encoding. In these cases, the network seems oblivious to the context of the other agents around it (other agents moving on green, traffic-light colours). In contrast, the use of Interaction Encoding (d-f) seems to correct these faults. We attribute this to the network's newfound ability to comprehend action interactions. Note: Only pedestrians can be considered as Waiting to Cross.
...and 4 more figures
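
The Figure 2 caption describes encoding agent tubes with a Dynamic ROI-Alignment procedure. The intuition of pooling features along a moving track, rather than at one fixed key-frame box, can be sketched as below. This is a simplified illustration under stated assumptions and not the paper's DROI: it uses single-channel feature maps, integer box coordinates, and mean pooling in place of bilinear ROI Align, and it assumes one feature map per tracked box. All function names are hypothetical.

```python
def roi_mean(feature_map, box):
    """Mean of a 2D feature map (list of rows) inside box = (x1, y1, x2, y2).

    Coordinates are integer cell indices with exclusive upper bounds.
    """
    x1, y1, x2, y2 = box
    cells = [feature_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(cells) / len(cells)


def static_roi_pool(clip_features, key_box):
    """Baseline: reuse the key-frame box on every frame of the clip."""
    return [roi_mean(frame, key_box) for frame in clip_features]


def dynamic_roi_pool(clip_features, track_boxes):
    """Pool each frame at the box the tracker reported for that frame."""
    if len(clip_features) != len(track_boxes):
        raise ValueError("need exactly one box per frame")
    return [roi_mean(frame, box) for frame, box in zip(clip_features, track_boxes)]


def make_frame(agent_col, width=4, height=2):
    """Toy feature map: 1.0 in the agent's column, 0.0 elsewhere."""
    return [[1.0 if x == agent_col else 0.0 for x in range(width)]
            for _ in range(height)]


# An agent moving one cell to the right per frame over three frames.
clip = [make_frame(t) for t in range(3)]
track = [(t, 0, t + 1, 2) for t in range(3)]

dynamic = dynamic_roi_pool(clip, track)          # [1.0, 1.0, 1.0]
static = static_roi_pool(clip, track[1])         # [0.0, 1.0, 0.0]
```

In this toy example the static box only captures the agent in the key frame, while the per-frame boxes keep the agent inside the pooled region throughout the clip, which is the effect the caption attributes to actions with large movements.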
