Audio-Visual Traffic Light State Detection for Urban Robots
Sagar Gupta, Akansel Cosgun
TL;DR
This work tackles pedestrian traffic light state detection for urban robots under challenging visual conditions by proposing a multimodal pipeline that fuses vision and audio cues. Vision relies on a YOLO-based detector combined with HSV hue-based color analysis within the detected bounding boxes, while audio uses MFCC features fed to a Random Forest classifier; frame-level features are fused across a temporal window to improve robustness. The authors compare feature-level and decision-level fusion strategies and show that fusion substantially improves accuracy under occlusion and motion: accuracy reaches around 99% without occlusion and exceeds 98% in motion scenarios, while the pipeline remains feasible in real time on consumer hardware. The approach is validated on a dataset collected with a quadruped robot and supplemented with existing imagery, and it is implemented on a Unitree Go1 with ROS and a FastAPI server to drive autonomous navigation across crosswalks, highlighting the practical potential of multimodal perception in urban robotics.
Abstract
We present a multimodal method for traffic light state detection using vision and sound, from the viewpoint of a quadruped robot navigating in urban settings. This is a challenging problem because of visual occlusions and the noise from robot locomotion. Our method combines features from raw audio with the ratios of red and green pixels within bounding boxes identified by established vision-based detectors. The fusion method aggregates features across multiple frames within a given time window, increasing robustness and adaptability. Results show that our approach effectively addresses the challenge of visual occlusion and surpasses the performance of single-modality solutions when the robot is in motion. This study serves as a proof of concept, highlighting the significant, yet often overlooked, potential of multimodal perception in robotics.
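
As an illustration of the visual features described above, the following minimal sketch computes the ratios of red and green pixels inside a detected bounding-box crop and pools them over a window of frames before appending an audio feature vector. It is not the authors' implementation: the function names, the hue ranges, the saturation and value thresholds, and the use of mean pooling are all assumptions chosen for illustration, and the audio vector is treated as precomputed (for example, MFCC statistics).

```python
import numpy as np


def red_green_ratios(crop_rgb, sat_min=0.4, val_min=0.4):
    """Return (red_ratio, green_ratio) for an RGB crop with values in [0, 255].

    Hue ranges and the sat/val thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    rgb = np.asarray(crop_rgb, dtype=np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    n = r.size
    if n == 0:
        return 0.0, 0.0

    mx = rgb.max(axis=-1)            # HSV value
    delta = mx - rgb.min(axis=-1)    # chroma
    safe = np.where(delta > 0, delta, 1.0)

    # Standard RGB -> hue conversion, in degrees [0, 360).
    hue = np.where(
        mx == r, ((g - b) / safe) % 6.0,
        np.where(mx == g, (b - r) / safe + 2.0, (r - g) / safe + 4.0),
    ) * 60.0
    sat = delta / np.where(mx > 0, mx, 1.0)

    # Ignore grey, dark, or washed-out pixels.
    vivid = (delta > 0) & (sat >= sat_min) & (mx >= val_min)
    red = vivid & ((hue < 20.0) | (hue >= 340.0))      # assumed red band
    green = vivid & (hue >= 90.0) & (hue < 170.0)      # assumed green band
    return float(red.sum() / n), float(green.sum() / n)


def aggregate_window(per_frame_ratios, audio_features):
    """Mean-pool per-frame (red, green) ratios and append audio features.

    Mean pooling is an assumed aggregation; the paper only states that
    features are aggregated across frames in a time window.
    """
    visual = np.asarray(per_frame_ratios, dtype=np.float64).reshape(-1, 2)
    pooled = visual.mean(axis=0) if len(visual) else np.zeros(2)
    audio = np.asarray(audio_features, dtype=np.float64).ravel()
    return np.concatenate([pooled, audio])
```

The resulting vector could then be passed to any classifier for feature-level fusion; pooling over a window means that a few occluded frames lower the ratios without removing the visual evidence entirely.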
