Audio-Visual Traffic Light State Detection for Urban Robots

Sagar Gupta; Akansel Cosgun

TL;DR

This work tackles pedestrian traffic light state detection for urban robots under challenging visual conditions by proposing a multimodal pipeline that fuses vision and audio cues. Vision relies on a YOLO-based detector combined with HSV hue-based color analysis within bounding boxes, while audio uses MFCC features fed to a Random Forest classifier; frame-level features are fused across a temporal window to enhance robustness. The authors compare feature-level and decision-level fusion strategies, showing that fusion substantially improves accuracy under occlusion and motion, with no-occlusion performance reaching around 99% and motion scenarios exceeding 98%, while maintaining real-time feasibility on consumer hardware. The approach is validated on a dataset collected with a quadruped robot and existing imagery, and is implemented on a Unitree Go1 with ROS and a FastAPI server to drive autonomous navigation across crosswalks, highlighting the practical potential of multi-modal perception in urban robotics.
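The two fusion strategies named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-of-probabilities interface, and the `vision_weight` parameter are all assumptions made for the example. Feature-level fusion joins the per-modality feature vectors before a single classifier sees them, while decision-level fusion blends the outputs of two classifiers trained separately.

```python
def feature_level_fusion(vision_features, audio_features):
    """Concatenate per-modality feature vectors into one input vector.

    A single downstream classifier would then be trained on the result.
    """
    return list(vision_features) + list(audio_features)


def decision_level_fusion(vision_probs, audio_probs, vision_weight=0.5):
    """Blend per-class probabilities from two independent classifiers.

    vision_probs / audio_probs: dicts mapping a state label to a probability.
    vision_weight: hypothetical blending weight in [0, 1] (an assumption,
    not a value taken from the paper).
    Returns the winning label and the fused probability table.
    """
    if not 0.0 <= vision_weight <= 1.0:
        raise ValueError("vision_weight must lie in [0, 1]")
    audio_weight = 1.0 - vision_weight
    labels = set(vision_probs) | set(audio_probs)
    fused = {
        label: vision_weight * vision_probs.get(label, 0.0)
        + audio_weight * audio_probs.get(label, 0.0)
        for label in labels
    }
    best = max(fused, key=fused.get)
    return best, fused
```

With equal weights, a confident audio prediction can override an uncertain visual one, which is the behaviour one would want when the light is partly occluded.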

Abstract

We present a multimodal method for traffic light state detection using vision and sound, from the viewpoint of a quadruped robot navigating in urban settings. The problem is challenging because of visual occlusions and noise from robot locomotion. Our method combines features from raw audio with the ratios of red and green pixels within bounding boxes identified by established vision-based detectors. The fusion method aggregates features across multiple frames within a given time window, increasing robustness and adaptability. Results show that our approach effectively addresses visual occlusion and outperforms single-modality solutions when the robot is in motion. This study serves as a proof of concept, highlighting the significant, yet often overlooked, potential of multimodal perception in robotics.
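The visual feature described in the abstract, the share of red and green pixels inside a detected bounding box, can be sketched with the standard library alone. The hue ranges, the saturation and brightness cut-offs, and the use of a plain mean over the frame window are assumptions chosen for illustration; the paper's actual thresholds and aggregation rule are not stated in this summary.

```python
import colorsys

# Hypothetical hue ranges in degrees (0-360); not taken from the paper.
RED_HUES = ((0.0, 20.0), (340.0, 360.0))
GREEN_HUES = ((90.0, 180.0),)
# Assumed cut-offs so that dull or dark pixels are not counted as a colour.
MIN_SAT, MIN_VAL = 0.4, 0.4


def _in_ranges(hue_deg, ranges):
    return any(lo <= hue_deg <= hi for lo, hi in ranges)


def color_ratios(pixels):
    """Return (red_ratio, green_ratio) for (r, g, b) pixels with 0-255 channels.

    `pixels` stands for the crop inside one bounding box, flattened to a list.
    """
    red = green = total = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += 1
        if s < MIN_SAT or v < MIN_VAL:
            continue  # counted in the total, but neither red nor green
        hue_deg = h * 360.0
        if _in_ranges(hue_deg, RED_HUES):
            red += 1
        elif _in_ranges(hue_deg, GREEN_HUES):
            green += 1
    if total == 0:
        return 0.0, 0.0
    return red / total, green / total


def aggregate_window(per_frame_ratios):
    """Average per-frame (red_ratio, green_ratio) pairs over a time window."""
    n = len(per_frame_ratios)
    if n == 0:
        return 0.0, 0.0
    red_mean = sum(r for r, _ in per_frame_ratios) / n
    green_mean = sum(g for _, g in per_frame_ratios) / n
    return red_mean, green_mean
```

Averaging over several frames smooths out single frames in which the light is blurred or briefly hidden, which is one plausible reading of the multi-frame aggregation the abstract mentions.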

Paper Structure (17 sections, 3 equations, 8 figures, 5 tables)

Introduction
System Overview
Vision-based Detection
Audio-based Detection
Number of Mel Frequency Cepstral Coefficients
Including delta and delta-delta coefficients
Choice of Frame Length and Algorithm
Performance on Video Dataset
Audio-Visual Fusion
Feature Level Fusion
Decision Level Fusion
Results
Under no visual occlusion on stationary robot
Under Visual Occlusion
Under Robot Movement
...and 2 more sections

Figures (8)

Figure 1: In urban settings, robots often struggle with visual occlusion, hindering their ability to detect traffic lights. Our solution combines auditory cues with vision, utilizing traffic lights' sound patterns to indicate their state. This method ensures robots can navigate effectively, even when visual signals are obscured.
Figure 2: Proposed Audio-Visual Feature Fusion Architecture
Figure 3: Examples from our video dataset collected onboard Unitree Go1 robot using smartphone camera
Figure 4: PTL state detection using object detection and pixel counting
Figure 5: Hue histogram of red and green PTL bounding boxes, averaged over all frames in our video dataset containing no occlusion captured on-board robot.
...and 3 more figures
