Audio-Visual Traffic Light State Detection for Urban Robots

Sagar Gupta, Akansel Cosgun

TL;DR

This work tackles pedestrian traffic light state detection for urban robots under challenging visual conditions by proposing a multimodal pipeline that fuses vision and audio cues. Vision relies on a YOLO-based detector combined with HSV hue-based color analysis within bounding boxes, while audio uses MFCC features fed to a Random Forest classifier; frame-level features are fused across a temporal window to enhance robustness. The authors compare feature-level and decision-level fusion strategies, showing that fusion substantially improves accuracy under occlusion and motion, with no-occlusion performance reaching around 99% and motion scenarios exceeding 98%, while maintaining real-time feasibility on consumer hardware. The approach is validated on a dataset collected with a quadruped robot and existing imagery, and is implemented on a Unitree Go1 with ROS and a FastAPI server to drive autonomous navigation across crosswalks, highlighting the practical potential of multi-modal perception in urban robotics.

Abstract

We present a multimodal traffic light state detection method using vision and sound, from the viewpoint of a quadruped robot navigating in urban settings. This is a challenging problem because of visual occlusions and noise from robot locomotion. Our method combines features from raw audio with the ratios of red and green pixels within bounding boxes identified by established vision-based detectors. The fusion method aggregates features across multiple frames in a given timeframe, increasing robustness and adaptability. Results show that our approach effectively addresses the challenge of visual occlusion and surpasses the performance of single-modality solutions when the robot is in motion. This study serves as a proof of concept, highlighting the significant, yet often overlooked, potential of multimodal perception in robotics.
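
The fusion step described in the abstract can be illustrated with a minimal Python sketch: per-frame features are aggregated over a time window, and the visual and audio results are then combined into one vector. The abstract does not say how the aggregation is done, so the mean used here is an assumption, as are the function names `aggregate_window` and `fuse_features` and the toy feature values. The classifier that would consume the fused vector is omitted.

```python
from statistics import fmean
from typing import List, Sequence


def aggregate_window(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average each feature dimension over the frames in one time window.

    Mean aggregation is an assumption; the source only says features are
    aggregated across multiple frames.
    """
    if not frames:
        raise ValueError("window must contain at least one frame")
    width = len(frames[0])
    if any(len(f) != width for f in frames):
        raise ValueError("all frames must have the same number of features")
    return [fmean(column) for column in zip(*frames)]


def fuse_features(
    visual_frames: Sequence[Sequence[float]],
    audio_frames: Sequence[Sequence[float]],
) -> List[float]:
    """Feature-level fusion (assumed form): concatenate the window-averaged
    visual features with the window-averaged audio features."""
    return aggregate_window(visual_frames) + aggregate_window(audio_frames)


# Hypothetical window: 3 video frames of (red_ratio, green_ratio) and
# 2 audio frames of a 3-dimensional audio feature vector.
visual = [(0.5, 0.0), (0.25, 0.0), (0.75, 0.0)]
audio = [(1.0, 2.0, 3.0), (3.0, 2.0, 1.0)]
fused = fuse_features(visual, audio)
print(fused)  # [0.5, 0.0, 2.0, 2.0, 2.0]
```

Because the two modalities are averaged separately before concatenation, the video and audio streams may run at different frame rates within the same window.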

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: In urban settings, robots often struggle with visual occlusion, hindering their ability to detect traffic lights. Our solution combines auditory cues with vision, utilizing traffic lights' sound patterns to indicate their state. This method ensures robots can navigate effectively, even when visual signals are obscured.
  • Figure 2: Proposed Audio-Visual Feature Fusion Architecture
  • Figure 3: Examples from our video dataset, collected onboard a Unitree Go1 robot using a smartphone camera
  • Figure 4: Pedestrian traffic light (PTL) state detection using object detection and pixel counting
  • Figure 5: Hue histograms of red and green PTL bounding boxes, averaged over all occlusion-free frames in our video dataset captured onboard the robot.
  • ...and 3 more figures
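
The pixel-counting idea behind Figures 4 and 5 can be sketched in a few lines of Python: convert each pixel inside a detected bounding box to HSV, then report the fraction of pixels whose hue falls in a red or a green band. This is only an illustration using the standard-library `colorsys` module. The hue bands, the saturation and brightness cut-offs, and the name `color_ratios` are assumptions, not values or code from the paper.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B), each in 0..255
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive


def color_ratios(
    image: Sequence[Sequence[Pixel]],
    box: Box,
    min_sat: float = 0.4,   # assumed cut-off: ignore washed-out pixels
    min_val: float = 0.4,   # assumed cut-off: ignore dark pixels
) -> Tuple[float, float]:
    """Return (red_ratio, green_ratio) for the pixels inside `box`."""
    x0, y0, x1, y1 = box
    red = green = total = 0
    for row in image[y0:y1]:
        for r, g, b in row[x0:x1]:
            total += 1
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            if s < min_sat or v < min_val:
                continue  # too grey or too dark to count as a lit lamp
            hue_deg = h * 360
            # Hue bands below are illustrative assumptions.
            if hue_deg < 20 or hue_deg >= 340:
                red += 1
            elif 80 <= hue_deg <= 160:
                green += 1
    if total == 0:
        return 0.0, 0.0
    return red / total, green / total


# Toy 2x3 image; the box covers the left 2x2 block.
image: List[List[Pixel]] = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (255, 0, 0), (0, 0, 255)],
]
ratios = color_ratios(image, (0, 0, 2, 2))
print(ratios)  # (0.5, 0.25)
```

In a real pipeline the box would come from the object detector, and the two ratios would serve as the per-frame visual features.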