Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

TL;DR

The paper tackles cross-domain traffic sign recognition with limited labeled data by pairing a ViT-Adapter-based traffic sign detector with a cross-domain few-shot in-context learning strategy on multimodal large language models. It generates textual descriptions from template traffic signs to bridge domain gaps and guide the MLLM in fine-grained classification, reducing reliance on large labeled datasets. Empirical results on four datasets show significant performance gains over baselines and demonstrate the practicality of MLLMs for cross-country TSR. This approach offers a scalable path toward robust TSR in diverse real-world conditions and across countries.

Abstract

Recent multimodal large language models (MLLMs) such as GPT-4o and GPT-4V have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on an MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on the Vision Transformer Adapter, together with an extraction module, to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance the MLLM's fine-grained recognition of traffic signs, the proposed method generates corresponding description texts from template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which helps the MLLM distinguish fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances TSR performance.
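The description-text idea in the abstract can be illustrated with a minimal sketch. It only shows how per-template descriptions of shape, color, and composition might be assembled into a text prompt and how a free-form reply might be mapped back to a label. The paper's actual prompts and MLLM calls are not given in this section, so every name here (`SignDescription`, `build_prompt`, `parse_answer`) and the prompt wording are hypothetical, and no real MLLM API is called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SignDescription:
    """Description text for one template sign (hypothetical structure)."""
    label: str
    shape: str
    color: str
    composition: str


def build_prompt(descriptions: List[SignDescription]) -> str:
    """Assemble a text prompt listing each candidate category with its description."""
    lines = [
        "You are given a cropped traffic sign image.",
        "Candidate categories and their descriptions:",
    ]
    for i, d in enumerate(descriptions, start=1):
        lines.append(
            f"{i}. {d.label}: shape={d.shape}; color={d.color}; "
            f"composition={d.composition}"
        )
    lines.append("Answer with the single best-matching category name.")
    return "\n".join(lines)


def parse_answer(reply: str, descriptions: List[SignDescription]) -> Optional[str]:
    """Map a free-form reply to a known label, trying longer labels first."""
    text = reply.lower()
    labels = sorted((d.label for d in descriptions), key=len, reverse=True)
    for label in labels:
        if label.lower() in text:
            return label
    return None  # the reply did not mention any known category


# Illustrative template descriptions (assumed values, not taken from the paper).
TEMPLATES = [
    SignDescription("Stop", "octagon", "red with white text", "the word STOP"),
    SignDescription("Speed limit 30", "circle", "white with red border", "the number 30"),
]
```

In use, the string returned by `build_prompt(TEMPLATES)` would be sent to an MLLM together with the cropped sign image, and `parse_answer` would turn the reply into a category. Matching longer labels first avoids a short label shadowing a longer one that contains it.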

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the proposed cross-domain few-shot in-context learning TSR method. We first perform traffic sign detection and extraction based on the proposed TSD network. Then we use template traffic signs to generate the description texts based on MLLM. The generated description texts contain key information about the shape, color, and composition of traffic signs, thus improving MLLM's reasoning ability for traffic signs.
  • Figure 2: Top-1 recognition results of the proposed cross-domain few-shot in-context learning TSR method. We show samples from the benchmark datasets (GTSRB, BTSD) and the real-world datasets (Sapporo and Yokohama urban road datasets). For the Sapporo and Yokohama samples, we show the full process, from TSD on the original road images to recognition of the extracted traffic signs by the MLLM using the proposed method.
  • Figure 3: Examples of generated description texts for different MLLMs under the unified prompt (GTSRB dataset).
  • Figure 4: Examples of generated description texts for different MLLMs under the unified prompt (Sapporo urban road dataset).
  • Figure 5: Top-1 recognition results of the baseline-o, baseline, and our method.
  • ...and 1 more figure