Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024

Jiahan Li, Zhiqi Li, Tong Lu

TL;DR

This work tackles driving with language by fine-tuning the open-source multimodal model InternVL-1.5 on the DriveLM-nuScenes dataset, so that a single model can reason jointly over perception and language across multi-view driving scenes. The pipeline converts object center points to bounding boxes via Segment Anything, concatenates the six camera views into a single $2688\times 896$ input, and trains end-to-end on $64$ A100 GPUs with a learning rate of $2\times 10^{-5}$ for one epoch using DeepSpeed ZeRO-3, reaching a final leaderboard score of $0.6002$. Temporal fusion is also explored but runs into data-format challenges, while ensembling v1 and v2 is offered as a route to higher performance. Overall, the results show that open-source multimodal models can achieve competitive perception-language grounding in autonomous driving tasks, with practical implications for scalable, language-informed perception and decision-making in real-world scenarios.
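The $2688\times 896$ figure is consistent with tiling six views as a 2-row by 3-column grid of $896\times 448$ tiles, since $3\times 896 = 2688$ and $2\times 448 = 896$. The sketch below illustrates that arithmetic only. The grid layout, the camera ordering, and the nearest-neighbour resize are assumptions made for illustration and are not taken from the summary, which gives only the final size.

```python
import numpy as np

# Assumed tile size: 3 * 896 = 2688 wide, 2 * 448 = 896 tall.
TILE_W, TILE_H = 896, 448

# nuScenes camera names; their placement in the grid is an assumption.
GRID = [
    ["CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT"],
    ["CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT"],
]


def resize_nearest(img, width, height):
    """Nearest-neighbour resize of an HxWxC array (stand-in for a real resampler)."""
    src_h, src_w = img.shape[:2]
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return img[rows[:, None], cols[None, :]]


def concat_views(views):
    """Tile six camera images into one array of shape (896, 2688, 3)."""
    tiled_rows = [
        np.concatenate(
            [resize_nearest(views[name], TILE_W, TILE_H) for name in row], axis=1
        )
        for row in GRID
    ]
    return np.concatenate(tiled_rows, axis=0)


# Dummy 1600x900 frames, each filled with its own index so tiles are identifiable.
names = [name for row in GRID for name in row]
views = {name: np.full((900, 1600, 3), i, dtype=np.uint8) for i, name in enumerate(names)}
canvas = concat_views(views)
```

In practice the frames would be loaded from disk and resized with a proper interpolating resampler; the dummy arrays here only make the layout easy to check.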

Abstract

This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We used a powerful open-source multimodal model, InternVL-1.5, and performed full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To handle the multi-view images of nuScenes effectively while inheriting InternVL's strong multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL's image understanding capabilities. In addition, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the final leaderboard.
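The TL;DR attributes the center-to-box conversion to Segment Anything. One plausible reading is that each center point prompts the segmenter for a mask, and the tight box around that mask becomes the annotation. The sketch below covers only the second half of that reading. The function `mask_to_bbox` is a hypothetical helper, and the toy mask stands in for a real segmentation output; neither comes from the report.

```python
import numpy as np


def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) around the True pixels, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy mask standing in for a segmenter's output when prompted at center (x=5, y=4).
mask = np.zeros((10, 12), dtype=bool)
mask[2:7, 3:9] = True
bbox = mask_to_bbox(mask)
```

Returning `None` for an empty mask lets a caller fall back to the original center point when the segmenter finds nothing.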

Paper Structure

This paper contains 7 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overall Architecture.
  • Figure 2: The concatenated image.