CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario

Zhizhao Duan; Hao Cheng; Duo Xu; Xi Wu; Xiangxie Zhang; Xi Ye; Zhen Xie

TL;DR

CityLLaVA tackles the challenge of adapting large visual-language models to urban traffic tasks by integrating bounding-box guided view selection, visual and textual prompt engineering, short QA construction, and efficient block-expansion fine-tuning. A key innovation is the combination of global and locally cropped visual inputs, guided by bounding boxes, plus carefully crafted prompts and concise QA pairs to improve fine-grained scene understanding. Sequential questioning during inference (notably Vehicle→Pedestrian) augments predictions and yields superior accuracy, while RLHF with DPO is found ineffective for this task. The approach achieves state-of-the-art performance on the WTS-based AI City Challenge Track 2 with a score of 33.4308, demonstrating strong practical impact for urban safety description and analysis tasks and providing a reproducible codebase.

Abstract

In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy by (1) employing bounding boxes for visual data preprocessing, including video best-view selection and visual prompt engineering during both training and testing; (2) constructing concise Question-Answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) improving prediction accuracy via sequential questioning-based prediction augmentation. Our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code is available at https://github.com/alibaba/AICITY2024_Track2_AliOpenTrek_CityLLaVA
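
As an illustration of step (1), the sketch below shows one way bounding boxes could drive both view selection and local cropping. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `expand_crop` and `select_best_view`, the `(x, y, width, height)` box format, the 1.5 enlargement factor, and the largest-area selection criterion are all hypothetical choices made for this example.

```python
from typing import Dict, Tuple

# Assumed box format: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def expand_crop(box: Box, img_w: int, img_h: int,
                scale: float = 1.5) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a crop centered on the box.

    The box is enlarged by `scale` (an assumed value) so the crop keeps
    some surrounding context, then clipped to the image bounds.
    """
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    half_w, half_h = w * scale / 2, h * scale / 2
    left = max(0, int(cx - half_w))
    top = max(0, int(cy - half_h))
    right = min(img_w, int(cx + half_w))
    bottom = min(img_h, int(cy + half_h))
    return left, top, right, bottom


def select_best_view(view_boxes: Dict[str, Box]) -> str:
    """Pick the camera view whose box has the largest area.

    Largest area is an assumed criterion, used here only for illustration.
    """
    return max(view_boxes, key=lambda v: view_boxes[v][2] * view_boxes[v][3])
```

The global frame and the local crop returned by `expand_crop` would then both be passed to the model, matching the global-plus-local input described in the TL;DR.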

Paper Structure (19 sections, 7 equations, 4 figures, 7 tables)

Introduction
Related Work
Vision-Language Models
VLMs in Driving
Methodology
Overview
Dataset Construction
Bounding-box Guided View Selection
Visual Prompt Engineering
Textual Prompt Engineering
Short QA Construction
Model Architecture
Harnessing Sequential Questioning
Experiments
Dataset
...and 4 more sections

Figures (4)

Figure 1: The efficient fine-tuning paradigm for VLMs. The paradigm first performs prompt engineering, both visual and textual. Continuous training, comprising SFT and RLHF, is then applied to the pretrained VLMs. Finally, inference augmentation is used to improve performance.
Figure 2: The overview of CityLLaVA. Our method builds on the pretrained LLaVA-1.6-34B (Liu et al., 2024) equipped with block expansion (Wu et al., 2024), combining textual prompt engineering with visual prompt engineering guided by bounding boxes.
Figure 3: Examples of visual prompt usage (top) and cropped views guided by bounding boxes (bottom).
Figure 4: Training loss curves of models with multi-round and single-round QA.
