JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

Yunlong Lin, Zixu Lin, Haoyu Chen, Panwang Pan, Chenxin Li, Sixiang Chen, Yeying Jin, Wenbo Li, Xinghao Ding

TL;DR

JarvisIR addresses the fragility of vision-centric autonomous driving perception under real-world adverse weather with a VLM-powered agent that autonomously coordinates multiple expert restoration tools. It pairs a new dataset, CleanBench, which has synthetic and real-world subsets, with a two-stage training pipeline: supervised fine-tuning (SFT) on synthetic data, followed by human feedback alignment (MRRHF) on real-world data using a unified IQA reward. In experiments, JarvisIR outperforms all-in-one baselines, improves both decision-making and perception metrics, reduces hallucinations, and achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. The work offers a practical framework for robust, scalable, tool-augmented perception in deployment, includes ablations that validate the hybrid sampling and reward-model design, and outlines a roadmap for extending to broader scenarios and higher resolutions.

Abstract

Vision-centric perception systems struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset of high-quality, large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. Project page: https://cvpr2025-jarvisir.github.io/.
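To make the idea of learning from unpaired real-world data more concrete, the sketch below ranks candidate restoration plans by a reward score when no ground-truth clean image is available. It is only an illustration under stated assumptions, not the paper's MRRHF algorithm: the function `rank_by_reward`, the tool names, and every score in `TOY_SCORES` are hypothetical placeholders standing in for a learned IQA reward model.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[str, ...]  # an ordered sequence of restoration tool names


def rank_by_reward(
    candidates: Sequence[Plan],
    reward_fn: Callable[[Plan], float],
) -> List[Tuple[float, Plan]]:
    """Score each candidate plan and return (score, plan) pairs, best first."""
    scored = [(reward_fn(plan), plan) for plan in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical placeholder scores. In a real system these would come from
# running each plan on the degraded image and scoring the restored output
# with a no-reference image quality model.
TOY_SCORES = {
    ("derain", "dehaze"): 0.82,
    ("dehaze", "derain"): 0.64,
    ("dehaze",): 0.41,
}


def toy_reward(plan: Plan) -> float:
    """Stand-in reward: look up a made-up quality score for the plan."""
    return TOY_SCORES[plan]


ranked = rank_by_reward(list(TOY_SCORES), toy_reward)
best_score, best_plan = ranked[0]
print(best_plan, best_score)
```

A ranking of this kind gives a preference signal over sampled responses without paired clean images, which is the general role a reward model plays in feedback-based alignment.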

Paper Structure

This paper contains 29 sections, 11 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Limitations of single-task methods, all-in-one methods, and inaccurate task order. (a) Single-task specific and all-in-one methods fail to address coupled degradation in real-world scenarios. (b) Collaboration among multi-expert models effectively mitigates complex degradation, but is sensitive to the order of tasks. Unlike these approaches, JarvisIR can dynamically schedule different expert models in response to the rapidly changing scenarios and coupled degradation in the wild.
  • Figure 2: The dataset construction workflow consists of three main steps: 1) Synthesis of degraded images. 2) Generation of assessment reasoning and the optimal task sequence. 3) Generation of instruction-response pairs for the system.
  • Figure 3: Examples from the CleanBench-Real dataset.
  • Figure 4: The workflow of JarvisIR. To address real-world coupled weather degradation, we develop JarvisIR, a VLM-powered intelligent system that dynamically schedules expert models for restoration. Initially, JarvisIR assesses the degradation of the input images and parses user instructions to formulate a task plan, selecting the appropriate expert models for each subtask. The selected experts perform their designated tasks and return the results to JarvisIR, which integrates the outcomes and provides the final answer to the user. The design of the figure is inspired by HuggingGPT [shen2024hugginggpt].
  • Figure 5: Two-stage training framework of JarvisIR. In the first stage, JarvisIR undergoes supervised fine-tuning on synthetic data from CleanBench to enable it to follow user instructions and recognize image degradation. In the second stage, we further fine-tune JarvisIR on CleanBench-Real using the MRRHF algorithm to improve system robustness, reduce hallucinations, and enhance generalizability under real-world adverse weather conditions.
  • ...and 10 more figures
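
The workflow described in the Figure 4 caption (assess degradation, plan a task order, dispatch experts, collect the result) can be sketched as a small controller loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the expert registry `EXPERTS`, the degradation tags, and the fixed `priority` list are all hypothetical, and the rule-based `plan_tasks` merely stands in for the VLM, which in JarvisIR decides the order dynamically.

```python
from typing import Callable, Dict, List

# Toy image state: which degradations remain and which experts have run.
Image = Dict[str, List[str]]


def make_expert(target: str) -> Callable[[Image], Image]:
    """Build a hypothetical expert that removes exactly one degradation tag."""
    def expert(img: Image) -> Image:
        remaining = [d for d in img["degradations"] if d != target]
        return {"degradations": remaining, "history": img["history"] + [target]}
    return expert


# Hypothetical registry of expert restoration tools.
EXPERTS = {name: make_expert(name) for name in ("rain", "haze", "low_light", "noise")}


def plan_tasks(img: Image, priority: List[str]) -> List[str]:
    """Stand-in for the VLM controller: keep detected degradations, in priority order."""
    detected = set(img["degradations"])
    return [name for name in priority if name in detected]


def restore(img: Image, priority: List[str]) -> Image:
    """Run the planned experts one after another and return the final state."""
    for task in plan_tasks(img, priority):
        img = EXPERTS[task](img)
    return img


degraded = {"degradations": ["haze", "rain"], "history": []}
result = restore(degraded, priority=["low_light", "rain", "haze", "noise"])
print(result["history"])       # order in which experts were applied
print(result["degradations"])  # degradations left after restoration
```

Because each expert consumes the previous expert's output, the order returned by the planner matters, which is the sensitivity to task order that the Figure 1 caption points out.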