Domain Adaptation Through Task Distillation

Brady Zhou, Nimit Kalra, Philipp Krähenbühl

TL;DR

The paper tackles the domain shift that arises when transferring learned models from simulation to real-world environments, a central challenge for autonomous systems because labeled real-world data is scarce. It introduces task distillation, a two-stage approach. The first stage distills a source model into a proxy model that takes the labels of a readily available recognition task as input. The second stage distills that proxy into a target model that operates in the target domain. The two stages are formalized as $f^P := \mathcal{D}_{\mathcal{S}}(f^S)$ and $f^T := \mathcal{D}_{\mathcal{T}}(f^P)$. By leveraging ground-truth recognition labels instead of end-task labels in the target domain, the method reduces the error propagation common in modular pipelines and avoids reliance on imperfect surrogate systems at deployment time, with accuracies roughly characterized by $a^T_{distill} = a^P \, G^L \, a^d$. Empirically, task distillation enables cross-simulator policy transfer (ViZDoom, SuperTuxKart to CARLA) and improves semantic segmentation transfer (SYNTHIA-SF to CARLA/Cityscapes), outperforming baselines. Overall, the framework demonstrates that solving recognition across all domains is not strictly necessary; a well-chosen proxy task can yield robust, end-to-end models in the target domain, broadening the practical applicability of simulation-to-reality transfer.
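The two-stage structure described above can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D "images", the linear rendering functions, the least-squares student, and every name in the code (`distill`, `render_source`, `render_target`, `recognition_label`, `f_source`) are hypothetical stand-ins chosen only to show how $f^P$ is fit from $f^S$ on source data and how $f^T$ is then fit from $f^P$ on target data.

```python
import random


def fit_linear(xs, ys):
    """Closed-form 1-D least squares; returns a function x -> w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov / var
    b = mean_y - w * mean_x
    return lambda x: w * x + b


def distill(teacher, teacher_inputs, student_inputs):
    """Fit a student on student_inputs to imitate teacher(teacher_inputs)."""
    targets = [teacher(t) for t in teacher_inputs]
    return fit_linear(student_inputs, targets)


# Hypothetical toy domains: the same latent state s is rendered differently
# in each domain, but the recognition label is shared across domains.
def render_source(s):
    return 2.0 * s + 1.0


def render_target(s):
    return -3.0 * s + 5.0


def recognition_label(s):
    return s


# Assumed pretrained source model f^S: maps a source image to action 0.5 * s.
def f_source(image):
    return 0.25 * (image - 1.0)


rng = random.Random(0)
source_states = [rng.uniform(-1.0, 1.0) for _ in range(50)]
target_states = [rng.uniform(-1.0, 1.0) for _ in range(50)]

# Stage 1: f^P := D_S(f^S). The proxy sees labels, supervised by f^S on source images.
source_images = [render_source(s) for s in source_states]
source_labels = [recognition_label(s) for s in source_states]
f_proxy = distill(f_source, source_images, source_labels)

# Stage 2: f^T := D_T(f^P). The target model sees target images,
# supervised by the proxy evaluated on ground-truth target labels.
target_images = [render_target(s) for s in target_states]
target_labels = [recognition_label(s) for s in target_states]
f_target = distill(f_proxy, target_labels, target_images)

s = 0.4
print(f_target(render_target(s)), "vs source behavior", 0.5 * s)
```

In this toy setting the target model never sees an end-task label in the target domain; it only needs paired target images and recognition labels, which mirrors the data requirement stated in the summary. Real models would be deep networks trained by gradient descent rather than closed-form linear fits.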

Abstract

Deep networks devour millions of precisely annotated images to build their complex and powerful representations. Unfortunately, tasks like autonomous driving have virtually no real-world training data. Repeatedly crashing a car into a tree is simply too expensive. The commonly prescribed solution is simple: learn a representation in simulation and transfer it to the real world. However, this transfer is challenging since simulated and real-world visual experiences vary dramatically. Our core observation is that for certain tasks, such as image recognition, datasets are plentiful. They exist in any interesting domain, simulated or real, and are easy to label and extend. We use these recognition datasets to link up a source and target domain to transfer models between them in a task distillation framework. Our method can successfully transfer navigation policies between drastically different simulators: ViZDoom, SuperTuxKart, and CARLA. Furthermore, it shows promising results on standard domain adaptation benchmarks.

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Raw visual inputs (a) may significantly vary across different domains, yet they often share common recognition labels (b). In this work, we use these recognition labels to transfer tasks between different domains.
  • Figure 2: Our method first distills a source model to a proxy model that uses labels as inputs. As proxy labels generalize to the target domain, a second stage of distillation is performed to produce a target model.
  • Figure 3: We compare visual domains by their raw monocular images and corresponding semantic representations. While the domains vary significantly in their raw images, they are quite similar in their semantic modalities. However, note that the predicted modalities used by a modular pipeline are not perfect. For example, in the bottom-most row, the map-view prediction fails to capture the yellow car in view directly left of the agent. When supplied to the downstream driving policy, this vision failure can result in unintended behavior.
  • Figure 4: We qualitatively examine how four different driving policies transfer to CARLA. Each policy is evaluated at the same state over four transfer methods, with predicted waypoints shown in red. Inferred modality is displayed for CyCADA and Modular. As shown, an inaccurate modality is used by a modular driving policy when transferring from SuperTuxKart via camera-view semantic segmentation. The median is misclassified as drivable road and the predicted waypoints direct the agent off of the road. (Best viewed on screen.)
  • Figure 5: Performance at different amounts of target-domain training data.
  • ...and 1 more figure