Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Huy-Dung Nguyen; Anass Bairouk; Mirjana Maras; Wei Xiao; Tsun-Hsuan Wang; Patrick Chareyre; Ramin Hasani; Marc Blanchon; Daniela Rus

TL;DR

The paper addresses the challenge that single-task driving models lack robust, contextual perception by proposing a unified multi-task encoder trained on depth, pose, 3D scene flow, and segmentation tasks. It introduces a multi-scale pose decoder and a knowledge-distillation strategy from a multi-encoder teacher to stabilize joint training, enabling efficient multi-task inference. The authors demonstrate competitive per-task performance across depth, pose, flow, and segmentation, and show that steering estimation benefits from the dense latent space when the encoder is frozen, outperforming ImageNet-pretrained baselines. The work highlights the potential of human-like, multi-task visual representations to improve robustness and efficiency in autonomous navigation, and provides a pretrained model for broader adoption.

Abstract

Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we propose a unified encoder trained on multiple computer vision tasks crucial for urban driving, including depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By integrating these diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. We evaluate the model on steering estimation as a downstream task, leveraging its dense latent space. To ensure efficient multi-task learning, we introduce a multi-scale feature network for pose estimation and apply knowledge distillation from a multi-backbone teacher model. Our results highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet. These results underline the significance of task-specific visual features and demonstrate the promise of multi-task learning in advancing autonomous driving systems. More details and the pretrained model are available at https://hi-computervision.github.io/uni-encoder/.
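The abstract does not spell out the form of the distillation objective, so the following is only a minimal sketch of one common variant: the single-encoder student is penalized for deviating from the per-task outputs of a multi-backbone teacher, on top of its own task losses. The function names, the use of mean squared error, and the `kd_weight` parameter are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Dict[str, Sequence[float]],
    teacher_out: Dict[str, Sequence[float]],
    task_losses: Dict[str, float],
    kd_weight: float = 1.0,
) -> float:
    """Sum of the student's task losses plus a weighted distillation term.

    Assumption: the distillation term is the MSE between student and
    teacher outputs, averaged over the tasks the teacher provides.
    """
    kd_term = sum(
        mse(student_out[task], teacher_out[task]) for task in teacher_out
    ) / len(teacher_out)
    return sum(task_losses.values()) + kd_weight * kd_term


# Toy usage with two tasks and two-element outputs (hypothetical values).
student = {"depth": [1.0, 2.0], "semantic": [0.0, 1.0]}
teacher = {"depth": [1.0, 4.0], "semantic": [0.0, 1.0]}
losses = {"depth": 0.5, "semantic": 0.25}
total = distillation_loss(student, teacher, losses, kd_weight=0.5)
```

In this toy setup the teacher acts as a fixed target for every task head at once, which is one way a distillation term could help keep joint training of a shared encoder stable.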

Paper Structure (19 sections, 5 equations, 3 figures, 4 tables)

Introduction
Related works
Image segmentation
Monocular Depth & Pose Estimation
Scene Flow and Motion Segmentation
Steering estimation
Multi-Task Learning in Computer Vision
Methods
Panoptic, Instance, Semantic Segmentations
Depth, Pose, 3D Flow Estimation, Motion Mask
Towards a Unified Multi-Task Encoder
Efficient Steering Estimation from Dense Latent Space
Experimental results
Training Setup
Ablation Studies
...and 4 more sections

Figures (3)

Figure 1: Our multi-task training strategy. $I_s$, $I_t$, $I_{1 \dots 16}$ represent the source, target, and 16 sequential images, respectively. Their features, denoted as $f_s$, $f_t$, $f_{1 \dots 16}$, are extracted (and concatenated when necessary) using our single encoder.
Figure 2: Simplified architecture of our model: (a) Depth network using target image features $f_t$ to output depth $\mathbf{d}_t$, (b) Multi-scale pose network using source and target image features $f_s, f_t$ to output relative pose $\mathbf{T}_{t \rightarrow s}$, (c) 3D Scene Flow $\mathbf{F}_C$ and Motion mask $\mathbf{M}$ networks using RGB images and features $f_s, f_t$, (d) Segmentation network outputting panoptic, instance, and semantic segmentations, and (e) Loss computation $L_{ssup}$ for joint training of depth, pose, 3D scene flow, and motion mask segmentation. We denote rigid flow $\mathbf{F}_R$, independent flow $\mathbf{F}_I$, final flow, and sampled target image $\hat{\mathbf{I}}_t$.
Figure 3: Qualitative results. Left to right: Input, panoptic, instance, semantic output, depth, motion mask, independent flow.
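The Figure 2 caption names a rigid flow $\mathbf{F}_R$, an independent flow $\mathbf{F}_I$, a motion mask $\mathbf{M}$, and a final flow, but does not state how they are combined. A minimal sketch for a single 3D point follows, under two assumptions that are not confirmed by the caption: the rigid flow is the displacement induced by the relative camera pose, and the final flow adds the independent flow weighted by the motion mask. All function names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rigid_flow(
    point: Vec3,
    rotation: Sequence[Sequence[float]],
    translation: Vec3,
) -> Vec3:
    """Displacement of a static 3D point under camera motion: (R p + t) - p."""
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    return (moved[0] - point[0], moved[1] - point[1], moved[2] - point[2])


def final_flow(f_rigid: Vec3, f_indep: Vec3, motion_prob: float) -> Vec3:
    """Assumed combination: F_R + M * F_I, with M in [0, 1] for this point."""
    return (
        f_rigid[0] + motion_prob * f_indep[0],
        f_rigid[1] + motion_prob * f_indep[1],
        f_rigid[2] + motion_prob * f_indep[2],
    )


# Toy usage: the camera moves 1 unit forward, so a static point appears
# 1 unit closer; a moving object additionally shifts 0.5 units sideways.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f_r = rigid_flow((1.0, 2.0, 10.0), identity, (0.0, 0.0, -1.0))
static_point = final_flow(f_r, (0.5, 0.0, 0.0), motion_prob=0.0)
moving_point = final_flow(f_r, (0.5, 0.0, 0.0), motion_prob=1.0)
```

Under this assumed decomposition, points the mask marks as static are explained by depth and pose alone, and the independent flow only contributes where the mask indicates object motion.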
