VertiCoder: Self-Supervised Kinodynamic Representation Learning on Vertically Challenging Terrain

Mohammad Nazeri, Aniket Datar, Anuj Pokhrel, Chenhui Pan, Garrett Warnell, Xuesu Xiao

TL;DR

VertiCoder addresses kinodynamic understanding for robots on vertically challenging terrain by learning a general terrain representation through self-supervised learning (SSL). It pre-trains a TransformerEncoder with a context token on two pretext tasks, reconstructing randomly masked tokens and predicting the next terrain patch, and then serves four downstream tasks from a single representation: Forward Kinodynamics Learning (FKD), Inverse Kinodynamics Learning (IKD), Behavior Cloning (BC), and Patch Reconstruction (PR). The approach outperforms specialized End-to-End models on all four tasks while using 77% fewer parameters, and performs comparably to state-of-the-art kinodynamic modeling and planning methods in real-world deployment. These results indicate that SSL can mitigate overfitting and improve robustness across diverse environments and tasks in off-road robot mobility.

Abstract

We present VertiCoder, a self-supervised representation learning approach for robot mobility on vertically challenging terrain. Using the same pre-training process, VertiCoder can handle four downstream tasks (forward kinodynamics learning, inverse kinodynamics learning, behavior cloning, and patch reconstruction) with a single representation. VertiCoder uses a TransformerEncoder to learn the local context of its surroundings through random masking and next-patch reconstruction. We show that VertiCoder outperforms specialized End-to-End models on all four tasks while using 77% fewer parameters. We also show that VertiCoder performs comparably to state-of-the-art kinodynamic modeling and planning approaches in real-world robot deployment. These results underscore the efficacy of VertiCoder in mitigating overfitting and fostering more robust generalization across diverse environmental contexts and downstream vehicle kinodynamic tasks.

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: VertiCoder uses a Transformer to encode a history of terrain patches, robot actions, and robot poses and predicts the representation of the next patch. The learned representation is used in multiple downstream tasks.
  • Figure 2: VertiCoder Architecture. VertiCoder employs a TransformerEncoder with a context token to predict the representation of the next patch beneath the robot. It does so through two simultaneous pretext tasks: predicting the next patch in a sequence of patches, actions, and poses, and reconstructing randomly masked tokens within that sequence. VertiCoder is then frozen, multiple downstream task heads are attached to the context token, and only those heads are trained.
  • Figure 3: Rock Testbed and V4W used for Data Collection and Experiments. The modularity of the testbed allows diverse rock configurations for training and evaluation.
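
The two pretext tasks described in the Figure 2 caption can be illustrated as a data-preparation step. The following is a minimal sketch in plain Python, not the authors' implementation. The token layout (one context token followed by interleaved patch, action, and pose tokens), the mask ratio of 0.25, the string placeholders, and every function name are assumptions made for illustration only.

```python
import random

CTX = "<ctx>"    # hypothetical placeholder for the context token
MASK = "<mask>"  # hypothetical placeholder for a masked token


def build_sequence(patches, actions, poses):
    """Prepend the context token, then interleave (patch, action, pose) per step."""
    if not (len(patches) == len(actions) == len(poses)):
        raise ValueError("patch, action, and pose histories must have equal length")
    seq = [CTX]
    for patch, action, pose in zip(patches, actions, poses):
        seq.extend([patch, action, pose])
    return seq


def mask_sequence(seq, mask_ratio, rng):
    """Replace a random subset of non-context tokens with MASK.

    Returns the masked sequence and a dict {index: original token}
    that acts as the reconstruction target.
    """
    candidates = list(range(1, len(seq)))  # index 0 is the context token, never masked
    n_mask = max(1, round(mask_ratio * len(candidates)))
    chosen = sorted(rng.sample(candidates, n_mask))
    masked = list(seq)
    targets = {}
    for i in chosen:
        targets[i] = seq[i]
        masked[i] = MASK
    return masked, targets


def make_pretext_example(patches, actions, poses, next_patch, mask_ratio=0.25, seed=0):
    """Bundle both pretext targets: masked-token reconstruction and next patch."""
    rng = random.Random(seed)
    history = build_sequence(patches, actions, poses)
    masked, mask_targets = mask_sequence(history, mask_ratio, rng)
    return {"input": masked, "mask_targets": mask_targets, "next_patch": next_patch}


example = make_pretext_example(
    patches=["p0", "p1", "p2", "p3"],
    actions=["a0", "a1", "a2", "a3"],
    poses=["s0", "s1", "s2", "s3"],
    next_patch="p4",
)
```

In a real model the tokens would be embedded vectors rather than strings. Per the caption, the encoder output at the context token position is what the frozen representation exposes to the downstream task heads.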