Distilling Privileged Information for Dubins Traveling Salesman Problems with Neighborhoods

Min Kyu Shin, Su-Jeong Park, Seung-Keol Ryu, Heeyeon Kim, Han-Lim Choi

TL;DR

This work tackles the Dubins Traveling Salesman Problem with Neighborhoods (DTSPN) for non-holonomic vehicles by introducing DiPDTSP, a two-phase learning framework that distills privileged information (PI) from expert trajectories into a PI-free adaptation network. Phase 1 performs RL fine-tuning with privileged information to train a high-quality policy, while Phase 2 trains an adaptation network to replicate the encoder’s latent representation without any privileged data. The approach is roughly 50× faster than the LKH-based heuristic and outperforms standard imitation-learning baselines, while reliably sensing all tasks in simulations. By leveraging privileged information during training but operating without it at deployment, DiPDTSP provides fast, robust, sensor-aware DTSPN planning suitable for real-time autonomous navigation.

Abstract

This paper presents a novel learning approach for the Dubins Traveling Salesman Problem (DTSP) with Neighborhoods (DTSPN) that quickly produces a tour for a non-holonomic vehicle passing through the neighborhoods of given task points. The method involves two learning phases. First, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the Lin-Kernighan heuristic (LKH) algorithm. Second, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data is applied to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL-with-demonstration schemes, most of which fail to sense all the task points.
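
To make the problem setting concrete, the following is a minimal sketch of a Dubins-style vehicle (constant speed, bounded turn rate) that marks a task as visited once the task lies within its sensor radius. The Euler integration, unit speed, time step, turn-rate bound, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def dubins_step(x, y, heading, turn_rate, speed=1.0, dt=0.1):
    """Advance the vehicle one Euler step at constant speed (assumed values)."""
    x_new = x + speed * math.cos(heading) * dt
    y_new = y + speed * math.sin(heading) * dt
    heading_new = (heading + turn_rate * dt) % (2.0 * math.pi)
    return x_new, y_new, heading_new

def sensed(agent_xy, task_xy, sensor_radius):
    """A task counts as sensed when it lies inside the sensor radius."""
    return math.dist(agent_xy, task_xy) <= sensor_radius

def rollout(start, turn_rates, tasks, sensor_radius, max_turn_rate=1.0):
    """Apply turn-rate commands in order and record which tasks get sensed."""
    x, y, heading = start
    visited = [sensed((x, y), task, sensor_radius) for task in tasks]
    for command in turn_rates:
        # Clip the command to respect the (assumed) turn-rate bound.
        command = max(-max_turn_rate, min(max_turn_rate, command))
        x, y, heading = dubins_step(x, y, heading, command)
        for i, task in enumerate(tasks):
            if not visited[i] and sensed((x, y), task, sensor_radius):
                visited[i] = True
    return (x, y, heading), visited

# Drive straight along the x-axis past one nearby task and one distant task.
final_state, visited = rollout(
    start=(0.0, 0.0, 0.0),
    turn_rates=[0.0] * 20,
    tasks=[(1.0, 0.2), (0.0, 5.0)],
    sensor_radius=0.3,
)
```

In this toy run the vehicle passes within the sensor radius of the first task but never approaches the second, which is the kind of incomplete coverage the abstract attributes to most baseline schemes.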

Paper Structure (25 sections, 5 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: The proposed DiPDTSP controls an agent with a sensor to solve DTSPN20.
  • Figure 2: DiPDTSP has two training phases. In the first phase (top), the encoder receives the common state $s$ and the privileged information $p_e$, which consists of 4 relative positions and heading angles from expert trajectories; the encoder and policy network are trained with model-free RL. In the second phase (bottom), the adaptation network distills the encoder: it is trained by supervised learning to generate the same latent variable as the encoder. The final adaptation network and policy network generate DTSPN trajectories given only the positions of the agent and tasks.
  • Figure 3: Average reward over 3M training steps for our method and the baselines. DiPDTSP (olive) shows only a small reward gap to the expert. Because the algorithms converge early, the x-axis uses the log of time steps.
  • Figure 4: Demonstrations of DiPDTSP and the baseline methods. The top and bottom figures show two demonstrations with different initial positions of tasks and agents. Expert trajectories are red dashed lines, and the derived agent trajectories are green lines. Light green circles mark the sensor coverage where the agent senses a task. The baselines stray far from the expert path and show low coverage rates.
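
The second training phase described in the Figure 2 caption can be sketched as follows. This is a toy illustration, not the authors' implementation: the frozen encoder and the adaptation network are both linear, the loss is assumed to be squared error between latents (the caption only says "supervised learning"), and the privileged information is assumed to be a fixed function of the state so that a PI-free student can match the teacher exactly. All weights and names are hypothetical.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Frozen teacher encoder (hypothetical weights): z = W_S s + W_P p.
W_S = [[0.5, -0.2], [0.1, 0.3]]
W_P = [[0.4], [-0.6]]

def privileged_info(state):
    """Toy assumption: the privileged signal is determined by the state."""
    return [0.7 * state[0] - 0.3 * state[1]]

def encoder(state, privileged):
    """Teacher latent computed from the state and the privileged information."""
    from_state = matvec(W_S, state)
    from_privileged = matvec(W_P, privileged)
    return [a + b for a, b in zip(from_state, from_privileged)]

def train_adaptation(steps=2000, lr=0.1, seed=0):
    """Fit a PI-free linear map A so that A s matches the teacher latent."""
    rng = random.Random(seed)
    A = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        s = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        target = encoder(s, privileged_info(s))  # teacher sees privileged info
        predicted = matvec(A, s)                 # student sees the state only
        for i in range(2):
            error = predicted[i] - target[i]
            for j in range(2):
                # Stochastic gradient step on the squared latent error.
                A[i][j] -= lr * 2.0 * error * s[j]
    return A

def latent_mse(A, samples=200, seed=1):
    """Mean squared latent error of the student on fresh random states."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        s = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        target = encoder(s, privileged_info(s))
        predicted = matvec(A, s)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    return total / samples

adaptation_weights = train_adaptation()
```

After training, the student reproduces the teacher's latent from the state alone, so a policy network that consumes the latent can run unchanged without privileged information. In a realistic setting the privileged signal is not a simple function of the state, and the match would only be approximate.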