Action Deviation-Aware Inference for Low-Latency Wireless Robots
Jeyoung Park, Yeonsub Lim, Seungeun Oh, Jihong Park, Jinho Choi, Seong-Lyun Kim
TL;DR
The paper tackles latency-critical embodied AI over 6G hyper-reliable low-latency communication (HRLLC) by enabling distributed, cooperative inference between a lightweight on-device draft model and a server-side target model. It introduces Action Deviation-Aware Hybrid Inference (ADAHI), which uses an EMA-based action deviation Δ(t) to predict when server verification is needed, allowing selective speculative sampling for a Vector-Quantized Behavior Transformer (VQ-BeT) trained with a residual VQ-VAE (RQ-VAE). The method substantially reduces uplink transmissions and server compute while preserving task performance: it achieves up to 97.2% of the performance of full speculative sampling and a 39.2% reduction in end-to-end latency in experiments across manipulation, balancing, and swarm control use cases. These results demonstrate the practical viability of on-device intelligence with 6G HRLLC for fast, reliable robotic control in wireless environments, and the paper outlines opportunities for extension beyond VQ-BeT.
Abstract
To support latency-sensitive AI applications ranging from autonomous driving to industrial robot manipulation, 6G envisions distributed ML with computational resources in mobile, edge, and cloud connected over hyper-reliable low-latency communication (HRLLC). In this setting, speculative decoding can facilitate collaborative inference across models deployed in a distributed manner: a lightweight on-device model locally generates drafts while a more capable remote target model on a server verifies and corrects them in parallel with speculative sampling, resulting in lower latency without compromising accuracy. However, unlike autoregressive text generation, behavior cloning policies, which are typically used for embodied AI applications, cannot parallelize verification and correction for multiple drafts, because each generated action depends on an observation updated by the previous action. To this end, we propose Action Deviation-Aware Hybrid Inference (ADAHI), wherein drafts are selectively transmitted and verified based on action deviation, which correlates strongly with the probability that the target model rejects the action. By invoking server operation only when necessary, communication and computational overhead can be reduced while the accuracy gain from speculative sampling is preserved. Experiments on our testbed show that ADAHI reduces transmission and server operations by approximately 40%, lowers end-to-end latency by 39.2%, and attains up to 97.2% of the task-success rate of a baseline that invokes speculative sampling for every draft embedding vector.
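The selective-verification idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary above does not define Δ(t) precisely, so the code assumes the deviation is the Euclidean distance between the current draft action and an exponential moving average (EMA) of earlier draft actions, and that verification is triggered when it exceeds a fixed threshold. The names `DeviationGate` and `run_episode`, and the values of `alpha` and `threshold`, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

import numpy as np


@dataclass
class DeviationGate:
    """Flags draft actions whose deviation from an EMA of past drafts is large.

    Assumption: the deviation measure and threshold rule are illustrative
    stand-ins; the paper's exact definition of the action deviation may differ.
    """
    alpha: float = 0.2        # EMA smoothing factor (assumed value)
    threshold: float = 0.5    # deviation above which the server is invoked (assumed)
    ema: Optional[np.ndarray] = None

    def step(self, draft_action: Sequence[float]) -> Tuple[float, bool]:
        action = np.asarray(draft_action, dtype=float)
        if self.ema is None:
            # No history yet: initialise the EMA and skip verification.
            self.ema = action.copy()
            return 0.0, False
        # Deviation is measured against the EMA *before* it is updated.
        deviation = float(np.linalg.norm(action - self.ema))
        self.ema = self.alpha * action + (1.0 - self.alpha) * self.ema
        return deviation, deviation > self.threshold


def run_episode(actions: Sequence[Sequence[float]], gate: DeviationGate) -> List[bool]:
    """Returns, per step, whether the draft would be sent for server verification."""
    return [gate.step(a)[1] for a in actions]


if __name__ == "__main__":
    drafts = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0], [3.0, 4.0]]
    flags = run_episode(drafts, DeviationGate(alpha=0.5, threshold=1.0))
    print(flags)  # [False, False, True, True]
```

In this toy run, the first two drafts stay on the device, while the abrupt jump at the third step pushes the deviation past the threshold and would trigger an uplink transmission. The fourth draft is also flagged because the EMA has not yet caught up with the new action. In a real system the threshold would need to be tuned against the observed rejection behavior of the target model.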
