Action Deviation-Aware Inference for Low-Latency Wireless Robots

Jeyoung Park, Yeonsub Lim, Seungeun Oh, Jihong Park, Jinho Choi, Seong-Lyun Kim

TL;DR

The paper tackles latency-critical embodied AI over 6G HRLLC by enabling distributed, cooperative inference between a lightweight on-device draft model and a server-side target model. It introduces Action Deviation-Aware Hybrid Inference (ADAHI), which uses an EMA-based action deviation Δ(t) to predict when server verification is needed, enabling selective speculative sampling for a Vector-Quantized Behavior Transformer (VQ-BeT) trained with a residual VQ-VAE (RQ-VAE). The method cuts uplink transmissions and server computation by roughly 40% while preserving high task performance: across manipulation, balancing, and swarm control use cases, it achieves up to 97.2% of the task success rate of full speculative sampling and a 39.2% reduction in end-to-end latency. These results demonstrate the practical viability of on-device intelligence combined with 6G HRLLC for fast, reliable robotic control in wireless environments, and the paper outlines extension opportunities beyond VQ-BeT.
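The trigger described above can be sketched in a few lines. Following the Figure 1 caption, Δ(t) is taken as the Euclidean distance between the new draft action and the exponential moving average (EMA) of past actions, and server verification is requested when Δ(t) > Δ_th. The class name `ActionDeviationGate`, the smoothing factor `alpha`, the threshold value, the rule that the first action (no history yet) has zero deviation, and updating the EMA with the executed action after the check are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


class ActionDeviationGate:
    """Hypothetical sketch of the EMA-based action-deviation trigger.

    Assumptions (not from the paper): the EMA smoothing factor `alpha`,
    the threshold `delta_th`, zero deviation when there is no history yet,
    and updating the EMA with each executed action after the check.
    """

    def __init__(self, alpha: float = 0.3, delta_th: float = 0.5):
        self.alpha = alpha        # assumed EMA smoothing factor
        self.delta_th = delta_th  # assumed verification threshold Delta_th
        self.ema = None           # EMA of past actions (None until the first action)

    def deviation(self, action) -> float:
        """Delta(t): Euclidean distance between the new action and the EMA of past actions."""
        if self.ema is None:
            return 0.0
        return float(np.linalg.norm(np.asarray(action, dtype=float) - self.ema))

    def needs_verification(self, action) -> bool:
        """True when Delta(t) > Delta_th, i.e. the draft should be sent to the server."""
        return self.deviation(action) > self.delta_th

    def update(self, action) -> None:
        """Fold the executed action into the EMA."""
        action = np.asarray(action, dtype=float)
        if self.ema is None:
            self.ema = action.copy()
        else:
            self.ema = self.alpha * action + (1.0 - self.alpha) * self.ema


gate = ActionDeviationGate(alpha=0.5, delta_th=0.5)
for a in ([0.0, 0.0], [0.1, 0.0], [1.0, 1.0]):
    print(gate.needs_verification(a))  # prints False, then False, then True
    gate.update(a)
```

Because the EMA tracks the recent trajectory, small step-to-step changes stay on-device, while an abrupt jump away from the recent trend is routed to the server for verification.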

Abstract

To support latency-sensitive AI applications ranging from autonomous driving to industrial robot manipulation, 6G envisions distributed ML with computational resources in mobile, edge, and cloud connected over hyper-reliable low-latency communication (HRLLC). In this setting, speculative decoding can facilitate collaborative inference across distributively deployed models: a lightweight on-device model locally generates drafts while a more capable remote target model on a server verifies and corrects them in parallel with speculative sampling, resulting in lower latency without compromising accuracy. However, unlike autoregressive text generation, behavior cloning policies, which are typically used for embodied AI applications, cannot parallelize verification and correction over multiple drafts, because each generated action depends on the observation updated by the previous action. To this end, we propose Action Deviation-Aware Hybrid Inference (ADAHI), wherein drafts are selectively transmitted and verified based on action deviation, which correlates strongly with an action's rejection probability by the target model. By invoking server operations only when necessary, ADAHI reduces communication and computational overhead while preserving the accuracy gain from speculative sampling. Experiments on our testbed show that ADAHI reduces transmissions and server operations by approximately 40%, lowers end-to-end latency by 39.2%, and attains up to 97.2% of the task success rate of a baseline that invokes speculative sampling for every draft embedding vector.
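The verify-and-correct step can be illustrated with the standard speculative sampling rule. The server accepts a draft index x, sampled from the draft distribution q, with probability min(1, p(x)/q(x)), where p is the target model's distribution; on rejection it resamples from the normalized residual max(0, p − q). The final index is then distributed exactly as p, and the expected rejection probability of a draft is Σ_x max(0, q(x) − p(x)), the total variation distance between q and p. This is a minimal NumPy sketch assuming the textbook rule is applied to each codebook index; the function names are hypothetical.

```python
import numpy as np


def speculative_sample(draft_idx: int, q: np.ndarray, p: np.ndarray,
                       rng: np.random.Generator) -> tuple[int, bool]:
    """Verify one draft codebook index with the standard speculative sampling rule.

    q: draft model's distribution over codebook entries (on-device)
    p: target model's distribution over the same entries (server)
    Returns (final index, whether the draft was accepted).
    """
    # Accept the draft with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[draft_idx] / q[draft_idx]):
        return draft_idx, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False


def rejection_probability(q: np.ndarray, p: np.ndarray) -> float:
    """Expected rejection probability of a draft sampled from q: sum_x max(0, q(x) - p(x))."""
    return float(np.maximum(q - p, 0.0).sum())


rng = np.random.default_rng(0)
q = np.array([0.5, 0.25, 0.25])  # draft distribution (illustrative)
p = np.array([0.25, 0.5, 0.25])  # target distribution (illustrative)
print(rejection_probability(q, p))  # 0.25
print(speculative_sample(int(rng.choice(3, p=q)), q, p, rng))
```

A rejected draft costs a correction but never biases the output, which is why verification preserves the target model's accuracy; the quantity `rejection_probability` is the one the abstract reports as strongly correlated with action deviation.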

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, 1 table, and 1 algorithm.

Figures (4)

  • Figure 1: Architecture of Action Deviation-Aware Hybrid Inference. (a) VQ-BeT operations: for a given observation $o_t$, the code predictor head generates a probability distribution over the embedding vectors of the codebooks, and the offset head, an MLP, computes a small offset [18]. (b) For an action generated by the on-device VQ-BeT draft model, the action deviation $\Delta(t)$ is computed as the Euclidean distance from the exponential moving average of past actions. (c) When $\Delta(t)>\Delta_{th}$, the embedding vectors $d_1 \dots d_n$, the draft model's probability distribution $\mathbf{q}_t$, and the observation $o_t$ are transmitted to the server. The server then performs speculative sampling for each embedding vector and transmits the finalized continuous action back to the local device.
  • Figure 2: Plots showing the relationship between action deviation and rejection probability for three use cases. Each plot shows a correlation coefficient of at least 0.9457 and includes more than 50,000 actions.
  • Figure 3: Use cases for ADAHI (from left to right): Kitchen Environment Manipulation, Ball Balancing, and Swarm Control.
  • Figure 4: Average per-action latency, task success rate, and action throughput breakdown for inference methods (left); CDF of action throughput for ADAHI and hybrid inference where dotted lines mark each method’s 2.5th percentile (right).
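The three stages of Figure 1 can be strung together into one control step. The sketch below is a toy under stated assumptions: random linear "draft" and "target" code predictors stand in for VQ-BeT, the residual decoder simply sums the selected embedding vectors, the offset head is omitted, each residual layer is verified independently with the standard speculative sampling rule, and `alpha`, `delta_th`, and every size constant are illustrative values not taken from the paper. Names such as `adahi_step` and `code_distributions` are hypothetical.

```python
import numpy as np

# Toy stand-ins for VQ-BeT; every size and weight here is an illustrative assumption.
OBS_DIM, ACTION_DIM, N_LAYERS, CODEBOOK_SIZE = 4, 2, 2, 8
rng = np.random.default_rng(0)
CODEBOOKS = rng.normal(size=(N_LAYERS, CODEBOOK_SIZE, ACTION_DIM))  # residual codebooks
W_DRAFT = rng.normal(size=(N_LAYERS, CODEBOOK_SIZE, OBS_DIM))        # on-device code predictor
W_TARGET = W_DRAFT + 0.5 * rng.normal(size=W_DRAFT.shape)            # server code predictor


def code_distributions(W, obs):
    """Softmax over codebook entries per residual layer, shape (N_LAYERS, CODEBOOK_SIZE)."""
    logits = W @ obs
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def decode(codes):
    """Toy residual decoder: sum the selected embedding vectors (offset head omitted)."""
    return sum(CODEBOOKS[layer, c] for layer, c in enumerate(codes))


def speculative_sample(d, q, p):
    """Standard speculative sampling for one draft index d (assumed verification rule)."""
    if rng.random() < min(1.0, p[d] / q[d]):
        return d
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum()))


def adahi_step(obs, ema, alpha=0.3, delta_th=0.5):
    """One control step: local draft, deviation check, optional server verification."""
    q = code_distributions(W_DRAFT, obs)  # draft distributions q_t
    drafts = [int(rng.choice(CODEBOOK_SIZE, p=q[layer])) for layer in range(N_LAYERS)]
    action = decode(drafts)  # draft action from d_1..d_n
    delta = 0.0 if ema is None else float(np.linalg.norm(action - ema))  # Delta(t)
    verified = delta > delta_th
    if verified:
        # Uplink (d_1..d_n, q_t, o_t); the server runs the target model and verifies each code.
        p = code_distributions(W_TARGET, obs)
        final = [speculative_sample(d, q[layer], p[layer]) for layer, d in enumerate(drafts)]
        action = decode(final)  # downlink: finalized continuous action
    # Fold the executed action into the EMA of past actions.
    ema = action.copy() if ema is None else alpha * action + (1 - alpha) * ema
    return action, ema, verified


ema, n_verified = None, 0
for t in range(100):
    action, ema, verified = adahi_step(rng.normal(size=OBS_DIM), ema)
    n_verified += verified
print(f"server invoked on {n_verified} of 100 steps")
```

Lowering `delta_th` moves this loop toward the baseline that verifies every draft, while raising it keeps more steps on-device; this is the communication/accuracy trade-off governed by Δ_th.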