Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, SongLin Dong, Zhiheng Ma, Yihong Gong, Sheng Zhong

TL;DR

This work proposes Neural Implicit Action Fields (NIAF), which reformulates action prediction from discrete waypoints to continuous action function regression. A multimodal large language model (MLLM) acts as a hierarchical spectral modulator over a learnable motion prior, synthesizing infinite-resolution trajectories as continuous-time manifolds.

Abstract

Despite the rapid progress of Vision-Language-Action (VLA) models, the prevailing paradigm of predicting discrete waypoints remains fundamentally misaligned with the intrinsic continuity of physical motion. This discretization imposes rigid sampling rates, lacks high-order differentiability, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), a paradigm shift that reformulates action prediction from discrete waypoints to continuous action function regression. By utilizing an MLLM as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes infinite-resolution trajectories as continuous-time manifolds. This formulation enables analytical differentiability, allowing for explicit supervision of velocity, acceleration, and jerk to ensure mathematical consistency and physical plausibility. Our approach achieves state-of-the-art results on CALVIN and LIBERO benchmarks across diverse backbones. Furthermore, real-world experiments demonstrate that NIAF enables stable impedance control, bridging the gap between high-level semantic understanding and low-level dynamic execution.
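
To make the contrast between waypoints and a continuous action function concrete, here is a minimal sketch. It is not the paper's model: it assumes a hand-picked sum of sinusoids as the action function, and the names `make_action_function`, `amplitudes`, `frequencies`, and `phases` are hypothetical. It only illustrates the two properties the abstract names: the same function can be queried at any control rate, and velocity and acceleration are available in closed form rather than by differencing waypoints.

```python
import numpy as np

def make_action_function(amplitudes, frequencies, phases):
    """Build a(tau) = sum_k A_k * sin(w_k * tau + p_k) and its analytic derivatives.

    The sinusoidal form is an illustrative assumption, not the NIAF architecture.
    """
    A = np.asarray(amplitudes, dtype=float)
    w = np.asarray(frequencies, dtype=float)
    p = np.asarray(phases, dtype=float)

    def position(tau):
        t = np.asarray(tau, dtype=float)[..., None]
        return np.sum(A * np.sin(w * t + p), axis=-1)

    def velocity(tau):
        # First derivative, computed in closed form.
        t = np.asarray(tau, dtype=float)[..., None]
        return np.sum(A * w * np.cos(w * t + p), axis=-1)

    def acceleration(tau):
        # Second derivative, computed in closed form.
        t = np.asarray(tau, dtype=float)[..., None]
        return np.sum(-A * w**2 * np.sin(w * t + p), axis=-1)

    return position, velocity, acceleration

position, velocity, acceleration = make_action_function(
    amplitudes=[0.5, 0.1], frequencies=[2.0, 7.0], phases=[0.0, 0.3]
)

# Query the same trajectory at two control rates over one normalized second.
tau_10hz = np.linspace(0.0, 1.0, 11)
tau_100hz = np.linspace(0.0, 1.0, 101)
coarse = position(tau_10hz)
fine = position(tau_100hz)
```

Because both queries evaluate the same function, every tenth sample of `fine` coincides with `coarse`; no interpolation between waypoints is involved.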

Paper Structure (19 sections, 13 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: From Discrete Waypoints to Continuous Functions. Prevalent methods learn temporally discrete waypoints bound to a fixed action sampling rate. We instead model the action trajectory as a continuous time function, which offers two key advantages: 1) Resolution Independence, enabling querying at arbitrary control frequencies without interpolation artifacts; 2) Analytical Differentiability, which allows for explicit velocity supervision and jerk regularization, providing precise and dynamically feasible profiles required for impedance control.
  • Figure 2: The NIAF architecture. Instead of predicting discrete waypoints, we reformulate action generation as function regression. Operating as a hypernetwork, the MLLM serves as a hierarchical spectral modulator, transforming learnable query embeddings into modulation vectors conditioned on the multimodal context via one-step parallel decoding. These vectors dynamically reconfigure the shared meta parameters of a SIREN. Consequently, the instantiated SIREN enables querying high-fidelity actions at arbitrary frequencies by simply sampling the continuous time domain $\tau$.
  • Figure 3: Performance comparison of different action representations under an identical Florence-2 Large backbone on the most challenging settings.
  • Figure 4: Experimental Results on Real-World Robot Tasks. This figure shows the average task success rate across four real-world tasks.
  • Figure 5: Comparison of control dynamics across different methods. The top row shows joint position tracking, and the bottom row visualizes velocity and acceleration profiles. (a) & (b) Baselines (BEAST & OFT): The velocity profiles exhibit high-frequency oscillations hovering around zero, indicating disjointed stop-and-go motion and poor temporal coherence. (c) NIAF (Ours, Impedance Control): In contrast, our method produces a continuous, trend-following velocity reference (red line) that maintains consistent motion without reverting to zero-mean noise. The executed velocity (blue) aligns tightly with this analytical signal, demonstrating effective impedance tracking.
  • ...and 2 more figures
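
The Figure 2 caption describes modulation vectors reconfiguring the shared meta parameters of a SIREN, which is then queried over the continuous time domain $\tau$. The sketch below is one way this could look, under stated assumptions. The caption does not say how the modulation is applied, so a per-layer scale and shift (FiLM-style) is assumed here. The MLLM is replaced by a random stand-in, and all names and sizes (`init_meta_siren`, `query`, `fake_modulation`, `hidden=32`, `action_dim=7`, `omega0=30.0`) are hypothetical.

```python
import numpy as np

def init_meta_siren(hidden=32, depth=3, action_dim=7, omega0=30.0, seed=0):
    """Shared (meta) SIREN weights using the standard SIREN initialization."""
    rng = np.random.default_rng(seed)
    dims = [1] + [hidden] * depth
    layers = []
    for i in range(depth):
        d_in, d_out = dims[i], dims[i + 1]
        bound = 1.0 / d_in if i == 0 else np.sqrt(6.0 / d_in) / omega0
        layers.append((rng.uniform(-bound, bound, (d_in, d_out)), np.zeros(d_out)))
    bound = np.sqrt(6.0 / hidden) / omega0
    head = (rng.uniform(-bound, bound, (hidden, action_dim)), np.zeros(action_dim))
    return {"layers": layers, "head": head, "omega0": omega0}

def query(meta, modulation, tau):
    """Return actions a(tau) and analytic velocities da/dtau, each (T, action_dim).

    Assumed modulation: h <- sin(omega0 * (scale * (h W + b) + shift)).
    The derivative is propagated alongside the forward pass via the chain rule.
    """
    h = np.asarray(tau, dtype=float).reshape(-1, 1)
    dh = np.ones_like(h)  # d(tau)/d(tau) = 1
    for (W, b), (scale, shift) in zip(meta["layers"], modulation):
        pre = meta["omega0"] * (scale * (h @ W + b) + shift)
        dpre = meta["omega0"] * scale * (dh @ W)
        h, dh = np.sin(pre), np.cos(pre) * dpre
    W_out, b_out = meta["head"]
    return h @ W_out + b_out, dh @ W_out

def fake_modulation(meta, seed):
    """Random stand-in for the modulation vectors an MLLM would produce."""
    rng = np.random.default_rng(seed)
    return [
        (1.0 + 0.1 * rng.standard_normal(W.shape[1]),
         0.1 * rng.standard_normal(W.shape[1]))
        for W, _ in meta["layers"]
    ]

meta = init_meta_siren()
mod_a = fake_modulation(meta, seed=1)  # e.g. one multimodal context
mod_b = fake_modulation(meta, seed=2)  # e.g. a different context

tau = np.linspace(0.0, 1.0, 50)  # any sampling density can be used here
actions_a, velocities_a = query(meta, mod_a, tau)
actions_b, _ = query(meta, mod_b, tau)
```

The same `meta` weights yield different trajectories under `mod_a` and `mod_b`, and changing the number of points in `tau` changes only how densely the instantiated function is sampled.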