Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning

Hengkai Tan, Songming Liu, Kai Ma, Chengyang Ying, Xingxing Zhang, Hang Su, Jun Zhu

TL;DR

The paper tackles data inefficiency and high inference latency in Transformer-based embodied RL by introducing FCNet, a frequency-domain model that uses STFT-based causal spectral convolution to focus on the low-frequency content of robot trajectories. It achieves parallel training via FFT and recurrent inference via Sliding DFT, yielding training complexity of $O(m n \log n + m^2 n)$ and per-step inference complexity of $O(m)$, with $m \ll n$. Empirically, FCNet matches or exceeds Transformer performance on offline RL benchmarks and a large, diverse legged-robot dataset, while delivering substantially lower inference latency and robust sim-to-real transfer. The work demonstrates that frequency-domain modeling can offer practical gains for real-time, data-efficient embodied learning and provides a foundation for scalable, real-world robotic policies. FCNet’s low-frequency inductive bias and computational advantages open avenues for pre-training on large robotics corpora and extending to multimodal inputs in future work.

Abstract

The Transformer has shown promise in reinforcement learning for modeling time-varying features and obtaining generalized low-level robot policies from diverse robotics datasets in embodied learning. However, it still suffers from low data efficiency and high inference latency. In this paper, we investigate the task from the new perspective of the frequency domain. We first observe that the energy density in the frequency domain of a robot's trajectory is mainly concentrated in the low-frequency part. Then, we present the Fourier Controller Network (FCNet), a new network that uses the Short-Time Fourier Transform (STFT) to extract and encode time-varying features through frequency-domain interpolation. To enable real-time decision-making, we further adopt FFT and Sliding DFT methods in the model architecture to achieve parallel training and efficient recurrent inference. Extensive results in both simulated (e.g., D4RL) and real-world environments (e.g., robot locomotion) demonstrate FCNet's substantial efficiency and effectiveness over existing methods such as Transformer; e.g., FCNet outperforms Transformer on multi-environment robotics datasets of all sizes (from 1.9M to 120M). The project page and code can be found at https://thkkk.github.io/fcnet.
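The recurrent-inference idea can be illustrated with the standard Sliding DFT recurrence. The sketch below is not FCNet's implementation: it tracks only the $m$ lowest modes of a scalar signal over a window of length $n$, and the helper names (`dft_low_modes`, `sliding_dft_step`) and the toy signal are assumptions made for illustration. When the window slides by one sample, each tracked mode is updated as $X_k \leftarrow (X_k - x_{\text{old}} + x_{\text{new}})\, e^{2\pi i k / n}$, which costs $O(m)$ per step instead of recomputing the transform.

```python
import cmath
import math

def dft_low_modes(window, m):
    """Direct DFT of `window`, keeping only modes k = 0..m-1 (O(m * n))."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * j / n) for j, x in enumerate(window))
        for k in range(m)
    ]

def sliding_dft_step(modes, x_old, x_new, n):
    """Update the tracked modes in O(m) when the window slides by one sample.

    x_old is the sample leaving the window, x_new the sample entering it.
    """
    return [
        (X_k - x_old + x_new) * cmath.exp(2j * math.pi * k / n)
        for k, X_k in enumerate(modes)
    ]

# Toy setup (hypothetical values): window length n, number of kept modes m.
n, m = 16, 4
signal = [math.sin(0.3 * t) + 0.1 * t for t in range(n + 5)]

# Initialise from the first full window, then slide one sample at a time.
modes = dft_low_modes(signal[:n], m)
for t in range(n, len(signal)):
    modes = sliding_dft_step(modes, signal[t - n], signal[t], n)

# Cross-check against a direct DFT of the final window.
reference = dft_low_modes(signal[-n:], m)
max_err = max(abs(a - b) for a, b in zip(modes, reference))
print(f"max deviation from direct DFT: {max_err:.2e}")
```

The cross-check at the end compares the recurrent update with a direct DFT of the last window; the two should agree up to floating-point error. In a model with vector-valued hidden features, the same update would presumably be applied per feature channel.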

Paper Structure (40 sections, 13 equations, 10 figures, 8 tables)

This paper contains 40 sections, 13 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Energy density (normalized to $0\%\sim100\%$ for all modes) in the frequency domain of different physical quantities across various motions. We choose $n = 256$ as the context length in the time domain. The time domain data represented in each of the four plots include: (a) Rotational motion with constant angular acceleration. (b) Simple harmonic motion. (c) Body angular velocity during a quadrupedal robot's run. (d) Joint angle of walker2d-expert-v2 in D4RL dataset.
  • Figure 2: The overall model architecture of FCNet. To keep training and inference efficient, we do not apply the Fourier transform to the entire trajectory history; instead, we apply STFT to a window of historical data of length $n$ to filter out the high-frequency part in the frequency domain, and then apply a linear transform and the inverse STFT to return to the time domain. That is, the CSC block operates in the frequency domain to model temporal features, while the FFN models features along the hidden dimension. $P$ is the point-wise encoder and $Q$ is the point-wise decoder mentioned in Sec. \ref{sec:overall_arch}. $\sigma$ is the activation function.
  • Figure 3: Treating the MDP as a sequence modeling problem. Especially in a continuous state space such as embodied learning, it makes sense to perform Fourier-based modeling and training in the frequency domain of time-domain sequences within a window $[t, t+n-1]$. In the inference phase $T=t+n-1\rightarrow T=t+n$, the sliding window in Sec. \ref{sec:inference} is used for efficient inference.
  • Figure 4: The performance of each model on the legged robotics dataset (measured by mean return, averaging the results across 1500*3 trajectories). For all future experiments, the 60M-step dataset is utilized as the standard reference.
  • Figure 5: Deploying FCNet to real-world legged robots.
  • ...and 5 more figures
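The kind of measurement described in the Figure 1 caption can be reproduced in miniature for the simple-harmonic-motion case. The sketch below uses the caption's context length $n = 256$; everything else is an assumption for illustration, including the choice of 2.5 oscillation cycles per window, the one-sided normalization over modes $0..n/2$, and the cutoff of 10 modes. It is not the authors' analysis code.

```python
import cmath
import math

def energy_density(signal):
    """Normalized energy |X_k|^2 of each one-sided DFT mode k = 0..n/2."""
    n = len(signal)
    energies = []
    for k in range(n // 2 + 1):
        X_k = sum(
            x * cmath.exp(-2j * math.pi * k * j / n) for j, x in enumerate(signal)
        )
        energies.append(abs(X_k) ** 2)
    total = sum(energies)
    return [e / total for e in energies]

n = 256        # context length, as in the Figure 1 caption
cycles = 2.5   # hypothetical: oscillation cycles inside one window

# Simple harmonic motion sampled over one window.
signal = [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

density = energy_density(signal)
low_fraction = sum(density[:10])  # share of energy in the 10 lowest modes
print(f"share of energy in the 10 lowest modes: {low_fraction:.4f}")
```

Because the chosen oscillation does not complete a whole number of cycles in the window, some energy leaks into neighbouring modes, yet almost all of it stays in the lowest few. This is the property that motivates keeping only $m \ll n$ modes.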