Dynamics-Aligned Shared Hypernetworks for Zero-Shot Actuator Inversion

Jan Benad, Pradeep Kr. Banerjee, Frank Röder, Nihat Ay, Martin V. Butz, Manfred Eppe

TL;DR

The paper proposes DMA*-SH, a framework in which a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to actuator inversion.

Abstract

Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode is actuator inversion, where identical actions produce opposite physical effects under a latent binary context. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via an expressivity separation result for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under actuator inversion. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate discontinuous context-to-dynamics interactions. On AIB's held-out actuator-inversion tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 111.8% and surpassing a standard context-aware baseline by 16.1%.
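
The shared-modulation idea in the abstract can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' implementation: the linear hypernetwork, the residual bottleneck form of the adapter, the `Head` class, all dimensions, and the two context embeddings are hypothetical choices made for this sketch. It shows only the wiring: adapter weights `omega` are generated once from the context embedding `z` and reused by the dynamics model, the policy, and the critic.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rng, n_out, n_in, scale=0.5):
    """Random weights; stands in for trained parameters in this sketch."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]

class Hypernetwork:
    """Hypothetical linear hypernetwork: z -> omega = (down, up)."""
    def __init__(self, rng, d_z, d_feat, d_bottleneck):
        self.d_feat, self.d_b = d_feat, d_bottleneck
        self.W = rand_matrix(rng, 2 * d_feat * d_bottleneck, d_z)

    def __call__(self, z):
        flat = matvec(self.W, z)
        k = self.d_feat * self.d_b
        # Reshape the flat output into a down-projection and an up-projection.
        down = [flat[i * self.d_feat:(i + 1) * self.d_feat] for i in range(self.d_b)]
        up = [flat[k + i * self.d_b:k + (i + 1) * self.d_b] for i in range(self.d_feat)]
        return down, up

def adapter(omega, h):
    """Residual bottleneck adapter (assumed form): h + up @ relu(down @ h)."""
    down, up = omega
    u = [max(0.0, v) for v in matvec(down, h)]
    return [hi + ri for hi, ri in zip(h, matvec(up, u))]

class Head:
    """A network with its own trunk and output layer, modulated by the shared adapter."""
    def __init__(self, rng, d_in, d_feat, d_out):
        self.trunk = rand_matrix(rng, d_feat, d_in)
        self.out = rand_matrix(rng, d_out, d_feat)

    def __call__(self, x, omega):
        h = [max(0.0, v) for v in matvec(self.trunk, x)]
        return matvec(self.out, adapter(omega, h))

rng = random.Random(0)
d_s, d_a, d_z, d_feat, d_b = 3, 1, 2, 4, 2
hyper = Hypernetwork(rng, d_z, d_feat, d_b)
dynamics = Head(rng, d_s + d_a, d_feat, d_s)  # (s, a) -> predicted next state
policy = Head(rng, d_s, d_feat, d_a)          # s -> action
critic = Head(rng, d_s + d_a, d_feat, 1)      # (s, a) -> Q-value

s, a = [0.3, -0.1, 0.8], [0.5]
z_normal, z_inverted = [1.0, 0.0], [-1.0, 0.0]  # made-up context embeddings

for z in (z_normal, z_inverted):
    omega = hyper(z)  # generated once, then shared by all three networks
    # In an autodiff framework, omega would pass through a stop-gradient
    # before the policy and critic calls, so RL losses do not update the hypernetwork.
    print(dynamics(s + a, omega), policy(s, omega), critic(s + a, omega))
```

Because the three networks keep separate trunks and output layers, only the small adapter is context-dependent, which is what keeps the generated weight set small in this sketch.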

Paper Structure (74 sections, 6 theorems, 48 equations, 27 figures, 9 tables, 2 algorithms)

Key Result

Theorem 1.1

Let $\mathcal{H}_{\mathrm{concat}}$ be the class of functions $f: \mathbb{R}^{d_s} \times \mathbb{R}^{d_z} \to \mathbb{R}$ realized by finite ReLU MLPs on input $[s; z]$. Let $\mathcal{H}_{\mathrm{hyper}}$ be the class of functions of the form where: Assume $s \in \mathcal{S} \subset \mathbb{R}^{d_s}$ and $z \in \mathcal{Z} \subset \mathbb{R}^{d_z}$ range over compact sets with non-empty interio

Figures (27)

  • Figure 1: (a) In vanilla DMA, the inferred context $z_t$ is concatenated to the RL inputs. (b) In DMA*-SH, a hypernetwork $h_\eta$ conditioned on $z_t$ generates adapter weights $\omega$ that are used by the forward dynamics model and the RL networks. The context encoder and hypernetwork are trained via the reconstruction objective $L_{\phi,\theta,\eta}$ in Eq. \ref{eq:hnopt}, while during RL updates gradients through $\omega=h_\eta(z_t)$ are stopped so that the policy and critic losses do not backpropagate to $\eta$ (or to $z_t$) through the shared adapter pathway $z_t \to \omega \to (\pi, Q)$.
  • Figure 2: Interquartile mean (IQM) with $95\%$ confidence intervals computed from AER scores, aggregated across environments. Results are reported for the three context sets $\mathcal{C}_{\text{train}}$, $\mathcal{C}_{\text{eval-in}}$, and $\mathcal{C}_{\text{eval-out}}$, comparing DMA*, DMA*-SH, and all baselines.
  • Figure 3: Informativeness $I(z_t; c)$, Variability, Representation-Overlap ($\mathrm{RO}$), and episodic returns for the context set $\mathcal{C}_{\text{eval-out}}$.
  • Figure 4: DI (non-overlapping): t-SNE of inferred embeddings $z_t$ (top) and cosine similarity heatmaps of per-context mean embeddings (bottom; Eq. \ref{eq:pairwise_cosim}) for DMA, DMA*, and DMA*-SH. DMA*-SH shows stronger within-mode alignment across the continuous mass dimension while maintaining separation between actuator-inversion modes, consistent with the compression/separation terms in Theorem \ref{thm:var_decomp_SU}. Mass clusters overlap more for DMA*-SH, yet returns are higher, consistent with mass having largely overlapping policy effects.
  • Figure 5: Implicit regularization of RL via a dynamics-trained shared hypernetwork. Top: Mean policy context sensitivity in shared embedding space $\mathbb{E}\|\nabla_z L_\pi\|$. Bottom: Episodic returns.
  • ...and 22 more figures
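
Two quantities named in the captions above, the interquartile mean (Figure 2) and the cosine similarity between per-context mean embeddings (Figure 4), can be sketched as follows. This is an illustrative reading and not the paper's evaluation code: the trimming convention in `iqm`, the helper names, and the toy embeddings are assumptions, and the paper's own pairwise cosine-similarity equation may differ in detail.

```python
import math

def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    xs = sorted(scores)
    cut = int(0.25 * len(xs))  # number of values dropped from each tail (floor)
    middle = xs[cut:len(xs) - cut]
    return sum(middle) / len(middle)

def mean_embedding(embeddings):
    """Coordinate-wise mean of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_heatmap(embeddings_by_context):
    """Pairwise cosine similarity between per-context mean embeddings."""
    means = [mean_embedding(e) for e in embeddings_by_context.values()]
    return [[cosine(m_i, m_j) for m_j in means] for m_i in means]

# The outlier 100 falls in the trimmed tail, so it does not affect the result.
print(iqm([0, 1, 2, 3, 4, 5, 6, 100]))  # 3.5

# Made-up 2-D embeddings keyed by (actuator mode, mass).
toy = {
    ("normal", 1.0): [[1.0, 0.2], [0.8, 0.0]],
    ("normal", 2.0): [[1.0, -0.2], [1.2, 0.0]],
    ("inverted", 1.0): [[-1.0, 0.1], [-0.8, -0.1]],
}
for row in cosine_heatmap(toy):
    print([round(v, 3) for v in row])
```

In the toy data, the two "normal" contexts have nearly parallel mean embeddings while the "inverted" context points the opposite way, mirroring the within-mode alignment and between-mode separation that the Figure 4 caption describes.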

Theorems & Definitions (23)

  • Theorem 1.1: Separation of hypernetwork-adapter and concatenation hypothesis classes
  • proof
  • Remark 1.2: SAC with a shared hypernetwork-conditioned bottleneck adapter
  • Remark 1.3: Hypernetwork advantage for actuator inversion
  • Remark 1.4: Parameter complexity
  • Definition 1.5: Overlapping and Non-Overlapping Contexts
  • Definition 1.6: Actuator inversion
  • Lemma 1.7: Failure of DR under actuator inversion
  • proof
  • Remark 1.8: Context-unaware policies are epistemic POMDP solvers
  • ...and 13 more
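
Only the titles of Definition 1.6 and Lemma 1.7 appear in this listing, so the following is a toy illustration based on the abstract's description of actuator inversion ("identical actions produce opposite physical effects under a latent binary context"), not a restatement of either result. The 1-D dynamics, the gain, and the two policies are hypothetical. The sketch shows why a single context-unaware controller cannot stabilize both modes, while one that conditions on the context can.

```python
def step(s, a, c):
    """Toy 1-D dynamics: context c in {+1, -1} flips the sign of the action's effect."""
    return s + c * a

def rollout(policy, c, s0=1.0, horizon=20):
    """Run the policy for a fixed horizon and return the final state."""
    s = s0
    for _ in range(horizon):
        s = step(s, policy(s, c), c)
    return s

def context_unaware(s, c, gain=0.5):
    """Pushes toward the origin assuming a non-inverted actuator; ignores c."""
    return -gain * s

def context_aware(s, c, gain=0.5):
    """Flips its action when the actuator is inverted."""
    return -c * gain * s

for name, policy in (("unaware", context_unaware), ("aware", context_aware)):
    for c in (+1, -1):
        print(name, c, abs(rollout(policy, c)))
```

With the unaware policy, the state contracts by a factor of 0.5 per step when `c = +1` but grows by a factor of 1.5 per step when `c = -1`; the aware policy contracts in both modes.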