Weber-Fechner Law in Temporal Difference learning derived from Control as Inference

Keiichiro Takahashi, Taisuke Kobayashi, Tomoya Yamanokuchi, Takamitsu Matsubara

TL;DR

This work identifies a hidden nonlinear term in TD learning derived from a control-as-inference perspective and shows that it induces Weber-Fechner law (WFL)-like updates, where perceptual update strength scales logarithmically with the value baseline. By formulating a reward-punishment framework and addressing unknown reward bounds, the authors derive computable gradients that embed the nonlinear term $\delta_\mathrm{ln}$ into value and policy updates, yielding practical WFL behavior. Through analytical (Taylor expansion) and numerical analyses, plus toy and robotic experiments, the paper demonstrates that WFL can accelerate early reward acquisition while suppressing punishments during learning, improving both learning speed and task success in simulated and real-world robotic tasks. These findings offer a biologically plausible mechanism for biased learning in RL and suggest avenues for more robust, efficient policy optimization in uncertain environments.

Abstract

This paper investigates a novel nonlinear update rule based on temporal difference (TD) errors in reinforcement learning (RL). In standard RL, the degree of updates is linearly proportional to the TD error, so all rewards are treated equally without any bias. On the other hand, recent biological studies have revealed nonlinearities between the TD error and the degree of updates, which bias policies toward optimism or pessimism. Such biases in learning due to nonlinearities are expected to be useful and intentionally retained features of biological learning. Therefore, this research explores a theoretical framework that can leverage the nonlinearity between the degree of updates and TD errors. To this end, we focus on the control as inference framework, since it is known as a generalized formulation encompassing various RL and optimal control methods. In particular, we investigate the uncomputable nonlinear term that must be approximately excluded when deriving standard RL from control as inference. Analyzing this term reveals the Weber-Fechner law (WFL): perception (i.e., the degree of updates) in response to a stimulus change (i.e., the TD error) is attenuated as the stimulus intensity (i.e., the value function) increases. To numerically reveal the utility of WFL in RL, we then propose a practical implementation that uses a reward-punishment framework and modifies the definition of optimality. Analysis of this implementation reveals that two benefits can be expected: i) rewards increase to a certain level early, and ii) punishments are sufficiently suppressed. We finally investigate and discuss these expected benefits through simulations and robot experiments. The proposed RL algorithm with WFL exhibits the expected benefits: it accelerates the early stage of reward maximization and continues to suppress punishments during learning.
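
The contrast between a linear update and a Weber-Fechner-style update can be sketched numerically. The snippet below is only an illustration of the classical Weber-Fechner law, in which perception is $p = k \ln(S / S_0)$, so a change from $S$ to $S + \Delta S$ shifts perception by $k \ln(1 + \Delta S / S)$. The function names, the constant `k`, and the direct use of a positive `intensity` argument as the stimulus intensity are assumptions made for this sketch; they are not the paper's definition of the nonlinear term or of how the value function enters it.

```python
import math


def linear_update(td_error: float) -> float:
    """Standard TD learning: the degree of update is proportional to the TD error."""
    return td_error


def weber_fechner_update(td_error: float, intensity: float, k: float = 1.0) -> float:
    """Classical Weber-Fechner response to a stimulus change (illustrative only).

    Perception is p = k * ln(S / S0), so moving the stimulus from S to S + dS
    changes perception by k * ln(1 + dS / S). Here dS plays the role of the
    TD error and S the role of the stimulus intensity (an assumption).
    """
    if intensity <= 0 or intensity + td_error <= 0:
        raise ValueError("intensity and intensity + td_error must be positive")
    return k * math.log1p(td_error / intensity)


if __name__ == "__main__":
    td_error = 0.5
    for intensity in (1.0, 2.0, 5.0, 10.0):
        lin = linear_update(td_error)
        wfl = weber_fechner_update(td_error, intensity)
        # The linear update ignores the intensity; the WFL-style one shrinks with it.
        print(f"intensity={intensity:5.1f}  linear={lin:.3f}  wfl={wfl:.3f}")
```

For a fixed TD error, the linear update stays constant across intensities, while the logarithmic response decreases as the intensity grows, which is the attenuation the abstract describes.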

Paper Structure

This paper contains 19 sections, 20 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Effects of the nonlinear term $\delta_\mathrm{ln}$: when $\delta$ is dominant, the contour lines depicting the degree of updates are almost equally spaced and parallel to the line $V=Q$; as the influence of $\delta_\mathrm{ln}$ increases, the contour lines radiate out from the upper bound $\overline{R}$.
  • Figure 2: Reward-punishment framework: scalar rewards in a subset of real space are treated by distinguishing between positive and negative ones.
  • Figure 3: Weber-Fechner law with the lower bound: with a large $\lambda_\beta$, the contour lines become narrower as $V$ and $Q$ approach their lower bound $\underline{R}$, and wider as $V$ and $Q$ move farther from $\underline{R}$.
  • Figure 4: Learning results of Pendulum-v0 with different $\beta_0^{+,-}$: the upper and lower curves depict the learning curves for the episodic averages of $r^+$ and $r^-$, respectively; when $\beta_0^{+,-}$ was small, the pursuit of $r^+$ became slower and $r^-$ was preferentially suppressed to zero.
  • Figure 5: Learning results of Pendulum-v0 with the asymmetric $\beta_0^{+,-}$: thanks to the small $\beta_0^- = 0.1$, the deterioration of $r^-$ was restricted; although exploration was somewhat limited, the large $\beta_0^+ = 10$ allowed $r^+$ to be maximized to the same level as with the conventional method.
  • ...and 4 more figures
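
The contour geometry described in the Figure 1 caption can be made explicit with a short worked derivation. This is a sketch under assumptions: it takes $\delta$ to depend only on $Q - V$ (as "parallel to the line $V=Q$" suggests), assumes $V, Q < \overline{R}$, and uses a logarithmic ratio as one illustrative function whose contours radiate from $\overline{R}$. The functions $g$, $h$, $f_\delta$, and $f_\mathrm{ln}$ are hypothetical names introduced here, and the logarithmic form is not necessarily the paper's exact definition of $\delta_\mathrm{ln}$.

```latex
% Contours parallel to V = Q: the update depends only on the difference.
f_\delta(V, Q) = g(Q - V)

% Contours radiating from the point (\overline{R}, \overline{R}):
% the update depends only on the ratio of the gaps to the bound.
f_\mathrm{ln}(V, Q) = h\!\left( \frac{\overline{R} - Q}{\overline{R} - V} \right)

% Illustrative choice h = -\ln, with \delta = Q - V:
\ln \frac{\overline{R} - V}{\overline{R} - Q}
  = -\ln\!\left( 1 - \frac{\delta}{\overline{R} - V} \right)
  \approx \frac{\delta}{\overline{R} - V}
  \quad \text{for } |\delta| \ll \overline{R} - V
```

Under this reading, the same $\delta$ produces a different degree of update depending on the gap between $V$ and the bound, which is the ratio-type sensitivity characteristic of the Weber-Fechner law. Replacing the gap with $V - \underline{R}$ gives an analogous picture for the lower bound in Figure 3, where the contours are narrowest near $\underline{R}$.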