Biological Neurons Compete with Deep Reinforcement Learning in Sample Efficiency in a Simulated Gameworld

Moein Khajehnejad; Forough Habibollahi; Aswin Paul; Adeel Razi; Brett J. Kagan

TL;DR

The study compares sample efficiency between DishBrain in vitro neural cultures and three deep reinforcement learning algorithms (DQN, A2C, PPO) in a Pong-like task under identical real-time sample budgets. The biological cultures learn faster and perform better across multiple input densities, suggesting higher sample efficiency than contemporary RL methods. The discussion situates these results within broader debates on biologically plausible learning mechanisms and highlights active inference as a promising, biologically inspired alternative. Methodologically, the work combines a high-density MEA-based closed-loop platform with varied input encodings and extensive RL hyperparameter exploration, pointing to synthetic biological intelligence (SBI) systems as a compelling direction for real-time, energy-efficient learning with potential implications for AI algorithm development.

Abstract

How do biological systems and machine learning algorithms compare in the number of samples required to show significant improvements in completing a task? We compared the learning efficiency of in vitro biological neural networks to state-of-the-art deep reinforcement learning (RL) algorithms in a simplified simulation of the game 'Pong'. Using DishBrain, a system that embodies in vitro neural networks with in silico computation using a high-density multi-electrode array, we contrasted the learning rate and the performance of these biological systems against time-matched learning from three state-of-the-art deep RL algorithms (i.e., DQN, A2C, and PPO) in the same game environment. This allowed a meaningful comparison between biological neural systems and deep RL. We find that when samples are limited to a real-world time course, even these very simple biological cultures outperformed deep RL algorithms across various game performance characteristics, implying a higher sample efficiency. Ultimately, even when tested across multiple types of information input to assess the impact of higher dimensional data input, biological neurons showcased faster learning than all deep reinforcement learning agents.

Paper Structure (22 sections, 5 equations, 14 figures, 1 table, 3 algorithms)

Introduction
Results
Comparison in performance between DishBrain and three RL algorithms with various information densities
Examining impact of paddle movement speed on learning rates
Discussion
Methods
DishBrain System
Deep Reinforcement Learning Algorithms
Data Availability
Code Availability
Supplementary information
Acknowledgments
Competing interests
Author contributions
Supplementary Materials
...and 7 more sections

Figures (14)

Figure 1: DishBrain system and various input designs to RL algorithms. a) DishBrain feedback loop setup, electrode configuration, and predefined sensory and motor regions. Figures adapted and modified from Kagan et al. (2022). b) Schematic comparing the information input routes in the DishBrain system (left) and the three implementations of the deep RL algorithms (right). In each design, the input information to the computing module (deep RL algorithms or DishBrain) is denoted by a vector $I$.
Figure 2: Image Input to the deep RL algorithms. a) Schematic highlighting that the figure compares the biological DishBrain system with a pixel-based information input to the RL algorithms. Average b) number of hits-per-rally, c) $\%$ of aces, and d) $\%$ of long rallies over 20 minutes real-time equivalent of training for DQN, A2C, PPO, and the MCC and HCC cultures. A regression line on the mean values with a 95% confidence interval highlights the learning trends. Across all groups, the highest average hits-per-rally is achieved by the neuronal MCC and HCC cultures, while PPO is outperformed by all other groups. The average $\%$ of aces is lowest for the neuronal cultures compared to all deep RL baseline methods. The average $\%$ of long rallies reaches its highest levels for MCC and HCC. e) Average performance of groups over time. Only the biological cultures show a significant within-group improvement in performance in the second time interval (One-way ANOVA test, p = 5.854e-6 and p = 7.936e-17 for MCC and HCC, respectively; p = 0.231, p = 0.318, and p = 0.400 for DQN, A2C, and PPO, respectively). f) Average $\%$ of aces within groups and over time. Only MCC and HCC (One-way ANOVA test, p = 0.014 and p = 2.907e-08, respectively) differed significantly over time. No significant change was detected within the DQN, A2C, or PPO groups (One-way ANOVA test, p = 0.080, p = 0.195, and p = 0.308, respectively). g) Average $\%$ of long rallies ($\geq$ 3) performed in a session. All groups showed an increase in the average number of long rallies, but this within-group increase was significant only for MCC, HCC, and A2C (One-way ANOVA test, p = 1.172e-7 and p = 1.525e-24 for MCC and HCC, respectively, and p = 0.605, p = 0.002, and p = 0.684 for DQN, A2C, and PPO, respectively). *$p < 0.05$, **$p < 0.01$, and ***$p < 0.001$. h) Pairwise Tukey's post-hoc test shows that the HCC and MCC groups significantly outperform PPO, A2C, and DQN in the last 15-minute interval. i) Using pairwise Tukey's post-hoc test, the HCC group significantly outperforms PPO in the last 15-minute interval with a lower average $\%$ of aces. A2C also outperforms PPO in this time interval. j) Pairwise comparison using Tukey's test shows a significant difference in the $\%$ of long rallies only between HCC and the rest of the groups in the first 5 minutes. This later changes, with all groups showing an increased $\%$ of long rallies and MCC outperforming PPO in the last 15 minutes of the game. Box plots show the interquartile range, with bars demonstrating 1.5X the interquartile range; the line marks the median and the black triangle marks the mean. Error bands = 1 SE.
Figure 3: Paddle&Ball Position Input to the deep RL algorithms. a) Schematic highlighting that the figure compares the biological DishBrain system with the Paddle&Ball Position Input to the RL algorithms. Average b) number of hits-per-rally, c) $\%$ of aces, and d) $\%$ of long rallies over 20 minutes real-time equivalent of training for DQN, A2C, PPO, and the MCC and HCC cultures. A regression line on the mean values with a 95% confidence interval highlights the learning trends. The highest average hits-per-rally is achieved by the MCC and HCC cultures. The average $\%$ of aces is lowest for the neuronal cultures compared to all deep RL baseline methods. The average $\%$ of long rallies reaches its highest levels for MCC and HCC. e) Average rally length over time showed a significant increase between the two time intervals only in the biological cultures (One-way ANOVA test, p = 0.913, p = 0.958, and p = 0.610 for DQN, A2C, and PPO, respectively). f) Average $\%$ of aces within groups and over time showed a significant difference only in the MCC and HCC groups. No significant change was detected within the DQN, A2C, or PPO groups (One-way ANOVA test, p = 0.463, p = 0.338, and p = 0.544, respectively). g) Average $\%$ of long rallies ($\geq$ 3) performed in a session increased in the second time interval in all groups. This within-group difference was significant only for the MCC and HCC groups (One-way ANOVA test, p = 1.172e-7, p = 1.525e-24, p = 0.233, p = 0.320, and p = 0.650 for MCC, HCC, DQN, A2C, and PPO, respectively). *$p < 0.05$, **$p < 0.01$, and ***$p < 0.001$. h) Pairwise Tukey's post-hoc test shows that the HCC group is significantly outperformed by A2C and PPO in the first 5 minutes in terms of hit counts. The biological cultures, however, perform significantly better than all deep RL groups in the last 15-minute interval. i) Using pairwise Tukey's post-hoc test, the HCC group significantly outperforms the DQN and A2C groups in the last 15-minute interval with a lower average $\%$ of aces. DQN is also outperformed by the MCC group in this time interval. j) Pairwise comparison using Tukey's test shows a significant difference in the $\%$ of long rallies between HCC and the rest of the groups in the first 5 minutes, with all of them outperforming HCC. This changes in the last 15 minutes, where only MCC significantly outperforms PPO with an increased $\%$ of long rallies. Box plots show the interquartile range, with bars demonstrating 1.5X the interquartile range; the line marks the median and the black triangle marks the mean. Error bands = 1 SE.
Figure 4: Ball Position Input to the deep RL algorithms. a) Schematic highlighting that the figure compares the biological DishBrain system with the Ball Position Input to the RL algorithms. Average b) number of hits-per-rally, c) $\%$ of aces, and d) $\%$ of long rallies over 20 minutes real-time equivalent of training for DQN, A2C, PPO, and the MCC and HCC cultures. A regression line on the mean values with a 95% confidence interval highlights the learning trends. The highest average hits-per-rally is achieved by the neuronal MCC and HCC cultures. The average $\%$ of aces is lowest for the neuronal cultures compared to all deep RL baseline methods. The average $\%$ of long rallies reaches its highest levels for MCC and HCC. Compared with the same findings for the HCC and MCC groups, e) average rally length over time showed a significant increase between the two time intervals only in the biological cultures (One-way ANOVA test, p = 0.995, p = 0.812, and p = 0.547 for DQN, A2C, and PPO, respectively). f) Average $\%$ of aces within groups and over time showed a significant difference only in the MCC and HCC groups. No significant change was detected within the DQN, A2C, or PPO groups (One-way ANOVA test, p = 0.241, p = 0.581, and p = 0.216, respectively). g) Average $\%$ of long rallies ($\geq$ 3) performed in a session increased in the second time interval in all groups except DQN. This within-group difference was significant only for the MCC, HCC, and A2C groups, with p = 0.002 for the A2C group. *$p < 0.05$, **$p < 0.01$, and ***$p < 0.001$. h) Pairwise Tukey's post-hoc test shows that the biological cultures significantly outperform all deep RL groups in the last 15 minutes in terms of hit counts or rally length. i) Using pairwise Tukey's post-hoc test, the HCC group significantly outperforms all the deep RL groups in the last 15-minute interval, while MCC also outperforms DQN, with a lower average $\%$ of aces. j) Pairwise comparison using Tukey's test shows that all groups significantly outperform HCC in the $\%$ of long rallies in the first 5 minutes. In the second time interval, MCC shows a significantly higher $\%$ of long rallies compared to DQN, with HCC now being outperformed only by A2C. Box plots show the interquartile range, with bars demonstrating 1.5X the interquartile range; the line marks the median and the black triangle marks the mean. Error bands = 1 SE.
Figure 5: Paddle movement and relative improvement. The average paddle movement in pixels in all the different groups for the a) Image Input, c) Paddle&Ball Position Input, and e) Ball Position Input to the deep RL algorithms. Tukey's post-hoc test showed that DQN, PPO, and A2C had a significantly higher average paddle movement compared to HCC and MCC in all scenarios. Relative improvement (%) in the average hit counts between the first 5 minutes and the last 15 minutes of all sessions in each separate group for the b) Image Input, d) Paddle&Ball Position Input, and f) Ball Position Input to the deep RL algorithms. The biological groups show higher improvements, with HCC outperforming all. b) Using the Games-Howell post-hoc test, the inter-group differences were significant, with HCC outperforming all other groups and MCC significantly outperforming PPO. d) HCC showed a significantly higher relative improvement compared to all the other groups, while MCC also outperformed A2C and PPO in terms of relative improvement over time. f) Finally, HCC still performed significantly better than all the deep RL groups with the Ball Position Input design, with MCC outperforming PPO and DQN in this design. Distribution of frequency of mean summed hits per minute amongst groups for g) biological cultures and deep RL algorithms with h) Image Input, i) Paddle&Ball Position Input, and j) Ball Position Input.
...and 9 more figures
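
The game metrics named in the captions above (hits-per-rally, % of aces, % of long rallies, and relative improvement between the first 5 minutes and the last 15 minutes) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names and the input format (a list of hit counts, one per rally) are hypothetical. Two definitions are assumptions: an ace is taken to be a rally with zero hits, and relative improvement is taken to be the percentage change from the early-interval mean to the late-interval mean. Only the long-rally threshold ($\geq$ 3) comes from the captions.

```python
from statistics import mean

# From the figure captions: a long rally has at least 3 hits.
LONG_RALLY_MIN_HITS = 3


def session_metrics(rally_hits):
    """Summarise one interval of play.

    rally_hits: list of ints, the number of paddle hits in each rally
    (hypothetical input format, chosen for illustration).
    """
    if not rally_hits:
        raise ValueError("need at least one rally")
    n = len(rally_hits)
    return {
        "hits_per_rally": mean(rally_hits),
        # Assumption: an ace is a rally that ends with zero hits.
        "pct_aces": 100.0 * sum(h == 0 for h in rally_hits) / n,
        "pct_long_rallies": 100.0
        * sum(h >= LONG_RALLY_MIN_HITS for h in rally_hits)
        / n,
    }


def relative_improvement(early_mean, late_mean):
    """Percentage change from the early to the late interval.

    Assumption: improvement is measured relative to the early mean.
    """
    if early_mean == 0:
        raise ValueError("early mean must be non-zero")
    return 100.0 * (late_mean - early_mean) / early_mean


# Toy data, not taken from the paper.
first_5_min = session_metrics([0, 1, 0, 2])
last_15_min = session_metrics([1, 3, 0, 4])
improvement = relative_improvement(
    first_5_min["hits_per_rally"], last_15_min["hits_per_rally"]
)
```

In this toy session the mean hits-per-rally rises from 0.75 to 2, so the sketch reports a relative improvement of roughly 167%. The paper's actual computation may differ, for example in how sessions are averaged before the intervals are compared.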
