Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain

Trinity Chung, Yuchen Shen, Nathan C. L. Kong, Aran Nayebi

TL;DR

This work bridges tactile neuroscience and embodied AI by training task-optimized temporal networks on biomechanically faithful tactile sequences from rodent whisker simulations. Using an Encoder-Attender-Decoder (EAD) framework, it shows that ConvRNN encoders, particularly IntersectionRNN, achieve better tactile categorization and closer neural alignment with rodent somatosensory cortex than feedforward and state-space baselines. Self-supervised tactile learning, especially SimCLR with tactile augmentations, achieves neural fits comparable to supervised training, suggesting that label-free objectives provide an ethologically relevant learning signal. The study quantifies the inductive biases required for brain-like tactile representations and highlights recurrent architectures and tactile-specific SSL as key ingredients for robust tactile perception in unstructured environments.

Abstract

Tactile sensing remains far less understood in neuroscience and less effective in artificial systems compared to more mature modalities such as vision and language. We bridge these gaps by introducing a novel Encoder-Attender-Decoder (EAD) framework to systematically explore the space of task-optimized temporal neural networks trained on realistic tactile input sequences from a customized rodent whisker-array simulator. We identify convolutional recurrent neural networks (ConvRNNs) as superior encoders to purely feedforward and state-space architectures for tactile categorization. Crucially, these ConvRNN-encoder-based EAD models achieve neural representations closely matching rodent somatosensory cortex, saturating the explainable neural variability and revealing a clear linear relationship between supervised categorization performance and neural alignment. Furthermore, contrastive self-supervised ConvRNN-encoder-based EADs, trained with tactile-specific augmentations, match supervised neural fits, serving as an ethologically-relevant, label-free proxy. For neuroscience, our findings highlight nonlinear recurrent processing as important for general-purpose tactile representations in somatosensory cortex, providing the first quantitative characterization of the underlying inductive biases in this system. For embodied AI, our results emphasize the importance of recurrent EAD architectures to handle realistic tactile inputs, along with tailored self-supervised learning methods for achieving robust tactile perception with the same type of sensors animals use to sense in unstructured environments.
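
The neural alignment described above is reported in the paper as a noise-corrected RSA Pearson's $r$ (see Figure 4). As a rough illustration of the plain RSA step only, the sketch below compares representational dissimilarity matrices built from model features and from neural responses. The array sizes, the function names `rdm` and `rsa_score`, and the random data are assumptions made for illustration; the noise correction used in the paper is not reproduced here.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix: 1 - Pearson r between stimulus rows.

    responses: array of shape (n_stimuli, n_units).
    """
    return 1.0 - np.corrcoef(responses)

def rsa_score(model_features, neural_responses):
    """Pearson r between the upper triangles of the two RDMs (no noise correction)."""
    iu = np.triu_indices(model_features.shape[0], k=1)
    a = rdm(model_features)[iu]
    b = rdm(neural_responses)[iu]
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
n_stimuli = 6                                  # concave/convex x near/medium/far
neural = rng.normal(size=(n_stimuli, 40))      # hypothetical: 40 recorded units
model = neural @ rng.normal(size=(40, 128))    # hypothetical: 128 model features
score = rsa_score(model, neural)
print(f"RSA score (uncorrected): {score:.3f}")
```

With six stimuli, each RDM contributes 15 unique pairwise dissimilarities, so the uncorrected score is a correlation over 15 values and can be noisy.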

Paper Structure

This paper contains 12 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: ShapeNet Whisking Dataset. (a) (I) Using average mouse whisker array measurements from bresee_mice_morphology, (II) objects are whisked in simulation using WHISKiT zweifel2021dynamical, resulting in (III) force and torque data for sweeping 9981 ShapeNet objects from 117 categories with various sweep augmentations. The augmentations vary the (1) speed, (2) height, (3) rotation, and (4) distance of the objects relative to the whisker array. We constructed two datasets: a large, low-fidelity set with more sweep augmentations, and a small, high-fidelity set with fewer augmentations (see appendix). (b) SVM classification on up to 4 different classes of objects (cups, microwaves, chairs, and trains) for the 2 datasets shows that the classes are distinguishable (above chance).
  • Figure 2: (a) Encoder-Attender-Decoder (EAD) architecture, with task objectives of supervised categorization and self-supervised learning (SimCLR, SimSiam, autoencoding). The ConvRNN encoder includes self-recurrence at each layer, where we vary the type of RNN. (b) Types of data augmentations applied to SSL models. Given a tactile input over time $T$, our tactile augmentations flip the features vertically, horizontally, and temporally, and rotate them, while traditional image augmentations introduce Gaussian noise, color jitter, and grayscale.
  • Figure 3: Tactile Categorization Accuracy. For every model, the lighter-colored left bar represents the randomly initialized version. The best-performing model is Zhuang+GPT+Supervised (rightmost yellow bar). Models with an S4 encoder are excluded because their training losses exploded before the first epoch finished.
  • Figure 4: Model Neural Evaluation. (a) We use six different stimuli (concave/convex $\times$ near/medium/far), replicating in simulation the conditions in the mouse neural dataset. (Real images were taken from video recordings in the neural dataset rodgers2022detailed.) (b) Comparison of neural fit (noise-corrected RSA Pearson's $r$) across models. The mean animal-to-animal score is 0.18 and the maximum between all pairs of animals is 1.34. The leftmost "a2a" bar represents the mean animal-to-animal neural consistency score. For every model, the lighter-colored left bar represents the randomly initialized version.
  • Figure 5: Comparing Task Performance and Neural Fit. (a) The task performance of SSL models is about one order of magnitude below that of supervised models, yet they achieve comparable neural fit. (b) For supervised models, we observe a trend in which better task performance is associated with increased neural correspondence. Fitting a best-fit line, we find a correlation of $r=0.59$. (c) The tactile augmentations were effective in improving both the neural fit and task performance. The models could not be trained with image augmentations.
  • ...and 3 more figures
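
The tactile augmentations described in the Figure 2(b) caption (vertical, horizontal, and temporal flips, plus rotation) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the `(T, C, H, W)` layout, the array sizes, the 50% application probabilities, and the function name `tactile_augment` are all assumptions.

```python
import numpy as np

def tactile_augment(x, rng):
    """Randomly flip a tactile sequence in time and space, then optionally rotate it.

    x: array of shape (T, C, H, W). This layout is an assumption for illustration:
    T time steps, C force/torque channels, and an H x W whisker grid.
    """
    if rng.random() < 0.5:
        x = np.flip(x, axis=0)              # temporal flip (reverse the sweep)
    if rng.random() < 0.5:
        x = np.flip(x, axis=2)              # vertical flip of the whisker grid
    if rng.random() < 0.5:
        x = np.flip(x, axis=3)              # horizontal flip of the whisker grid
    if rng.random() < 0.5:
        # 180-degree rotation, chosen here because it keeps (H, W) unchanged
        x = np.rot90(x, k=2, axes=(2, 3))
    return np.ascontiguousarray(x)

rng = np.random.default_rng(0)
seq = rng.normal(size=(22, 6, 5, 7))        # hypothetical sizes
view_a = tactile_augment(seq, rng)          # two augmented views of one sequence,
view_b = tactile_augment(seq, rng)          # as a contrastive objective would use
```

Each transform only permutes entries, so the force and torque values themselves are preserved. The rotation angles actually used in the paper are not stated in this summary; the 180-degree choice here simply avoids changing the shape of a non-square grid.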