
Learning to crawl: Benefits and limits of centralized vs distributed control

Luca Gagliardi, Agnese Seminara

TL;DR

By centralizing proprioceptive feedback and control, the crawler leverages long-range correlations in the dynamics and rides the endogenous wave smoothly. The ensuing benefits are measured in terms of both speed and robustness to failure, although they come at increased computational cost.

Abstract

We present a model of a crawler consisting of several suction units distributed along a straight line and connected by springs. The suction units are rudimentary proprioceptor-actuators, which sense binary states of compression vs elongation of the springs, and can either adhere or remain idle. Muscular contraction is not controlled by the crawler, but follows an endogenous, stereotyped wave. The crawler is tasked to learn patterns of adhesion that generate thrust in response to the wave of contraction. Using tabular Q-learning, we demonstrate that crawling can be learned by trial and error, and we ask what the benefits and limitations of distributed vs centralized learning architectures are. We find that by centralizing proprioceptive feedback and control, the crawler leverages long-range correlations in the dynamics and rides the endogenous wave smoothly. The ensuing benefits are measured in terms of both speed and robustness to failure, although they come at increased computational cost. At the opposite extreme, purely distributed feedback and control leverages only local information and yields jerkier and slower crawling, although it is computationally cheap. Intermediate levels of centralization can negotiate fast and robust crawling while avoiding excessive computational burden, demonstrating the computational benefits of a hierarchical organization of crawling. Our model unveils the trade-offs between crawling speed, robustness to failure, computational cost and information exchange that may shape biological solutions for crawling and could inspire the design of robotic crawlers.
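To make the learning rule concrete, here is a minimal sketch of a textbook tabular Q-learning step for a single sucker acting as an agent. It assumes the sucker senses two adjacent binary springs (4 states) and chooses between adhering and staying idle (2 actions). The function names, the reward signal, the discount factor `gamma` and the encoding of states as integers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Assumed sizes: 2 binary springs -> 4 states; adhere / idle -> 2 actions.
N_STATES, N_ACTIONS = 4, 2

def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q[state]))

def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Standard tabular Q-learning update, applied in place to q."""
    td_target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (td_target - q[state, action])
    return q

# Usage with hypothetical values: one update starting from an all-zero table.
q = np.zeros((N_STATES, N_ACTIONS))
q_update(q, state=0, action=1, reward=1.0, next_state=2, alpha=0.5, gamma=0.9)
```

In a distributed architecture each sucker would run this step on its own small table, whereas a centralized agent would apply the same rule to a single, much larger table.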

Paper Structure

This paper contains 21 sections, 10 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: a) Sketch of the 1D crawler model. b) Sketch of the distributed architecture, including the 4 states available to each agent (individual sucker), the 2 actions the sucker controls and the 4$\times$2 Q-matrix for each sucker. c) Centralized architecture: one or more Control Centers (CCs) each control a subset of contiguous suckers. The states for each CC correspond to all combinations of compression/elongation states of the springs within the CC; the actions are all combinations of adhesion/no-adhesion of suckers within the CC. For multiagency (whether the individual agents are suckers or CCs), the "hive update" corresponds to forcing all agents to agree, i.e. to write and read the same $Q$ matrix.
  • Figure 2: Illustration of the training protocol on a 12-sucker crawler for various learning architectures. (a) Average return $G=1/|\mathcal{S}|\sum_s\max_a\left(Q(s,a)\right)$ as a function of learning episode for distributed learning architectures with the same Q matrix for all agents (hive, top) or one Q matrix per agent (standard, bottom). Top left: $G$ initially increases and then slows down; the process is repeated, increasing the number of steps per episode (colored lines), until learning reaches a flat plateau (black). Right: at the plateau, identified automatically (red dashed line), $\epsilon$ and $\alpha$ are increased again within an exploration phase that stores a set of 250 policies (each color represents a different policy). (b) Same as (a), for centralized architectures with 2 control centers, hive (top) and standard (bottom), and for a single control center (bottom, right). See \ref{tab:explorationParameters} for details on the values of the parameters.
  • Figure 3: Average velocity of the crawler center of mass vs number of suckers, for different learning architectures. Markers and error bars correspond to the average and standard deviation of performance over the 250 best learned policies for distributed architectures (left) and centralized architectures (right). Colored bands in panel b) group results obtained for the same value of $N_s$ (which have been horizontally scattered for clarity, so symbols do not overlap).
  • Figure 4: Robustness analysis of the best policy for 12-sucker crawlers with various learning architectures. a) Performance (velocity of the crawler) vs number of failing suckers. b) Performance of the optimal policy with 1 failing sucker relative to optimal performance with no failing suckers. c) Frequency with which each sucker is the dominant contributor, measured by the largest drop in performance when the sucker fails: the head is the most important sucker. d) Performance drop due to failure of the head. e) Statistical behavior of a 12-sucker crawler for different distributed and centralized architectures. Top row: for each sucker (x axis) and each compression state of its adjacent springs (y axis), the color represents the frequency with which that state was visited while performing the optimal policy. Bottom row: fraction of time a sucker adheres, color-coded from purple (always adhere) to yellow (never adhere), for each of the 12 suckers (x axis) and each of the compression states of their adjacent springs (y axis). The blank entries for head and tail indicate that these suckers only have access to one spring, hence two compression states (rather than four).
  • Figure 5: Distribution of speed for the policies learned by the 12-sucker crawler. a) Number of distinct policies learned by the 12-sucker crawler with different architectures (blue); number of times the most popular policy was learned (red). b) Detail of policies learned by the fully centralized agent. Top: plateau in $G$, showing that the optimal policy, labeled $\pi_1$, is selected at several iterations of the exploration phase of the learning protocol (same as bottom right panel in fig. 2b; a color change in the plateau marks a policy change). Bottom: histogram of the velocities across all 250 policies. Orange: number of times a crawler velocity is realized by a learned policy. Blue: number of unique policies associated with the velocity (within the bin). Bins where the blue histogram is lower than the orange histogram show identical policies that are selected multiple times in the learning protocol. c) Same as b), for the remaining learning architectures.
  • ...and 5 more figures
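
The average return $G$ defined in the Figure 2 caption and the Q-table sizes described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, a Q-table is taken to be a NumPy array of shape (states, actions), and the spring count for the fully centralized 12-sucker crawler (11) assumes that only the springs between neighboring suckers are sensed.

```python
import numpy as np

def average_return(q):
    """G = (1/|S|) * sum_s max_a Q(s, a) for a Q-table of shape (|S|, |A|)."""
    q = np.asarray(q, dtype=float)
    return float(q.max(axis=1).mean())

def q_table_shape(n_springs, n_suckers):
    """Q-table shape for one agent: binary springs give the states,
    binary adhere/idle choices per sucker give the actions."""
    return (2 ** n_springs, 2 ** n_suckers)

# Single interior sucker (distributed): 2 adjacent springs, 1 sucker.
distributed_shape = q_table_shape(n_springs=2, n_suckers=1)

# Single control center over 12 suckers (assumed 11 internal springs).
centralized_shape = q_table_shape(n_springs=11, n_suckers=12)

# G for a small hypothetical Q-table with 2 states and 2 actions.
g_example = average_return([[1.0, 2.0], [3.0, 0.0]])
```

The exponential growth of the table with the number of suckers per agent is one way to read the "increased computational cost" of centralization mentioned in the abstract.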