A3RNN: Bi-directional Fusion of Bottom-up and Top-down Process for Developmental Visual Attention in Robots

Hyogo Hiruma, Hiroshi Ito, Hiroki Mori, Tetsuya Ogata

TL;DR

$A^3$RNN addresses the emergence of human-like visual attention by coupling bottom-up (BU) saliency with top-down (TD) predictive signals within a developmental robotics framework grounded in the free-energy principle. The model fuses BU and TD cues through a Transformer-based Amalgamated Active Attention ($A^3$) module, which is integrated with a Hierarchical LSTM (H-LSTM) and a reconstruction-driven auxiliary objective that shapes stable, interpretable attention over training. Key contributions include a bi-directional BU–TD fusion mechanism, explicit decoupling and integration of BU and TD attention to avoid degenerate learning, and reconstruction-based regularization that promotes temporal and perceptual coherence. In a robotic pick-up task, the model shows more coherent and more stable attention dynamics than prior approaches, supporting the view that attention can self-organize through predictive learning. The work advances cognitively inspired robotics by offering a scalable, developmental pathway toward robust, human-like attention in embodied agents, with potential applications in more complex manipulation and perception tasks.

Abstract

This study investigates the developmental interaction between top-down (TD) and bottom-up (BU) visual attention in robotic learning. Our goal is to understand how structured, human-like attentional behavior emerges through the mutual adaptation of TD and BU mechanisms over time. To this end, we propose a novel attention model, $A^3$RNN, that integrates predictive TD signals and saliency-based BU cues through a bi-directional attention architecture. We evaluate our model in robotic manipulation tasks using imitation learning. Experimental results show that attention behaviors evolve throughout training, from saliency-driven exploration to prediction-driven direction. Initially, BU attention highlights visually salient regions, which guide TD processes; as learning progresses, TD attention stabilizes and begins to reshape what is perceived as salient. This trajectory reflects principles from cognitive science and the free-energy framework, suggesting the importance of self-organizing attention through interaction between perception and internal prediction. Although not explicitly optimized for stability, our model exhibits more coherent and interpretable attention patterns than baselines, supporting the idea that developmental mechanisms contribute to robust attention formation.

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Proposed model structure. (A) The overall structure of the model, composed of the $A^3$ module, the Hierarchical LSTM (H-LSTM) module, and the Reconstruction module. (B) Detailed structure of the $A^3$ module, where bottom-up queries and top-down queries are fused into an amalgamated query via a Transformer self-attention block.
  • Figure 2: Detailed structure of the reconstruction module. The predicted attention points of bottom-up and top-down attention are used to reconstruct the peripheral and foveal images, respectively.
  • Figure 3: Comparison of the behaviors of top-down and bottom-up attention, visualized as red circles on the image and as attention maps, respectively. Each column represents a different training epoch and each row represents a different timestep. In the attention maps, brighter colors represent higher values, which indicate higher importance scores allocated to those pixels.
  • Figure 4: Setup of the simulation experiment. The robot arm was trained on a simple task of picking up a wooden box placed at one of three different locations.
  • Figure 5: Developmental transition of the average similarity between the amalgamated queries (Attention 1-4) and the BU pseudo queries.
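
The fusion described in Figure 1(B) and the similarity measure plotted in Figure 5 can be illustrated with a small, hedged sketch. This is not the authors' implementation: it assumes single-head scaled dot-product self-attention with identity projections (no learned weights), reads one amalgamated query off each TD token position, and uses cosine similarity as the similarity measure. The function names (`fuse_queries`, `mean_bu_similarity`), the toy 2-D queries, and the use of raw BU queries in place of the paper's "BU pseudo queries" are all hypothetical choices made only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K and V are the tokens themselves (identity projections),
    so every output token is a convex combination of the input tokens.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in tokens])
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


def fuse_queries(bu_queries, td_queries):
    """Mix BU and TD queries in one attention pass.

    Assumption: the outputs at the TD positions are taken as the
    amalgamated queries, one per TD query.
    """
    mixed = self_attention(bu_queries + td_queries)
    return mixed[len(bu_queries):]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def mean_bu_similarity(queries, bu_queries):
    """Average cosine similarity over all (query, BU query) pairs."""
    sims = [cosine_similarity(q, b) for q in queries for b in bu_queries]
    return sum(sims) / len(sims)


# Toy 2-D queries (hypothetical values): BU points mostly along x, TD along y.
bu = [[1.0, 0.0], [0.9, 0.1]]
td = [[0.0, 1.0], [0.1, 0.9]]

amalgamated = fuse_queries(bu, td)
print("TD vs BU similarity:         ", mean_bu_similarity(td, bu))
print("amalgamated vs BU similarity:", mean_bu_similarity(amalgamated, bu))
```

In this toy setting the amalgamated queries sit closer to the BU queries than the raw TD queries do, because self-attention blends information from both streams. Tracking such an average similarity across training epochs is one plausible way to obtain a curve like the one Figure 5 describes, though the paper's exact procedure may differ.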