A General Close-loop Predictive Coding Framework for Auditory Working Memory

Zhongju Yuan, Geraint Wiggins, Dick Botteldooren

TL;DR

The paper addresses the lack of neural-network models for auditory working memory by introducing a general close-loop predictive coding framework. A two-layer network with learnable memory in the weights writes sequence information during a write phase and recalls it in a read phase using fixed weights, with a close-loop feedback mechanism enhancing recall. The approach is evaluated on two diverse datasets, ESC-50 and LibriSpeech, using 200 ms segments and textual semantic similarity (via CLAP captions and Whisper transcripts) to measure recall fidelity, with results showing semantic similarity scores consistently above 0.7. These findings suggest the framework can robustly preserve meaningful auditory representations across environmental sounds and speech, highlighting a biologically inspired path for memory formation and retrieval in neural systems.

Abstract

Auditory working memory is essential for various daily activities, such as language acquisition and conversation. It involves the temporary storage and manipulation of information that is no longer present in the environment. While it has been studied extensively in neuroscience and cognitive science, research on modeling it within neural networks remains limited. To address this gap, we propose a general framework based on a close-loop predictive coding paradigm to perform short auditory signal memory tasks. The framework is evaluated on two widely used benchmark datasets for environmental sound and speech, demonstrating high semantic similarity across both datasets.
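The evaluation compares text derived from the original audio with text derived from the recalled audio. This summary does not say how the similarity score itself is computed, so the following is only a minimal sketch that assumes cosine similarity between text embeddings. The `embed` function is a hypothetical bag-of-words stand-in for a real sentence encoder, and `semantic_similarity` is an illustrative name, not one taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(original_text, recalled_text):
    """Assumed score: cosine similarity of the two text embeddings."""
    return cosine_similarity(embed(original_text), embed(recalled_text))


# Text for the original clip versus text for the recalled clip
# (both strings are made up for illustration).
score = semantic_similarity("a dog is barking", "a dog is barking loudly")
print(round(score, 3))
```

With a real encoder the score would reflect meaning rather than word overlap, but the comparison has the same shape: one text per audio clip, one score per original and recalled pair.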

Paper Structure

This paper contains 11 sections, 6 equations, 5 figures.

Figures (5)

  • Figure 1: Memory procedure for an audio sequence. (a) shows the write process, where weight matrices are trained in a supervised manner to minimize error terms. (b) illustrates the read process, during which the trained weights remain fixed. The model uses the most recently recalled segment to update the hidden state, improving retrieval of the next segment.
  • Figure 2: Waveform comparisons of original and recalled audio for two speech examples. The upper panels show the recalled waveforms without using the close-loop approach, while the lower panels show the recalled waveforms using the close-loop approach. The close-loop method significantly improves waveform accuracy and alignment, resulting in consistent recognition performance.
  • Figure 3: Classification probability distribution heatmaps for the original and recalled audio in the ESC-50 dataset. (a) Classification probability heatmap for the original audio, showing high accuracy with diagonal elements near 1; a few semantically ambiguous class pairs, such as 'helicopter' and 'airplane' or 'crickets' and 'insects', show slight deviations. (b) Classification probability heatmap for the recalled audio, where most diagonal elements retain probabilities above 0.7.
  • Figure 4: Performance analysis of the environmental sound memory model. Boxplot of semantic similarities (SS) between original and recalled audio pairs in the ESC-50 dataset. The x-axis represents the 50 class names, while the y-axis shows the semantic similarity scores. Black dots indicate the distribution of similarity scores for each class.
  • Figure 5: Performance analysis of the speech memory model. "Identification" denotes the Whisper recognition result for the ground-truth audio, while "recallability" denotes the result obtained from the same model for the recalled audio. Panel (a) displays the semantic similarity (SS) as a box plot, highlighting variations in performance. Panel (b) illustrates cases with high SS, indicating successful retention of both semantic and acoustic features. In contrast, Panel (c) presents a negative sample with lower SS, demonstrating a failure to retain meaning despite preserving some degree of acoustic fidelity.
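The write and read procedure described in the Figure 1 caption can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' model: the recurrent update in `step`, the fixed random matrices `W_in` and `W_rec`, the closed-form least-squares fit of the readout `W_out`, and all dimensions are hypothetical choices made only to show weights being fitted during writing and then held fixed while each recalled segment is fed back during reading.

```python
import numpy as np


def step(h, x_prev, W_in, W_rec):
    """Hidden-state update driven by the previous segment (assumed form)."""
    return np.tanh(W_in @ x_prev + W_rec @ h)


def write(segments, h0, x0, W_in, W_rec):
    """Write phase: fit readout weights so each hidden state predicts its segment."""
    h, x_prev, states = h0, x0, []
    for x in segments:
        h = step(h, x_prev, W_in, W_rec)
        states.append(h)
        x_prev = x  # the true segment drives the state while writing
    H = np.stack(states)
    # Least squares stands in for error-minimizing training of the weights.
    W_out, *_ = np.linalg.lstsq(H, segments, rcond=None)
    return W_out.T


def read(n_segments, h0, x0, W_in, W_rec, W_out):
    """Read phase: weights stay fixed; each recalled segment is fed back."""
    h, x_prev, recalled = h0, x0, []
    for _ in range(n_segments):
        h = step(h, x_prev, W_in, W_rec)
        x_hat = W_out @ h
        recalled.append(x_hat)
        x_prev = x_hat  # close the loop with the recalled segment
    return np.stack(recalled)


rng = np.random.default_rng(0)
D, N, T = 16, 64, 8  # segment size, hidden size, sequence length (all assumed)
segments = rng.standard_normal((T, D))  # random stand-in for audio segments
W_in = rng.standard_normal((N, D)) / np.sqrt(D)
W_rec = 0.9 * rng.standard_normal((N, N)) / np.sqrt(N)
h0, x0 = rng.standard_normal(N), np.zeros(D)

W_out = write(segments, h0, x0, W_in, W_rec)
recalled = read(T, h0, x0, W_in, W_rec, W_out)
max_err = np.max(np.abs(recalled - segments))
print(f"max recall error: {max_err:.2e}")
```

Because the toy sequence is short relative to the hidden size, the readout can fit it almost exactly, so feeding back the recalled segment reproduces the hidden-state trajectory seen during writing. Real audio segments and a learned two-layer network would not recall this cleanly.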