Cognitively-Inspired Emergent Communication via Knowledge Graphs for Assisting the Visually Impaired

Ruxiao Chen, Dezheng Han, Wenjie Han, Shuaishuai Guo

TL;DR

This work tackles the need for fast yet semantically rich guidance for visually impaired users in dynamic environments by introducing VAG-EC, a cognitively inspired emergent communication framework that grounds messages in knowledge graphs. Scenes are converted into object-centric graphs via SAM segmentation and proximity-based edges, with Graph Convolutional Networks and attention producing a structured representation that informs compact symbolic messages. Messages are learned through a Lewis-style referential game using a differentiable Gumbel-Softmax relaxation, enabling end-to-end optimization. Across varying vocabulary sizes, VAG-EC achieves higher Context Independence (CI) and Topographic Similarity (TopSim) than baselines, and exhibits more balanced token usage, reflecting stronger semantic grounding and interpretability. The approach demonstrates potential for real-time, human-aligned assistive modalities, though broader-domain and human-in-the-loop evaluations remain as future directions.
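The Gumbel-Softmax relaxation mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gumbel_softmax`, the temperature `tau`, and the example logits are assumptions chosen for illustration. It shows only how a speaker's token logits can be turned into a relaxed, near-one-hot sample; the gradient flow itself requires an autodiff framework and is not shown.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample over a vocabulary (illustrative sketch).

    logits: unnormalized scores, one per token (hypothetical speaker output).
    tau:    temperature; smaller values push the sample closer to one-hot.
    """
    rng = rng or random.Random()
    perturbed = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        perturbed.append((logit + gumbel_noise) / tau)

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(perturbed)
    exps = [math.exp(x - peak) for x in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
    token = max(range(len(probs)), key=probs.__getitem__)  # discrete symbol
    print(probs, token)
```

In a differentiable setting, the relaxed vector stands in for the discrete token during the backward pass, which is what makes end-to-end training of speaker and listener possible.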

Abstract

Assistive systems for visually impaired individuals must deliver rapid, interpretable, and adaptive feedback to facilitate real-time navigation. Current approaches face a trade-off between latency and semantic richness: natural language-based systems provide detailed guidance but are too slow for dynamic scenarios, while emergent communication frameworks offer low-latency symbolic languages but lack semantic depth, limiting their utility in tactile modalities like vibration. To address these limitations, we introduce a novel framework, Cognitively-Inspired Emergent Communication via Knowledge Graphs (VAG-EC), which emulates human visual perception and cognitive mapping. Our method constructs knowledge graphs to represent objects and their relationships, incorporating attention mechanisms to prioritize task-relevant entities, thereby mirroring human selective attention. This structured approach enables the emergence of compact, interpretable, and context-sensitive symbolic languages. Extensive experiments across varying vocabulary sizes and message lengths demonstrate that VAG-EC outperforms traditional emergent communication methods in Topographic Similarity (TopSim) and Context Independence (CI). These findings underscore the potential of cognitively grounded emergent communication as a fast, adaptive, and human-aligned solution for real-time assistive technologies. Code is available at https://github.com/Anonymous-NLPcode/Anonymous_submission/tree/main.
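As a rough illustration of the TopSim metric named above: Topographic Similarity is commonly computed as the Spearman correlation between pairwise distances in meaning space and pairwise distances in message space. The sketch below follows that common definition. The choice of Hamming distance for both spaces and the toy attribute tuples are assumptions, and the paper's exact distance functions may differ.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 by convention if either side is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def topsim(meanings, messages):
    """Spearman correlation between meaning distances and message distances."""
    pairs = list(combinations(range(len(meanings)), 2))
    d_meaning = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    d_message = [hamming(messages[i], messages[j]) for i, j in pairs]
    return pearson(average_ranks(d_meaning), average_ranks(d_message))


if __name__ == "__main__":
    # Toy, hypothetical data: two binary attributes and two-symbol messages.
    meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]
    compositional = [(3, 7), (3, 8), (4, 7), (4, 8)]
    print(topsim(meanings, compositional))
```

A language in which each attribute maps to its own symbol position scores high under this measure, while a language that ignores the input collapses the message distances and, under the convention used here, scores zero.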

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures.

Figures (5)

  • Figure 1: Pipeline for constructing a knowledge graph from a dining image. The input image is segmented using Segment Anything, followed by object extraction and feature encoding. Node attributes are derived from object embeddings, while edge attributes are computed based on spatial proximity, forming a structured graph representation of the scene.
  • Figure 2: Overview of the proposed VAG-EC framework. Visual scenes are segmented and converted into knowledge graphs, which are encoded by the speaker to generate discrete messages. The listener decodes the message to identify the correct scene, with human feedback guiding end-to-end optimization.
  • Figure 3: Quantitative comparison between the baseline EC and our proposed VAG-EC framework. (a) to (c) show token-level statistics: Zipf distribution, cumulative token coverage, and frequency histogram. (d) reports task-level performance across three metrics (Accuracy, TopSim, and Context Independence) under varying vocabulary sizes.
  • Figure 4: Example of a generated dining scenario from the synthetic dataset.
  • Figure 5: Example of a real-world dining scenario used for testing.
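
The proximity-based edge construction described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: objects are reduced to hypothetical 2D centroids (in practice these would come from segmentation masks), and the `radius` threshold and the function name `build_proximity_graph` are invented for the example.

```python
import math


def build_proximity_graph(centroids, radius):
    """Connect objects whose centroids lie within `radius` of each other.

    centroids: dict mapping an object label to its (x, y) centroid.
    Returns a dict mapping (label_a, label_b) to the Euclidean distance,
    which serves as the edge attribute.
    """
    labels = list(centroids)
    edges = {}
    for index, a in enumerate(labels):
        for b in labels[index + 1:]:
            distance = math.dist(centroids[a], centroids[b])
            if distance <= radius:
                edges[(a, b)] = distance
    return edges


if __name__ == "__main__":
    # Hypothetical dining-scene objects with pixel-space centroids.
    scene = {"cup": (0, 0), "plate": (3, 4), "fork": (100, 100)}
    print(build_proximity_graph(scene, radius=10))
```

Storing the distance on each edge keeps the spatial relationship available to later graph layers, rather than reducing it to a binary adjacency.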