I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Giulio Antonio Abbo, Tony Belpaeme

TL;DR

This work addresses the challenge of making dialogue agents contextually aware by grounding responses in real-time visual input. It implements a vision-enabled dialogue system using GPT-4 as the core LLM, a four-component architecture, and a frame-summarisation strategy that keeps at most $n$ frames in the prompt and, once that limit is reached, summarises the first $m$ (e.g., $n=4$, $m=3$) to control prompt size. Six Furhat robot interactions across varied environments demonstrate improved scene understanding, environmental grounding, and context-aware responses, while revealing limitations in memory, response latency, and temporal resolution. The paper contributes a concrete multimodal dialogue framework, practical prompting guidelines, and empirical ablation insights that inform future design of real-time vision-grounded human–robot dialogue systems.

Abstract

In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.

Paper Structure (12 sections, 3 figures)

This paper contains 12 sections, 3 figures.

Figures (3)

  • Figure 1: Overview of the components of the system: the dialogue manager receives inputs from the frame and dialogue processing components, and uses an LLM to produce the outputs.
  • Figure 2: Example of the summarisation process with $n=3$ and $m=2$. In the first step, a frame is added; the number of frames is now $n$, so the algorithm summarises the first two. A dialogue line is then added, together with the system response, followed by another frame. In the final step, FRAME 5 triggers another summarisation. This time only FRAME 3 is summarised, as including the following frame would disrupt the ordering of the elements. The elements used to obtain the summary at each step are shown in grey; note that the previous part of the conversation is included.
  • Figure 3: Frames from the videos of the interactions showing the kitchen, bedroom, bathroom and entrance environments.
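
The summarisation process described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Item` type, the `add_item` function, and the `stub_summarise` placeholder (which stands in for the LLM call) are hypothetical names introduced here. The sketch also assumes that a summary replaces the summarised frames in place and that earlier summaries remain as separate items, which the caption does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    kind: str  # "frame", "dialogue", or "summary"
    text: str


def stub_summarise(context: List[Item], frames: List[Item]) -> str:
    # Hypothetical placeholder for the LLM call. `context` holds the
    # earlier part of the conversation, which the caption says is
    # included when the summary is produced.
    return "summary of " + ", ".join(f.text for f in frames)


def add_item(
    history: List[Item],
    item: Item,
    n: int,
    m: int,
    summarise: Callable[[List[Item], List[Item]], str] = stub_summarise,
) -> List[Item]:
    """Append an item; once n frames are present, summarise the first ones."""
    history.append(item)
    frame_positions = [i for i, it in enumerate(history) if it.kind == "frame"]
    if item.kind != "frame" or len(frame_positions) < n:
        return history

    # Take up to m frames starting at the first frame, stopping early
    # if a non-frame element intervenes, so the ordering is preserved.
    start = frame_positions[0]
    end = start
    while end < len(history) and end - start < m and history[end].kind == "frame":
        end += 1

    summary_text = summarise(history[:start], history[start:end])
    history[start:end] = [Item("summary", summary_text)]
    return history
```

Replaying the caption's example with $n=3$ and $m=2$, adding FRAME 3 summarises FRAME 1 and FRAME 2 together. Adding FRAME 5 later summarises only FRAME 3, because a dialogue exchange sits between FRAME 3 and FRAME 4.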