A Graph-to-Text Approach to Knowledge-Grounded Response Generation in Human-Robot Interaction

Nicholas Thomas Walker, Stefan Ultes, Pierre Lison

TL;DR

This work introduces a dynamic knowledge-graph-based framework for knowledge-grounded response generation in human-robot interaction. A traversal-based graph-to-text verbalization converts graph content into natural language, which is then included in prompts to large language models (GPT-4 and LLaMA-2) to generate robot responses; the verbalization parameters are tuned with Wizard-of-Oz data. In a user study, the graph-verbalized approach improves the perceived factuality of robot responses compared with a semantic-triples baseline, particularly for GPT-4, while highlighting challenges related to perception modules and model latency. The study demonstrates the practical viability of graph-to-text verbalization in embodied AI and outlines limitations and directions for future work, including retrieval-based generation and richer multimodal grounding.

Abstract

Knowledge graphs are often used to represent structured information in a flexible and efficient manner, but their use in situated dialogue remains under-explored. This paper presents a novel conversational model for human-robot interaction that rests upon a graph-based representation of the dialogue state. The knowledge graph representing the dialogue state is continuously updated with new observations from the robot sensors, including linguistic, situated and multimodal inputs, and is further enriched by other modules, in particular for spatial understanding. The neural conversational model employed to respond to user utterances relies on a simple but effective graph-to-text mechanism that traverses the dialogue state graph and converts the traversals into a natural language form. This conversion of the state graph into text is performed using a set of parameterized functions, and the values for those parameters are optimized based on a small set of Wizard-of-Oz interactions. After this conversion, the text representation of the dialogue state graph is included as part of the prompt of a large language model used to decode the agent response. The proposed approach is empirically evaluated through a user study with a humanoid robot that acts as a conversation partner, in order to assess the impact of the graph-to-text mechanism on response generation. After moving the robot along a tour of an indoor environment, participants interacted with it using spoken dialogue and evaluated how well it was able to answer questions about what it observed during the tour. User scores show a statistically significant improvement in the perceived factuality of the robot responses when the graph-to-text approach is employed, compared to a baseline using inputs structured as semantic triples.
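
The pipeline described in the abstract (traverse the dialogue state graph, verbalize the traversal, place the text in the prompt) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the toy edges, the relation names, the sentence templates, the `max_hops` parameter and the prompt wording are hypothetical and are not taken from the paper, whose actual parameterized functions are not given in this summary.

```python
from collections import deque

# Hypothetical dialogue-state graph stored as (subject, relation, object) edges.
EDGES = [
    ("robot", "visited", "kitchen"),
    ("image_1", "located_in", "kitchen"),
    ("laptop", "in", "image_1"),
    ("user", "said", "hello"),
]

# Hypothetical per-relation sentence templates (illustrative only).
TEMPLATES = {
    "visited": "The {s} visited the {o}.",
    "located_in": "{s} was taken in the {o}.",
    "in": "A {s} is visible in {o}.",
}


def traverse(edges, start, max_hops):
    """Breadth-first traversal: collect edges within max_hops of start."""
    neighbours = {}
    for s, r, o in edges:
        # Index each edge under both endpoints so it can be followed either way.
        neighbours.setdefault(s, []).append((s, r, o, o))
        neighbours.setdefault(o, []).append((s, r, o, s))
    seen_nodes = {start}
    collected = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for s, r, o, other in neighbours.get(node, []):
            if (s, r, o) not in collected:
                collected.append((s, r, o))
            if other not in seen_nodes:
                seen_nodes.add(other)
                queue.append((other, depth + 1))
    return collected


def verbalize(edges):
    """Turn traversed edges into sentences, one template per relation."""
    fallback = "{s} {r} {o}."
    return " ".join(
        TEMPLATES.get(r, fallback).format(s=s, r=r, o=o) for s, r, o in edges
    )


def build_prompt(state_text, user_utterance):
    """Insert the verbalized state into a (hypothetical) prompt layout."""
    return (
        "You are a robot. Known facts:\n"
        f"{state_text}\n"
        f"User: {user_utterance}\n"
        "Robot:"
    )


if __name__ == "__main__":
    facts = verbalize(traverse(EDGES, "kitchen", max_hops=2))
    print(build_prompt(facts, "What did you see in the kitchen?"))
```

In this sketch `max_hops` stands in for the kind of verbalization parameter that could be tuned on Wizard-of-Oz data: a larger value pulls more of the graph into the prompt, a smaller value keeps the prompt short.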

Paper Structure

This paper contains 42 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Diagram of the proposed approach. The red path in the graph indicates a traversal which is transformed to a natural language document via a parameterized function. The resulting text description is then inserted into the prompt for the language model.
  • Figure 2: Example of subgraph where an image is assigned a location, and a "laptop" entity detected in the image data is created as a node with an "in" relation to the image.
  • Figure 3: Example path drawn from spatial coordinates, with raw coordinates on the left and the approximated path on the right. The orange points mark coordinates associated with the room label, blue points are those labeled as located in the hallway. Approximate forward movements and in-place rotations are highlighted alongside the robot path in red and green, respectively.
  • Figure 4: Comparison of a dialogue graph represented with triples versus verbalization.
  • Figure 5: The first author in a dialogue with the robot. A video of a short example tour may be found at https://youtu.be/a52zBcfVgS8
  • ...and 2 more figures
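
To make the contrast in Figure 4 concrete, the sketch below serializes a small subgraph shaped like the one in Figure 2 in two ways: as semantic triples (the baseline input format) and as a verbalized sentence. The node names, the "location" relation label, the "office" value and the sentence wording are assumptions for illustration only; the paper's actual triple format and verbalization output are not shown in this summary.

```python
# Hypothetical subgraph modelled on Figure 2: an image with a location,
# and a "laptop" entity linked to the image by an "in" relation.
SUBGRAPH = [
    ("image_1", "location", "office"),
    ("laptop", "in", "image_1"),
]


def as_triples(edges):
    """Baseline-style serialization: one (subject, relation, object) per line."""
    return "\n".join(f"({s}, {r}, {o})" for s, r, o in edges)


def as_text(edges):
    """Verbalized serialization: follow entity -> image -> location and
    merge the two edges into a single sentence per detected entity."""
    location_of = {s: o for s, r, o in edges if r == "location"}
    sentences = []
    for s, r, o in edges:
        if r != "in":
            continue
        place = location_of.get(o)
        if place is None:
            sentences.append(f"The robot saw a {s}.")
        else:
            sentences.append(f"The robot saw a {s} in the {place}.")
    return " ".join(sentences)


if __name__ == "__main__":
    print(as_triples(SUBGRAPH))
    print(as_text(SUBGRAPH))
```

The verbalized form drops the intermediate image node, which illustrates one possible reason a text description can be easier for a language model to use than raw triples; whether the paper's verbalization does the same is not stated here.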