CLIO: A Tour Guide Robot with Co-speech Actions for Visual Attention Guidance and Enhanced User Engagement

Yuxuan Chen, Ian Leong Ting Lo, Bao Guo, Netitorn Kawmali, Chun Kit Chan, Ruoyu Wang, Jia Pan, Lei Yang

TL;DR

The paper addresses the challenge of directing visitors' visual attention during guided tours by introducing CLIO, a tour-guide robot that coordinates co-speech actions with narration through an LLM-generated action queue. CLIO combines eye contact, deictic gestures, and laser pointing with narration. It is built on a ROS2-based architecture, with perception (MediaPipe, YOLOv11) and navigation (FAST-LIO2, Nav2) modules that ground the actions and synchronize them with the script. A 28-participant study shows that CLIO improves perceived lifelike quality, engagement, and directness of visual attention compared to audio-only guidance, a result supported by objective eye-tracking metrics. The work presents a practical, end-to-end framework for enhancing museum tours through coordinated audio-gestural guidance and validates its effectiveness in improving visitor engagement and attention guidance.

Abstract

While audio guides can offer rich information about an exhibit, it is challenging for visitors to focus on specific exhibit details based only on a verbal description. We present CLIO, a tour guide robot with co-speech actions that direct visitors' visual attention and thus enhance overall user engagement in a guided tour. CLIO is equipped with actions designed to engage visitors. It builds eye contact with a visitor by tracking the visitor's face and blinking its eyes, or orients the visitor's attention with its head movement and laser pointer. We further use a Large Language Model (LLM) to coordinate the designed actions with a given narrative script for an exhibition. We conducted a user study to evaluate the CLIO system in a mock-up exhibition of historical photographs. We collected feedback from questionnaires and quantitative data from a mobile eye tracker. Experimental results validated that the engaging actions are well designed and demonstrated their efficacy in guiding visitors' visual attention. The results also showed that CLIO achieved higher engagement than the baseline system with audio guidance only.
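
To make the idea of coordinating actions with a narrative script more concrete, the following is a minimal sketch of an action queue in which each section of the script carries a set of actions that run concurrently, and sections run one after another. It is not the authors' implementation: the names `Action`, `Section`, `run_action`, `run_tour`, and `demo`, the action labels, and the durations are all hypothetical, and `asyncio.sleep` stands in for real speech, gesture, and laser commands.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Action:
    # Hypothetical action label, e.g. "speak:intro" or "laser_point".
    name: str
    # Illustrative duration in seconds; a real robot would report completion.
    duration: float


@dataclass
class Section:
    # Actions in one section of the script are executed concurrently.
    actions: List[Action] = field(default_factory=list)


async def run_action(action: Action, log: List[Tuple[str, str]]) -> None:
    # Record start and end so the resulting schedule can be inspected.
    log.append(("start", action.name))
    await asyncio.sleep(action.duration)  # placeholder for the real command
    log.append(("end", action.name))


async def run_tour(queue: List[Section], log: List[Tuple[str, str]]) -> None:
    # Sections run in order; the next one starts only after every
    # action of the current section has finished.
    for section in queue:
        await asyncio.gather(*(run_action(a, log) for a in section.actions))


def demo() -> List[Tuple[str, str]]:
    # A hand-written queue standing in for one generated by an LLM.
    queue = [
        Section([Action("speak:intro", 0.02), Action("eye_contact", 0.01)]),
        Section([Action("speak:detail", 0.02), Action("laser_point", 0.01)]),
    ]
    log: List[Tuple[str, str]] = []
    asyncio.run(run_tour(queue, log))
    return log


if __name__ == "__main__":
    for event in demo():
        print(event)
```

In this sketch the narration and its accompanying gesture start together within a section, which is one simple way to keep a pointing action aligned with the sentence it supports.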

Paper Structure

This paper contains 21 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: CLIO, a tour guide robot, is developed to enhance user engagement in exhibition tours. The system offers coordinated audio-gestural guidance with engaging actions, such as eye contact (a, c) and pointing at the exhibit (b). Our system uses an LLM to generate a queue of actions (e). The tour manager schedules the queue of co-speech actions (solid/dashed border indicates executed or pending actions) (f). The action manager executes concurrent actions, e.g., the 2nd section of (e), while the navigation module awaits Action $K$ to perform.
  • Figure 2: Tour guide system architecture.
  • Figure 3: CLIO Hardware. The robot is equipped with a head -- an LED screen that displays a pair of animated eyes. An RGB-D camera is mounted on the head. The body part houses a laser pointer on its left, a LiDAR sensor, an on-board computer, and an audio speaker. A wheel-legged robot base is adopted to provide an anthropomorphic image to the audience.
  • Figure 4: Experimental setup. (A) Two slightly different tour designs. (B) A mobile eye tracker (MET) was used to capture participants' eye gaze during the tour. (C) The MET.
  • Figure 5: Box plot of questionnaire results. All scales show statistical significance with $p < 0.001$ (***).
  • ...and 2 more figures