Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories

Hikaru Asano, Ryo Yonetani, Taiki Sekii, Hiroki Ouchi

TL;DR

This work introduces contextual captioning of human movement trajectories in retail and proposes Text2Traj2Text, a learning-by-synthesis framework that couples instruction-tuned LLMs for data generation with a Traj2Text captioning model. Text2Traj synthesizes diverse, realistic captions and corresponding trajectories via a hierarchical planner, action plans, and item lists; Traj2Text then fine-tunes a language model on this data, with paraphrase-based augmentation. Experiments show the method achieves state-of-the-art ROUGE and BERT Score on synthesized trajectories and generalizes well to real human trajectories and unseen store maps, often with far fewer parameters than large LLMs. The results suggest scalable, privacy-conscious utility for retailers in applications such as targeted advertising and inventory management. The authors also acknowledge limitations around long sequences and potential hallucinations, which call for post-processing and opt-out options.

Abstract

This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning the possible contexts behind shoppers' trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is to leverage large language models to synthesize a diverse and realistic collection of contextual captions, together with the corresponding movement trajectories on a store map. Despite being trained on fully synthesized data, the captioning model generalizes well to trajectories and captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERT Score metrics.
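
To make the ROUGE comparison concrete, the following is a minimal sketch of a ROUGE-1 F1 score between a generated caption and a reference caption. It is not the authors' evaluation code: the function name `rouge1_f1`, the whitespace tokenization, and the example captions are assumptions made for illustration, and standard ROUGE implementations typically add stemming and further variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two captions (simplified ROUGE-1).

    Assumption: lowercased whitespace tokens, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both captions (multiset intersection).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions, not taken from the paper's dataset.
generated = "the shopper buys milk"
reference = "the shopper buys bread"
score = rouge1_f1(generated, reference)
```

Here three of the four tokens match in each direction, so precision and recall are both 0.75 and the F1 score is 0.75. BERT Score differs in that it compares contextual embeddings rather than exact tokens, so it can credit paraphrases that this overlap measure would miss.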

Paper Structure (36 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Contextual Captioning of Human Movement Trajectories. Given a human movement trajectory associated with semantic information such as nearby items and actual purchases in a retail store, we aim to produce contextual captions that best explain the possible contexts behind it.
  • Figure 2: Text2Traj2Text Framework. (1) Text2Traj: We leverage LLMs to synthesize contextual captions and their instances as concrete action plans, item lists, and in-store trajectories. (2) Traj2Text: We fine-tune a language model with the synthesized data to be able to produce contextual captions from trajectory data.
  • Figure 3: Visual user interface used to collect human-created trajectories. The green square represents the current position. Information on the closest item is shown in the upper right corner, and the list of items added to the cart is shown in the lower right corner. The caption to be followed is presented at the bottom of the screen.
  • Figure 4: Prompt used for Step 1 in the Text2Traj phase.
  • Figure 5: Prompt used for Step 2 in the Text2Traj phase.
  • ...and 1 more figure