The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows
Litao Yan, Andrew Head, Ken Milne, Vu Le, Sumit Gulwani, Chris Parnin, Emerson Murphy-Hill
TL;DR
InvisibleMentor addresses the discoverability gap in complex software by using vision-grounded task reflection to infer user workflows directly from screen recordings, without internal instrumentation or explicit prompts. Its two-phase pipeline first reconstructs action sequences and spreadsheet context with a vision-language model, then uses a language model to produce structured, high-fidelity, step-by-step recommendations grounded in the observed behavior. Evaluation shows strong action-recovery accuracy ($F_1 \approx 0.905$; inter-rater AC1 $\approx 0.927$), and users find behavior-grounded suggestions more actionable and more helpful for learning than those of a prompt-based baseline, with guidance delivered in real time or after the task supporting reflective learning. The results suggest that assistance based on visual activity could apply across domains, offering concrete, context-aware improvements while reducing the need for logs, APIs, or explicitly stated user intent.
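As a reminder of what the action-recovery $F_1$ score summarizes, the standard definition is given below. The summary above does not say how inferred actions are matched to ground-truth actions, so $TP$ (correctly recovered actions), $FP$ (spurious inferred actions), and $FN$ (missed actions) are illustrative symbols rather than the authors' notation.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```

Under this definition, a score near 0.9 means that both spurious and missed actions are rare relative to correctly recovered ones.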
Abstract
Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In our evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, more tailored, and more helpful for learning and improvement than those of a prompt-based spreadsheet assistant.
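To make the two-stage idea concrete, here is a minimal sketch of the hand-off between the stages. It is not the authors' implementation. It assumes that stage one has already produced a list of reconstructed actions, and it replaces the stage-two language model with a simple run-length rule that flags repetitive edits. The `Action` and `Suggestion` schemas, the `min_run` threshold, and the recommendation text are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class Action:
    """One reconstructed user action (hypothetical schema, not from the paper)."""
    kind: str    # e.g. "edit_cell", "copy", "paste"
    target: str  # e.g. a cell reference such as "B2"


@dataclass(frozen=True)
class Suggestion:
    """A structured suggestion grounded in observed actions (hypothetical schema)."""
    issue: str
    evidence: List[str]
    recommendation: str


def find_repetitive_runs(actions: List[Action], min_run: int = 3) -> List[List[Action]]:
    """Return runs of consecutive actions of the same kind with length >= min_run."""
    runs = []
    for _, group in groupby(actions, key=lambda a: a.kind):
        run = list(group)
        if len(run) >= min_run:
            runs.append(run)
    return runs


def suggest(actions: List[Action], min_run: int = 3) -> List[Suggestion]:
    """Stand-in for stage two: a fixed rule instead of a language model."""
    return [
        Suggestion(
            issue=f"{len(run)} consecutive '{run[0].kind}' actions",
            evidence=[a.target for a in run],
            recommendation=(
                "Consider a bulk operation (for example fill-down or a formula) "
                "instead of repeating the same edit."
            ),
        )
        for run in find_repetitive_runs(actions, min_run)
    ]
```

Each suggestion keeps the cells that triggered it in `evidence`, so a recommendation stays tied to what the user was observed doing. In a pipeline like the one described above, the rule in `suggest` would be replaced by a language-model call that receives the reconstructed actions and context.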
