The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows

Litao Yan, Andrew Head, Ken Milne, Vu Le, Sumit Gulwani, Chris Parnin, Emerson Murphy-Hill

TL;DR

InvisibleMentor addresses the discoverability gap in complex software by using vision-grounded task reflection to infer user workflows directly from screen recordings, avoiding internal instrumentation or explicit prompts. The two-phase pipeline first reconstructs action sequences and spreadsheet context with a vision-language model, then leverages a language model to produce structured, high-fidelity, step-by-step recommendations grounded in observed behavior. Evaluation shows strong action-recovery accuracy ($F1\approx0.905$; inter-rater AC1 $\approx0.927$) and that users find behavior-grounded suggestions more actionable and learning-friendly than prompt-based baselines, with real-time or post-task guidance improving reflective learning. The approach demonstrates that visual activity–based assistance can be broadly applicable across domains, offering concrete, context-aware improvements while reducing the need for logs, APIs, or explicit user intent.

Abstract

Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, more tailored, and more helpful for learning and improvement than those of a prompt-based spreadsheet assistant.
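To make the two-stage pipeline more concrete, the following is a minimal Python sketch of the data that might flow between the stages. It is not the authors' implementation: the class names, field names, and the `sample_timestamps` helper are hypothetical, and no model calls are shown. The 5-second sampling interval and the three parts of a suggestion (inefficient workflow, rationale, step-by-step suggestion) are taken from the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import List

# Sampling interval stated in the Figure 1 caption (one frame every 5 seconds).
SAMPLE_INTERVAL_S = 5


def sample_timestamps(duration_s: float, interval_s: int = SAMPLE_INTERVAL_S) -> List[int]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Hypothetical helper: assumes sampling starts at t=0 and never exceeds
    the duration of the recording.
    """
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]


@dataclass
class ObservedAction:
    """Stage 1 output (assumed shape): one action the VLM inferred from a frame."""
    timestamp_s: int
    description: str          # e.g. "typed a value into cell B2"
    spreadsheet_context: str  # e.g. visible headers or the selected range


@dataclass
class Suggestion:
    """Stage 2 output (assumed shape), mirroring the three parts named in Figure 1."""
    inefficient_workflow: List[ObservedAction]
    rationale: str
    steps: List[str]


# Usage with made-up values: a 12-second recording yields three sampled frames.
timestamps = sample_timestamps(12)
actions = [
    ObservedAction(t, "typed a value into a cell", "column B selected")
    for t in timestamps
]
suggestion = Suggestion(
    inefficient_workflow=actions,
    rationale="The same edit is repeated cell by cell.",
    steps=["Select the range.", "Apply the edit to the whole range at once."],
)
```

Keeping the stage 1 output as plain structured records is one way to let the second stage operate on text alone, which matches the paper's description of a language model consuming the reconstructed actions and context.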

Paper Structure

This paper contains 81 sections and 5 figures.

Figures (5)

  • Figure 1: InvisibleMentor's pipeline for generating suggestions from a screen recording. The system operates in two phases. (➊) A vision-language model (VLM) processes screen recordings sampled every 5 seconds to extract structured task representations, including user actions and spreadsheet context (➋). These representations are grouped into workflows and passed to a language model (LLM), which analyzes them to identify inefficiencies and generate actionable suggestions. Each suggestion includes the inefficient workflow sequence, a rationale, and a step-by-step suggestion (➌).
  • Figure 2: User interface of a spreadsheet assistant that provides structured workflow guidance. The assistant appears in a task pane alongside the spreadsheet, following the familiar layout of Excel Copilot to minimize design variance that could influence user study outcomes. The interface consists of five key components. (➊) Prompt ideas offer users high-level suggestions to initiate help-seeking. Selecting an idea appends a message to the conversation and triggers a structured assistant response. To streamline access to suggestions, we replaced the original fifth idea with a dedicated entry point for requesting recommendations. The remaining components are rendered dynamically based on model output: (➋) observed user actions, summarizing recent spreadsheet activity; (➌) workflow limitations, explaining potential inefficiencies; (➍) actionable suggestions, presenting step-by-step improvements; and (➎) a "give me another suggestion" button, allowing users to request alternatives.
  • Figure 3: Participant ratings of InvisibleMentor's suggestions. Stacked bar charts show agreement levels with eight evaluative statements on a 5-point Likert scale (from "Strongly Disagree" to "Strongly Agree"). Across all statements, participants rated InvisibleMentor's suggestions as significantly more useful, understandable, and better aligned with their recent tasks than those from the baseline system.
  • Figure 4: Participants' comparative preferences between InvisibleMentor and the baseline. Participants were asked which tool required less effort, which they would prefer to use in the future overall, and which they trusted more with regard to privacy. Most participants preferred InvisibleMentor for effort and for overall future use, both by a significant margin, while perceived privacy confidence did not differ significantly.
  • Figure 5: Relationship between video duration and VLM processing time. Each dot represents one screen recording session from Evaluation 1. The x-axis represents the duration of the cropped input video, and the y-axis shows the time taken by the VLM to process that session. A fitted linear regression line indicates a strong positive correlation ($R^2 = 0.883$).
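
As an illustration of the statistic reported in Figure 5, the sketch below fits an ordinary least-squares line and computes its coefficient of determination $R^2$ in plain Python. The data points are invented for demonstration and are not the paper's measurements; the function names are likewise assumptions.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot for the least-squares line through (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented example: video durations (x) and processing times (y), in seconds.
durations = [1.0, 2.0, 3.0, 4.0]
processing_times = [1.0, 3.0, 2.0, 4.0]
slope, intercept = fit_line(durations, processing_times)
score = r_squared(durations, processing_times)
```

An $R^2$ near 1 means the fitted line explains most of the variance in processing time; the value of 0.883 in the caption indicates that video duration accounts for most, but not all, of that variance.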