AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors

Saelyne Yang, Jo Vermeulen, George Fitzmaurice, Justin Matejka

TL;DR

AQuA is a pipeline that generates useful answers to tutorial-video questions that include visual anchors. It recognizes UI elements in the visual anchors and generates answers using GPT-4 augmented with that visual information and software documentation.

Abstract

Tutorial videos are a popular help source for learning feature-rich software. However, getting quick answers to questions about tutorial videos is difficult. We present an automated approach for responding to tutorial questions. By analyzing 633 questions found in 5,944 video comments, we identified different question types and observed that users frequently described parts of the video in questions. We then asked participants (N=24) to watch tutorial videos and ask questions while annotating the video with relevant visual anchors. Most visual anchors referred to UI elements and the application workspace. Based on these insights, we built AQuA, a pipeline that generates useful answers to questions with visual anchors. We demonstrate this for Fusion 360, showing that we can recognize UI elements in visual anchors and generate answers using GPT-4 augmented with that visual information and software documentation. An evaluation study (N=16) demonstrates that our approach provides better answers than baseline methods.

Paper Structure (54 sections, 7 figures, 6 tables)

This paper contains 54 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Categories and Types of questions identified from the analysis. Each row represents a category and each block represents a type. Under each block, the areas on the left and right represent live chat and comment data, respectively. Our focus is on Content and User questions, as these are vital for comprehending the tutorial and can often be answered without the involvement of the tutorial authors or software vendor.
  • Figure 2: The system used for collecting questions with visual references. (A) Users can draw anchors on parts of the video they want to ask questions about, (B) which will be added to a temporary gallery. (C) Users can refer to each anchor in their questions.
  • Figure 3: Our Visual Recognition Module is composed of Image Captioning, UI Element Detection, and Optical Character Recognition (OCR). We use BLIP-2 to obtain a general description of the visual anchor in case it contains generic or workspace objects, and the Google Cloud Vision API to detect any textual information in the anchor. For UI Element Detection, we first run UIED to determine if there are multiple UI elements in the anchor. Then, we apply feature matching and template matching between each element in the anchor and those in the UI database. If the matching score exceeds a certain threshold, we retrieve the element's name.
  • Figure 4: The system used in our pipeline evaluation study. The participant can see the question, the video that the question was asked about at the right timestamp and with the visual anchor highlighted, and three generated answers in random order. They were asked to rate each answer in terms of its correctness and helpfulness on a scale of 1 to 7, and select their favorite answer among the three. Optionally, they could provide reasons for selecting their favorite answer.
  • Figure 5: Distribution of Likert scale responses on Correctness and Helpfulness. Full Pipeline shows the highest correctness and helpfulness scores in both batches. Responses of "neither agree nor disagree" are omitted from the chart for clarity and readability.
  • ...and 2 more figures
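
The thresholded matching step described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only template matching (not feature matching), uses pure-Python normalized cross-correlation in place of whatever library the pipeline relies on, and the function names, the 0.9 threshold, and the tiny "Extrude"/"Fillet" templates are all assumptions made for illustration.

```python
from math import sqrt


def ncc(patch, template):
    """Normalized cross-correlation of two equal-sized grayscale grids (lists of rows)."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0:
        # A flat patch or template carries no structure to correlate.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match_score(anchor, template):
    """Slide the template over the anchor and return the highest NCC score found."""
    th, tw = len(template), len(template[0])
    ah, aw = len(anchor), len(anchor[0])
    best = -1.0  # stays at -1.0 if the template does not fit inside the anchor
    for y in range(ah - th + 1):
        for x in range(aw - tw + 1):
            patch = [row[x:x + tw] for row in anchor[y:y + th]]
            best = max(best, ncc(patch, template))
    return best


def identify_ui_element(anchor, ui_database, threshold=0.9):
    """Return the name of the best-matching UI element, or None if no score reaches the threshold.

    `ui_database` is a hypothetical mapping from element name to a grayscale template.
    """
    best_name, best_score = None, threshold
    for name, template in ui_database.items():
        score = best_match_score(anchor, template)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Illustrative data only: 2x2 "icons" and a 3x3 anchor crop containing one of them.
ui_database = {
    "Extrude": [[0, 255], [255, 0]],
    "Fillet": [[255, 255], [0, 0]],
}
anchor = [
    [10, 10, 10],
    [10, 0, 255],
    [10, 255, 0],
]
matched = identify_ui_element(anchor, ui_database)
```

Returning `None` when no template clears the threshold mirrors the caption's condition that a name is retrieved only for sufficiently strong matches. In that case a pipeline could fall back on the caption and OCR outputs alone.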