Bi-Fact: A Bidirectional Factorization-based Evaluation of Intent Extraction from UI Trajectories

Sapir Caduri, Anatoly Efros, Noam Kahlon, Danielle Cohen, Yoni Halpern, Ido Dagan

TL;DR

Bi-Fact introduces a fact-level evaluation for UI-driven intent extraction by decomposing intents into atomic facts and evaluating bidirectional support between gold and predicted intents. It employs an LLM-driven automatic evaluation workflow with three stages of reasoning for predicting, recalling, and assessing factual coverage. On two human-judged datasets, Bi-Fact achieves higher agreement with human judgments than lexical, ROUGE, and NLI baselines, e.g., $F_1 = 0.722$, $\kappa = 0.508$, and $r = 0.781$ ($p < 0.001$) for fact-level correlation. The approach offers a robust, granular metric that can improve downstream personalization and proactive UI assistance by better capturing fine-grained intent details.

Abstract

Evaluating intent extraction from GUIs demands accurate, fine-grained metrics. This paper introduces Bi-Fact, a novel method that decomposes intents into atomic facts and performs bidirectional comparisons to assess precision and recall. Experiments demonstrate Bi-Fact's superior correlation with human judgments compared to existing metrics, establishing a more robust evaluation framework for UI-driven intent understanding.
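
The scoring step described above can be illustrated with a minimal sketch. It assumes the per-fact support judgments have already been produced (in the paper, by an LLM) and are given as booleans. The mapping used here, recall as the share of gold facts supported by the predicted intent and precision as the share of predicted facts supported by the gold intent, is the standard reading of a bidirectional comparison and is an assumption rather than the authors' code. The names `BiFactScores` and `bi_fact_scores` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BiFactScores:
    precision: float
    recall: float
    f1: float


def bi_fact_scores(
    gold_supported: Sequence[bool],
    pred_supported: Sequence[bool],
) -> BiFactScores:
    """Aggregate per-fact support judgments into precision, recall, and F1.

    gold_supported[i] is True if gold fact i is supported by the predicted intent.
    pred_supported[j] is True if predicted fact j is supported by the gold intent.
    Both lists are assumed inputs (e.g. from an LLM judge); this function
    only does the arithmetic.
    """
    # Recall: how much of the gold intent the prediction covers.
    recall = sum(gold_supported) / len(gold_supported) if gold_supported else 0.0
    # Precision: how much of the prediction is backed by the gold intent.
    precision = sum(pred_supported) / len(pred_supported) if pred_supported else 0.0
    # F1: harmonic mean, defined as 0 when both components are 0.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BiFactScores(precision=precision, recall=recall, f1=f1)
```

For instance, if two of three gold facts are supported by the prediction and both predicted facts are supported by the gold intent, `bi_fact_scores([True, True, False], [True, True])` gives a recall of 2/3, a precision of 1.0, and an F1 of 0.8. Returning 0.0 for an empty fact list is a design choice of this sketch, not something the source specifies.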

Paper Structure

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Example of Bi-Fact Evaluation Process: Comparing a Gold (reference) intent with a Predicted (model-generated) intent. Each intent is decomposed into atomic facts. Facts are compared bidirectionally for their support by the other text, as represented with checkmarks (✓) for implied facts and crosses (✗) for missed facts. Recall, precision, and F1 scores are computed from the bidirectional comparison.
  • Figure 2: Example prompt for the fact decomposition task
  • Figure 3: Assessment prompt, part 1: three-step instructions
  • Figure 4: Assessment prompt, part 2: output structure
  • Figure 5: Assessment prompt, part 3: an example (one-shot)