FormGym: Doing Paperwork with Agents

Matthew Toles, Rattandeep Singh, Isaac Song, Zhou Yu

TL;DR

FormGym addresses the challenge of automatic end-to-end form filling on paper-image documents without OCR or DOM access. It introduces a realistic benchmark assembled from FUNSD, XFUND, Form-NLU, and a new Auto Loans dataset, plus an open-vocabulary FieldFinder localizer to guide where to place text. Empirical results show current vision-language agents perform poorly on field localization, while FieldFinder consistently boosts accuracy across models and tasks, reducing localization bottlenecks. The work contributes a benchmark, a localization tool, and release-ready dataset/code, with implications for automating paperwork in legal and financial domains.

Abstract

Completing paperwork is a challenging and time-consuming problem. Form filling is especially challenging in the pure-image domain without access to OCR, typeset PDF text, or a DOM. For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool use. We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user. We find that baseline vision-language agents (VLAs) achieve less than 1% accuracy in most cases, primarily due to poor localization ability. GUI agents also struggle, scoring between 10.6% and 68.0% despite high cost and latency. Therefore, we also contribute FieldFinder, a tool to assist LLMs in identifying where to place text on a form. With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2% to 56%.

Paper Structure

This paper contains 31 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: In the FormGym task, agents are provided with an unfilled source form and a user persona containing answers to fields in the form. The agent must use an editor API to produce the completed source form. Diverse layout semantics indicate suggested fields, such as underlines, colons, check boxes, and table cells.
  • Figure 2: Example forms and field bounding boxes in the FormGym dataset.
  • Figure 3: In the baseline case, the LLM receives an unfilled source form and persona information in its context and attempts to complete the form through a text placement API based on (x, y) coordinates. In the FieldFinder (ours) case, the text placement API is replaced with the FieldFinder tool. Instead of using coordinates as input, the FieldFinder tool takes the name of a field as input, then uses an open-vocabulary object detection model to detect the corresponding field bounding box. In the GUI Agent case, the GUI agent uses an in-browser image editing tool (designed for humans) to place text on the PDF.
  • Figure 4: FieldFinder accuracy vs. fields per form (log scale). Trend line shown for English datasets only. The trend suggests that high numbers of fields per form and multi-lingual forms are the greatest challenges for FieldFinder.
  • Figure 5: Example output by Claude 4 baseline, with FieldFinder (ours), and ground truth in the Auto Loans (Text) One-Shot task. We attribute FieldFinder’s leftward bias to supervision artifacts: training labels mark left-biased value text rather than full fields. Without FieldFinder, Claude appears to struggle more with horizontal spacing than with vertical spacing, assigning most placements an x coordinate of exactly 0.5 (not centered due to figure cropping).
  • ...and 2 more figures
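
The difference between the baseline and FieldFinder conditions described in the Figure 3 caption can be sketched in code. This is a minimal illustration only, not the authors' implementation: the names `FormEditor`, `FieldFinderTool`, `fill_field`, and `toy_detector` are hypothetical, coordinates are assumed to be normalized to [0, 1], and a lookup table stands in for the open-vocabulary object detection model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed convention: (x0, y0, x1, y1) box in normalized [0, 1] page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Placement:
    x: float
    y: float
    text: str


class FormEditor:
    """Hypothetical stand-in for the editor API; records placements instead of drawing."""

    def __init__(self) -> None:
        self.placements: List[Placement] = []

    def place_text(self, x: float, y: float, text: str) -> None:
        # Baseline condition: the model must choose the coordinates itself.
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError("coordinates must be normalized to [0, 1]")
        self.placements.append(Placement(x, y, text))


class FieldFinderTool:
    """Hypothetical wrapper: field name in, detected box out, text placed in the box."""

    def __init__(self, editor: FormEditor, detect: Callable[[str], Box]) -> None:
        self.editor = editor
        self.detect = detect  # stands in for an open-vocabulary detector

    def fill_field(self, field_name: str, text: str) -> Placement:
        x0, y0, x1, y1 = self.detect(field_name)
        # Illustrative choice: anchor at the left edge, vertically centered.
        self.editor.place_text(x0, (y0 + y1) / 2, text)
        return self.editor.placements[-1]


def toy_detector(field_name: str) -> Box:
    # Assumption: a fixed lookup table replaces the real detection model.
    boxes = {
        "Name": (0.25, 0.10, 0.75, 0.14),
        "Date": (0.25, 0.20, 0.50, 0.24),
    }
    return boxes[field_name]


if __name__ == "__main__":
    editor = FormEditor()
    editor.place_text(0.5, 0.5, "Ada Lovelace")  # baseline: raw coordinates
    tool = FieldFinderTool(editor, toy_detector)
    tool.fill_field("Date", "1843-01-01")        # FieldFinder: field name only
    for placement in editor.placements:
        print(placement)
```

In this sketch the model's job in the FieldFinder condition shrinks to naming the field and supplying the value, which is the localization burden the abstract identifies as the main source of baseline error.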