FormGym: Doing Paperwork with Agents
Matthew Toles, Rattandeep Singh, Isaac Song, Zhou Yu
TL;DR
FormGym addresses automatic end-to-end form filling on documents available only as images, without OCR or DOM access. It introduces a realistic benchmark assembled from FUNSD, XFUND, Form-NLU, and a new Auto Loans dataset, plus FieldFinder, an open-vocabulary localizer that tells a model where to place text. Empirical results show that current vision-language agents perform poorly on field localization, while FieldFinder matches or improves accuracy across models and tasks, easing the localization bottleneck. The work contributes a benchmark, a localization tool, and a dataset and code prepared for release, with implications for automating paperwork in legal and financial domains.
Abstract
Completing paperwork is a challenging and time-consuming problem. Form filling is especially difficult in the pure-image domain, without access to OCR, typeset PDF text, or a DOM. For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool use. We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user. We find that baseline vision-language agents (VLAs) achieve less than 1% accuracy in most cases, primarily due to poor localization ability. GUI agents also struggle, scoring between 10.6% and 68.0% despite high cost and latency. To address this, we also contribute FieldFinder, a tool that helps LLMs identify where to place text on a form. With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2% to 56%.
