V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman, Rajat Chawla, Muskaan Kumar, Arkajit Datta, Adarsh Jha, Mukunda NS, Ishaan Bhola
TL;DR
V-Zen presents a dual-resolution multimodal architecture for GUI understanding and precise grounding, integrating a low-resolution visual encoder, a multimodal projection pathway, a visual-expert-enhanced LLM, a high-resolution cross-visual module, and a DINO-based grounding head. The GUIDE dataset complements training with real-world GUI images, action histories, and chain-of-thought annotations to support specialized fine-tuning. Empirical results show strong performance in next-action prediction and grounding, outperforming several state-of-the-art models and demonstrating the value of open-set grounding at higher resolutions. The work aims to enable self-operating GUI agents and invites open collaboration through released code, data, and models to accelerate multimodal GUI automation research.
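The five components named above can be sketched as a data flow. This is only an illustrative sketch, not the released V-Zen code: the class `DualResolutionPipeline`, its field names, the order in which the modules are wired, and the stub callables are all assumptions made here to show how a low-resolution path and a high-resolution path might jointly feed a language model and a grounding head.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Features = List[float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


@dataclass
class DualResolutionPipeline:
    """Hypothetical wiring of the five components listed in the TL;DR."""
    low_res_encoder: Callable[[object], Features]
    projector: Callable[[Features], Features]
    high_res_module: Callable[[object], Features]
    llm: Callable[[str, Features, Features], Tuple[str, Features]]
    grounding_head: Callable[[Features, Features], Box]

    def run(self, low_res_image, high_res_image, prompt: str):
        # Low-resolution path: encode, then project into the LLM token space.
        visual_tokens = self.projector(self.low_res_encoder(low_res_image))
        # High-resolution path: features for fine-grained GUI details.
        high_res_features = self.high_res_module(high_res_image)
        # The LLM sees the prompt plus both visual streams (assumed wiring).
        action_text, hidden = self.llm(prompt, visual_tokens, high_res_features)
        # The grounding head turns the LLM state into a bounding box.
        box = self.grounding_head(hidden, high_res_features)
        return action_text, box


def build_stub_pipeline() -> DualResolutionPipeline:
    # Toy stand-ins so the sketch runs; real modules would be neural networks.
    return DualResolutionPipeline(
        low_res_encoder=lambda img: [0.1, 0.2],
        projector=lambda f: [2 * x for x in f],
        high_res_module=lambda img: [0.5, 0.5, 0.5, 0.5],
        llm=lambda prompt, vis, hi: (f"click: {prompt}", vis + hi[:2]),
        grounding_head=lambda hidden, hi: (0.0, 0.0, hi[0], hi[1]),
    )


if __name__ == "__main__":
    action, box = build_stub_pipeline().run(None, None, "Submit")
    print(action, box)
```

Keeping each component behind a plain callable makes the two outputs explicit: a next-action string from the language model and a bounding box from the grounding head.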
Abstract
Multimodal Large Language Models (MLLMs) have become adept at interpreting and integrating information from diverse modalities such as text, images, and Graphical User Interfaces (GUIs). Even so, understanding and interacting with GUIs remains a significant challenge, which limits how far existing models can raise automation levels. To bridge this gap, this paper presents V-Zen, an MLLM designed for GUI understanding and grounding. Equipped with dual-resolution image encoders, V-Zen establishes new benchmarks in efficient grounding and next-action prediction, laying the groundwork for self-operating computer systems. Complementing V-Zen is the GUIDE dataset, an extensive collection of real-world GUI elements and task-based sequences that supports specialised fine-tuning. Together, V-Zen and GUIDE open the door to intelligent, autonomous computing experiences. We invite the research community to build on this work and help shape the future of GUI automation. In the spirit of open science, our code, data, and model will be made publicly available, paving the way for multimodal dialogue scenarios with intricate and precise interactions.
