A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants

Hans G. W. van Dam

TL;DR

The paper addresses the challenge of making graphical user interfaces accessible through natural language. It proposes an MCP-driven architecture that couples GUI semantics with LLM-based conversational assistants. The architecture centers on the ViewModel of the MVVM pattern and a GUI tree router, which together expose application capabilities as tools and enable coordinated speech and visual feedback for both embedded and OS-level assistants. An evaluation of open-weight LLMs indicates that local deployment is feasible, with acceptable latency on enterprise-grade hardware, and identifies post-processing techniques that improve accuracy. The work highlights the practical value of exposing semantics via MCP for privacy-preserving, reliable multimodal interaction and lays groundwork for future OS-wide super assistants that orchestrate tasks across multiple applications.

Abstract

Advances in large language models (LLMs) and real-time speech recognition now make it possible to issue any graphical user interface (GUI) action through natural language and receive the corresponding system response directly through the GUI. Most production applications were never designed with speech in mind. This article provides a concrete architecture that enables GUIs to interface with LLM-based speech-enabled assistants. The architecture makes an application's navigation graph and semantics available through the Model Context Protocol (MCP). The ViewModel, part of the MVVM (Model-View-ViewModel) pattern, exposes the application's capabilities to the assistant by supplying both tools applicable to a currently visible view and application-global tools extracted from the GUI tree router. This architecture facilitates full voice accessibility while ensuring reliable alignment between spoken input and the visual interface, accompanied by consistent feedback across modalities. It future-proofs apps for upcoming OS super assistants that employ computer use agents (CUAs) and natively consume MCP if an application provides it. To address concerns about privacy and data security, the practical effectiveness of locally deployable, open-weight LLMs for speech-enabled multimodal UIs is evaluated. Findings suggest that recent smaller open-weight models approach the performance of leading proprietary models in overall accuracy and require enterprise-grade hardware for fast responsiveness. A demo implementation of the proposed architecture can be found at https://github.com/hansvdam/langbar
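
To make the tool-exposure idea in the abstract more concrete, the following is a minimal sketch, not the author's implementation (the linked demo repository is the authoritative reference). It assumes a router that owns the screens of an app and offers one global navigation tool per route, plus a ViewModel that offers tools only while its view is visible. All names here (`Tool`, `TransferViewModel`, `GuiTreeRouter`, `fill_transfer_form`, the `/home` and `/transfer` routes) are hypothetical, and the tool descriptor only loosely mirrors the name/description/input-schema shape of an MCP tool; no real MCP SDK is used.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Tool:
    """Hypothetical tool descriptor, loosely shaped like an MCP tool entry."""
    name: str
    description: str
    input_schema: Dict[str, Any]
    handler: Callable[..., str]


@dataclass
class TransferViewModel:
    """Hypothetical ViewModel for a money-transfer screen."""
    amount: float = 0.0
    recipient: str = ""

    def fill_form(self, amount: float, recipient: str) -> str:
        # Updating ViewModel state is what would refresh the bound view.
        self.amount, self.recipient = amount, recipient
        return f"Transfer form filled: {amount} to {recipient}"

    def tools(self) -> List[Tool]:
        # Tools that only make sense while this view is visible.
        schema = {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "recipient": {"type": "string"},
            },
            "required": ["amount", "recipient"],
        }
        return [Tool("fill_transfer_form", "Fill out the transfer form",
                     schema, self.fill_form)]


class GuiTreeRouter:
    """Hypothetical router: maps route paths to screens and their ViewModels."""

    def __init__(self) -> None:
        self._routes: Dict[str, Dict[str, Any]] = {}
        self.current: Optional[str] = None

    def register(self, path: str, description: str,
                 view_model: Optional[Any] = None) -> None:
        self._routes[path] = {"description": description, "vm": view_model}

    def navigate(self, path: str) -> str:
        if path not in self._routes:
            raise KeyError(f"Unknown route: {path}")
        self.current = path
        return f"Navigated to {path}"

    def global_tools(self) -> List[Tool]:
        # One application-global navigation tool per registered route.
        tools = []
        for path, route in self._routes.items():
            name = "navigate_to_" + path.strip("/").replace("/", "_")
            tools.append(Tool(name, f"Open: {route['description']}",
                              {"type": "object", "properties": {}},
                              lambda p=path: self.navigate(p)))
        return tools

    def visible_tools(self) -> List[Tool]:
        # Tools supplied by the ViewModel of the currently visible view.
        if self.current is None:
            return []
        vm = self._routes[self.current]["vm"]
        return vm.tools() if vm is not None else []

    def all_tools(self) -> List[Tool]:
        return self.global_tools() + self.visible_tools()

    def call(self, name: str, **arguments: Any) -> str:
        # Dispatch a tool call chosen by the assistant.
        for tool in self.all_tools():
            if tool.name == name:
                return tool.handler(**arguments)
        raise KeyError(f"Tool not available in current context: {name}")
```

In this sketch an assistant would be handed `all_tools()` on each turn: the navigation tools are always present, while `fill_transfer_form` appears only after a call to `navigate_to_transfer`, which is one simple way to keep spoken commands aligned with what is visible on screen.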

Paper Structure

This paper contains 42 sections, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Simplified flow of a CUA assistant.
  • Figure 2: Hybrid assistance, where enhanced GUIs explicitly expose their semantics through callable tools and conventional GUIs are observed through screenshots and operated on using mouse and keyboard automation.
  • Figure 3: A linguistic expression translated to app action: navigation to the right screen and filling out parameters. Adapted from vanDam2023synergy
  • Figure 4: The result of saying 'Put the triangle to fit exactly into the circle.' in a fictitious speech-enabled drawing application.
  • Figure 5: A futuristic control room application, with many panels. It is both complex and rigid, making speech enablement useful and feasible.
  • ...and 12 more figures