GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs

Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, Biplav Srivastava

TL;DR

GAICo tackles the lack of standardized evaluation tools for GenAI outputs across modalities by providing a deployed, extensible framework. It unifies a broad metric library and a streamlined Experiment API to enable end-to-end, post-hoc comparisons. A case study on composite AI Travel Assistants demonstrates how GAICo isolates plan-generation quality from modality-generation quality, accelerating debugging and reliability improvements. With PyPI deployment and active documentation, GAICo aims to accelerate safer AI deployments by improving reproducibility and development velocity.

Abstract

The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes domains necessitates robust and reproducible evaluation methods. However, practitioners often resort to ad-hoc, non-standardized scripts, because common metrics are frequently unsuitable for specialized, structured outputs (e.g., automated plans, time-series) or for holistic comparison across modalities (e.g., text, audio, and image). This fragmentation hinders comparability and slows AI system development. To address this challenge, we present GAICo (Generative AI Comparator): a deployed, open-source Python library that streamlines and standardizes GenAI output comparison. GAICo provides a unified, extensible framework supporting a comprehensive suite of reference-based metrics for unstructured text, specialized structured data formats, and multimedia (images, audio). Its architecture features a high-level API for rapid, end-to-end analysis, from multi-model comparison to visualization and reporting, alongside direct metric access for granular control. We demonstrate GAICo's utility through a detailed case study evaluating and debugging complex, multi-modal AI Travel Assistant pipelines. GAICo empowers AI researchers and developers to efficiently assess system performance, make evaluation reproducible, improve development velocity, and ultimately build more trustworthy AI systems, aligning with the goal of moving faster and safer in AI deployment. Between its release on PyPI in June 2025 and August 2025, the tool was downloaded over 13K times across all versions, demonstrating growing community interest.

Paper Structure

This paper contains 33 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The multi-modal GAICo workflow. The framework processes answers from multi-modal (text, image, audio) AI models, computes pairwise similarity scores ($s_{kl}$), and constructs several outputs: raw data reports, visualizations, and pass/fail assessments against a threshold $\delta$. The comparison can use any distance function or, conversely, $1 - \text{similarity metric}$.
  • Figure 2: Illustration of GAICo for a composite AI system case study. (Right) A user obtains results tailored to their needs from LLMs and then analyzes them with GAICo. The process starts with three parallel pipelines that generate multi-modal outputs (JSON plan, image, audio). The user then performs a two-part evaluation: (a) Plan Coherence, comparing the JSON outputs of all pipelines against a single Baseline Plan Reference (derived from Pipeline A's output), and (b) Modality Quality, comparing each pipeline's generated image and audio against per-pipeline references. These per-pipeline references are generated by feeding the same prompts/scripts from the respective pipeline's JSON output into baseline specialist generators (from Pipeline A). (Left) The output radar plot shows image and audio fidelity relative to references derived from each pipeline's prompts. For more examples, see the table of example notebooks.
  • Figure 3: Radar plots generated by GAICo comparing pipeline performance across various metrics. (Right) Modality Generation Quality, assessing the specialist models' fidelity against references generated from their own pipeline's prompts. Each axis is a metric, and each line represents a pipeline's averaged score. (Left) Plan Coherence, showing the relative strengths of orchestrator LLMs against a universal human-curated reference.
  • Figure 4: Prompt used by the orchestrator LLM. The prompt enforces JSON output, supplies slot-level requirements for each day, and instructs downstream specialist generators (image and TTS) via embedded sub-prompts.
  • Figure 5: Per-metric comparison of plan coherence across the three pipelines. Higher is better.
  • ...and 2 more figures
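
The workflow summarized in the Figure 1 caption (pairwise scores $s_{kl}$, then pass/fail against a threshold $\delta$) can be sketched in a few lines of Python. This is an illustrative sketch only and does not use GAICo's actual API: the function names `jaccard`, `pairwise_scores`, and `assess`, the sample answers, and the choice of token-set Jaccard as the similarity metric are all assumptions made for this example. It also assumes that $\delta$ is applied to the distance $1 - s_{kl}$, which the caption does not state explicitly.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]; a stand-in for any similarity metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pairwise_scores(answers: dict[str, str]) -> dict[tuple[str, str], float]:
    """Compute s_kl for every unordered pair (k, l) of named answers."""
    return {
        (k, l): jaccard(answers[k], answers[l])
        for k, l in combinations(answers, 2)
    }


def assess(scores: dict[tuple[str, str], float], delta: float) -> dict[tuple[str, str], bool]:
    """Assumed rule: a pair passes when its distance (1 - similarity) is at most delta."""
    return {pair: (1.0 - s) <= delta for pair, s in scores.items()}


# Hypothetical answers; in practice these would come from the models under test.
answers = {
    "reference": "visit the museum then walk to the harbor",
    "model_a": "visit the museum then walk to the harbor",
    "model_b": "take a train to the mountains",
}

scores = pairwise_scores(answers)
verdicts = assess(scores, delta=0.5)

for pair, s in scores.items():
    print(pair, round(s, 3), "pass" if verdicts[pair] else "fail")
```

Swapping `jaccard` for another function with the same signature is enough to change the metric, which mirrors in miniature the extensibility the abstract describes.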