Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making
Oluyemi Enoch Amujo, Shanchieh Jay Yang
TL;DR
This paper addresses benchmarking large foundational models prior to domain-specific fine-tuning by comparing common and domain-specific prompts across cybersecurity, medicine, and finance using Gemma-2B and Gemma-7B. It introduces ThroughCut, an outlier-detection framework that flags response throughput outliers based on conciseness, and a two-layer methodology that measures inference time, response length, throughput, quality, and resource usage. Evaluation is conducted over 16 configurations against a ChatGPT reference baseline. Key findings show that model size and prompt type strongly influence latency and output length, and the relationship between inference time and response length is quantified with correlation coefficients $R$. The 2B models deliver higher throughput, while the 7B models can achieve competitive STS/ROUGE-L scores in some domains. Common prompts yield longer and more variable responses than domain-specific prompts, especially under length restrictions. The framework supports informed fine-tuning decisions and emphasizes multidomain benchmarking to improve the reliability of downstream domain adaptation in AI systems.
Abstract
Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, and compares the results with those for common knowledge queries. It uses a comprehensive methodology to assess foundational models, which includes problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. The study assesses inference time, response length, throughput, quality, and resource utilization, and investigates the correlations between these factors. The results indicate that model size and the type of prompt used for inference significantly influence response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals. In contrast, domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research.
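To make the measured quantities concrete, the following is a minimal sketch of how throughput and the correlation between inference time and response length might be computed. It assumes that throughput is response length divided by inference time and that the correlation is the Pearson coefficient; neither definition is confirmed by the abstract. The function names and sample values are hypothetical, and the sketch does not implement ThroughCut, whose procedure is not described here.

```python
from math import sqrt


def throughput(response_length, inference_time_s):
    """Output units (e.g. tokens or words; an assumption) produced per second."""
    if inference_time_s <= 0:
        raise ValueError("inference time must be positive")
    return response_length / inference_time_s


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements for four prompts (not data from the paper).
inference_times_s = [1.0, 2.0, 3.0, 4.0]
response_lengths = [50, 100, 150, 200]

rates = [throughput(n, t) for n, t in zip(response_lengths, inference_times_s)]
r = pearson_r(inference_times_s, response_lengths)
```

With these illustrative values, response length grows in exact proportion to inference time, so throughput is constant across prompts and the correlation is perfectly positive. Real measurements would scatter around such a trend, and that scatter is the kind of signal an outlier detector operating on throughput would examine.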
