Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making

Oluyemi Enoch Amujo, Shanchieh Jay Yang

TL;DR

This paper addresses the benchmarking of large foundational models prior to domain-specific fine-tuning by comparing common and domain-specific prompts across cybersecurity, medicine, and finance using Gemma-2B and Gemma-7B. It introduces ThroughCut, an outlier-detection technique that identifies response throughput outliers based on their conciseness, and a two-layer methodology that measures inference time, response length, throughput, quality, and resource usage. The evaluation covers 16 configurations and is conducted against a ChatGPT reference baseline. Key findings show that model size and prompt type strongly influence latency and output length, and the paper reports correlation coefficients $R$ between inference time and response length. The 2B models deliver higher throughput, while the 7B models can achieve competitive STS/ROUGE-L scores in some domains. Common prompts yield longer and more variable responses than domain-specific prompts, especially under length restrictions. The framework supports informed fine-tuning decisions and emphasizes multidomain benchmarking to improve the reliability of downstream domain adaptation in AI systems.

Abstract

Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared with domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates two LLMs, Gemma-2B and Gemma-7B, on queries from cybersecurity, medicine, and finance, and compares the results with those for common knowledge queries. The methodology for assessing foundational models includes problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. The study assesses inference time, response length, throughput, quality, and resource utilization, and investigates the correlations between these factors. The results indicate that model size and the type of prompt used for inference significantly influence response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals. In contrast, domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research.
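
As a minimal sketch of the quantities named in the abstract, the snippet below computes throughput and the Pearson correlation coefficient $R$ between inference time and response length. It assumes throughput means response words per second of inference time, which the abstract does not define. The function names and the sample values are hypothetical and are not results from the paper.

```python
import math


def throughput(word_count, inference_seconds):
    """Words generated per second (an assumed definition of throughput)."""
    if inference_seconds <= 0:
        raise ValueError("inference time must be positive")
    return word_count / inference_seconds


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements, for illustration only.
inference_times = [1.0, 2.0, 3.0, 4.0]   # seconds per response
response_lengths = [20, 40, 60, 80]      # words per response

rates = [throughput(w, t) for w, t in zip(response_lengths, inference_times)]
r = pearson_r(inference_times, response_lengths)
```

In this toy data every response is generated at the same rate, so response length is perfectly linear in inference time. Real measurements would scatter around such a line, which is why outlier handling matters.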

Paper Structure (15 sections, 11 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: The Gemma-2B architecture, showing its salient components
  • Figure 2: Salient parameters of the Gemma model. Source: team2024gemma
  • Figure 3: A framework for assessing a large foundational model's understanding of a domain
  • Figure 4: A framework for implementing the assessment of a large foundational model's understanding of a domain
  • Figure 5: Inference time (s) versus response length (words), with the estimated correlation coefficient $(R)$, central line, upper and lower bounds, and outliers. The Common responses had the highest number of outliers in all cases compared with the domain-specific responses.
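
The elements listed for Figure 5 (central line, upper and lower bounds, outliers) can be illustrated with a generic sketch. This is not ThroughCut, whose construction is not described in this section. The code assumes an ordinary least-squares line as the central line and bounds placed `k` residual standard deviations above and below it; `fit_line`, `flag_outliers`, `k`, and the data are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(xs, ys, k=2.0):
    """Indices of points lying more than k residual std devs from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Root-mean-square residual sets the half-width of the assumed bounds.
    spread = math.sqrt(sum(res * res for res in residuals) / len(residuals))
    if spread == 0:
        return []  # every point sits exactly on the line
    return [i for i, res in enumerate(residuals) if abs(res) > k * spread]


# Hypothetical data: inference time (s) and response length (words).
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lengths = [10, 20, 30, 40, 50, 160, 70, 80, 90, 100]  # index 5 is unusually long

outlier_indices = flag_outliers(times, lengths)
```

With this data only the response at index 5 falls outside the bounds. A fixed multiple of the residual spread is a simple choice; a robust alternative such as a median-based spread would be less affected by the outliers themselves.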