Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making

Oluyemi Enoch Amujo; Shanchieh Jay Yang

TL;DR

This paper addresses how to benchmark large foundational models before domain-specific fine-tuning, comparing common prompts with domain-specific prompts in cybersecurity, medicine, and finance on Gemma-2B and Gemma-7B. It introduces ThroughCut, an outlier-detection framework that flags response throughput outliers based on conciseness, and a two-layer methodology that measures inference time, response length, throughput, quality, and resource usage. The evaluation covers 16 configurations against a ChatGPT reference baseline. Key findings show that model size and prompt type strongly influence latency and output length, and the study reports the correlation coefficient $R$ between inference time and response length. The 2B models deliver higher throughput, while the 7B models can achieve competitive STS/ROUGE-L scores in some domains. Common prompts yield longer and more variable responses than domain-specific prompts, especially under length restrictions. The framework supports informed fine-tuning decisions and emphasizes multidomain benchmarking to improve the reliability of downstream domain adaptation in AI systems.

Abstract

Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries; such an evaluation can serve as a benchmark prior to fine-tuning for domain-specific downstream tasks. This study evaluates two LLMs, Gemma-2B and Gemma-7B, on prompts from cybersecurity, medicine, and finance, compared to common knowledge queries. Its methodology comprises problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. The study assesses inference time, response length, throughput, quality, and resource utilization, and investigates the correlations between these factors. The results indicate that model size and the type of prompt used for inference significantly influence response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals, whereas domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research.
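As a rough illustration of two of the quantities named in the abstract, the sketch below computes a throughput value and a Pearson correlation coefficient between inference time and response length. It assumes throughput means words generated per second and that words are counted by whitespace splitting; the paper's exact definitions may differ. The function names and the sample measurements are hypothetical.

```python
from math import sqrt


def throughput(response: str, inference_seconds: float) -> float:
    """Words per second, assuming a whitespace-split word count."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return len(response.split()) / inference_seconds


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements: (inference time in seconds, response word count).
times = [1.2, 2.5, 3.1, 4.8, 6.0]
lengths = [40, 95, 110, 180, 230]
r = pearson_r(times, lengths)
print(f"R between inference time and response length: {r:.3f}")
```

A value of R near 1 would indicate that longer responses take proportionally longer to generate; the paper reports such coefficients per model and prompt type.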

Paper Structure (15 sections, 11 equations, 5 figures, 2 tables)

Introduction
Literature Review
Large Language Foundation Model (LLFM)
LLM Text Generation and Inference
Google Gemma Architecture
Methodology
Problem Formulation
Proposed Framework
Data Analysis
Formulation of Outlier Technique
Dataset
Result Discussion
Analysis of Response
Analysis of Correlation and Outliers
Conclusion

Figures (5)

Figure 1: A Gemma-2B architecture showing the salient components
Figure 2: Salient parameters of Gemma model. Source: team2024gemma
Figure 3: A framework for assessing a large foundational model's understanding of a domain
Figure 4: A framework for implementing the assessment of a large foundational model's understanding of a domain
Figure 5: Inference time (s) and response word length plots, estimating the correlation coefficient $(R)$, central line, upper and lower bounds, and outliers. The Common model had the highest number of outliers in all cases compared to the domain-specific responses.
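The elements named in the Figure 5 caption (a central line, upper and lower bounds, and outliers) can be illustrated with a generic residual-based check. The sketch below is not ThroughCut, whose formulation is not reproduced on this page; it assumes a least-squares central line and bounds placed k residual standard deviations above and below it, with k = 2 chosen arbitrarily. All names and data are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(xs, ys, k=2.0):
    """Mark points lying more than k residual standard deviations
    from the fitted central line (k is an assumed threshold)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = sqrt(sum(r * r for r in residuals) / len(residuals))
    if spread == 0:
        return [False] * len(xs)
    return [abs(r) > k * spread for r in residuals]


# Hypothetical data: inference time (s) vs. response word length,
# with one response far longer than the trend suggests.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lengths = [10, 20, 30, 40, 150, 60, 70, 80, 90, 100]
flags = flag_outliers(times, lengths)
print([t for t, is_out in zip(times, flags) if is_out])
```

Because the outlier itself pulls the fitted line and inflates the residual spread, a robust fit or a different bound may be preferable in practice; this is only one simple way to obtain a central line and bounds.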
