Quantifying the Capabilities of LLMs across Scale and Precision

Sher Badshah, Hassan Sajjad

TL;DR

This study evaluates how LLM performance scales with parameter count and how well it holds up under precision reduction via quantization. It compares two open-source families (Llama 2-Chat and Mistral, including Mistral Instruct and Mixtral) across 7B–70B parameters and 4-bit to 32-bit precision. Using zero-shot prompts on tasks including reasoning, natural language understanding (NLU), and misinformation detection, it finds a positive correlation between size and performance in most cases, though some reasoning tasks benefit little from scaling. Larger models also tolerate aggressive quantization well: in many scenarios they maintain high accuracy at 4-bit precision and often outperform smaller models run at higher precision under the same memory budget. For resource-constrained deployment, the results suggest that a larger model with 4-bit quantization generally offers a better efficiency-accuracy trade-off than a smaller, higher-precision model, with caveats related to task type and prompting strategy. The work offers practical guidance for efficient LLM deployment and highlights task-specific scaling and quantization effects as areas for further study.

Abstract

Scale is often cited as one of the factors that improve the performance of LLMs, resulting in models with billions and even trillions of parameters. One limitation of such large models is their high computational requirements, which restrict their usage, deployment, and debugging in resource-constrained scenarios. Two commonly used ways around these limitations are to use smaller versions of LLMs (e.g. Llama 7B instead of Llama 70B) and to lower memory requirements through quantization. While these approaches effectively address resource limits, their impact on model performance needs thorough examination. In this study, we perform a comprehensive evaluation of the effect of model scale and quantization on performance. We experiment with two major families of open-source instruct models ranging from 7 billion to 70 billion parameters. Our extensive zero-shot experiments across various tasks, including natural language understanding, reasoning, misinformation detection, and hallucination, reveal that larger models generally outperform their smaller counterparts, suggesting that scale remains an important factor in enhancing performance. We found that larger models show exceptional resilience to precision reduction and can maintain high accuracy even at 4-bit quantization for numerous tasks. Under similar memory requirements, they serve as a better solution than smaller models at high precision.
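
The "similar memory requirements" comparison can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only. It assumes weight memory is roughly parameter count times bits per parameter, and it ignores activations, the KV cache, and the per-block scale and zero-point overhead that real quantization schemes add. The function name `weight_memory_gb` and the nominal parameter counts are assumptions for illustration, not values or pairings taken from the paper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical configurations using nominal parameter counts (assumption).
configs = [
    ("7B @ 32-bit", 7e9, 32),
    ("13B @ 16-bit", 13e9, 16),
    ("70B @ 16-bit", 70e9, 16),
    ("70B @ 4-bit", 70e9, 4),
]

# Map each configuration name to its estimated weight footprint.
footprints = {name: weight_memory_gb(n, bits) for name, n, bits in configs}

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Under these assumptions, a 70B model at 4-bit needs about 35 GB for its weights, which is in the same range as a 7B model at 32-bit (about 28 GB) or a 13B model at 16-bit (about 26 GB), and a quarter of the 70B footprint at 16-bit (about 140 GB).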

Paper Structure

This paper contains 32 sections, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Performance of Llama 2-Chat and Mistral models across reasoning tasks operating under FP16 precision
  • Figure 2: Effect of 4-bit and 8-bit quantization on models' reasoning compared to half precision
  • Figure 3: Performance of Mistral and Llama 2-Chat models on TruthfulQA (Lin et al., 2021) across scale and precision
  • Figure 4: Performance of both model families on COVID-19 fact-checking (Lee et al., 2021) across precisions
  • Figure 5: ROUGE-1 scores of Llama 2-Chat and Mistral models on summarization tasks in different precisions
  • ...and 3 more figures