Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications

Mahaman Sanoussi Yahaya Alassan, Jessica López Espejel, Merieme Bouhandi, Walid Dahhane, El Hassane Ettifouri

TL;DR

The paper addresses the practical challenge of selecting LLMs for industrial machine reading comprehension (MRC) by benchmarking open-source and proprietary models under resource-constrained deployment. It compares GPT-3.5 and GPT-4 with open-source models of at most 7B parameters (e.g., Mistral-7B-Instruct/OpenOrca, Llama-2-7B Chat, Dolphin-2_6-Phi-2) on a 40-sample MRC dataset generated with few-shot prompting, scoring answers with exact-match and ROUGE metrics and running the open-source models on CPU at several quantization levels. GPT-4 achieves the top performance, while the open-source families offer competitive accuracy with favorable deployment characteristics, particularly under resource limits and private-infrastructure requirements; quantization and model size substantially affect accuracy, latency, and memory. The study concludes that open-source models can be viable, cost-effective alternatives for regulated industries, and it guides practitioners in balancing accuracy, speed, privacy, and total cost of ownership for industrial MRC deployments.
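As a rough illustration of the two metric families named above, the following Python sketch scores a predicted answer against a reference. It is not the paper's evaluation code: the SQuAD-style normalization (lowercasing, stripping punctuation and English articles), the choice of ROUGE-L as the ROUGE variant, and the function names `normalize`, `exact_match`, and `rouge_l_f1` are all assumptions made for this example.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper does not state its own.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over normalized, whitespace-split tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only answers that coincide with the reference after normalization, whereas the ROUGE-L score gives partial credit for shared word sequences, which is why the two are often reported together for short factual answers.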

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable performance in various Natural Language Processing (NLP) applications, such as sentiment analysis, content generation, and personalized recommendations. Despite their impressive capabilities, there remains a significant need for systematic studies concerning the practical application of LLMs in industrial settings, as well as the specific requirements and challenges related to their deployment in these contexts. This need is particularly critical for Machine Reading Comprehension (MRC), where factual, concise, and accurate responses are required. To date, most MRC systems rely on Small Language Models (SLMs) or Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks. This trend is evident in the SQuAD2.0 rankings on the Papers with Code table. This article presents a comparative analysis between open-source LLMs and proprietary models on this task, aiming to identify lightweight, open-source alternatives that offer performance comparable to proprietary models.

Paper Structure (10 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Overview of our pipeline. The pre-processing stage cleans raw text data, removing non-text elements. The prompting stage builds dynamic prompts with the given user query and document (namely, the context of the query). The post-processing stage formats the LLM's responses and structures them into a dictionary.
  • Figure 2: Evaluated Models
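
The three stages described in the Figure 1 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`preprocess`, `build_prompt`, `postprocess`, `answer_query`), the prompt wording, the treatment of "non-text elements" as HTML-like tags, the keys of the output dictionary, and the `llm` callable are all hypothetical.

```python
import re
from typing import Callable


def preprocess(raw_text: str) -> str:
    """Pre-processing: remove non-text elements and tidy whitespace.

    Assumption: "non-text elements" are approximated here by HTML-like tags.
    """
    without_tags = re.sub(r"<[^>]+>", " ", raw_text)
    return re.sub(r"\s+", " ", without_tags).strip()


def build_prompt(query: str, document: str) -> str:
    """Prompting: combine the user query with its context document.

    The template wording is hypothetical.
    """
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {document}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


def postprocess(query: str, raw_response: str) -> dict:
    """Post-processing: format the response and structure it as a dictionary.

    The dictionary keys are an assumption for this sketch.
    """
    return {"query": query, "answer": raw_response.strip()}


def answer_query(query: str, raw_document: str, llm: Callable[[str], str]) -> dict:
    """Run the three stages in order; `llm` is any prompt-to-text callable."""
    context = preprocess(raw_document)
    prompt = build_prompt(query, context)
    return postprocess(query, llm(prompt))
```

Because `llm` is just a callable that maps a prompt string to a response string, the same sketch would work with a locally hosted quantized model or a remote API client, which mirrors the paper's goal of comparing both kinds of model on the same task.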