A Survey on Large Language Models with some Insights on their Capabilities and Limitations

Andrea Matarazzo; Riccardo Torlone

TL;DR

This survey analyzes the explosion of transformer-based large language models, detailing how scale, data, and architecture shape capabilities and emergent behaviors. It dissects foundational pre-training, data origins, and adaptation techniques such as instruction and alignment tuning, while surveying prominent model families (BERT, T5, GPT, LLaMA, Gemma, Claude) and domain-specialized LLMs in healthcare, finance, education, law, and science. The authors synthesize utilization strategies (in-context learning, chain-of-thought, plan-of-thought, RAG) and planning frameworks (PALMs, SELF-PLANNING, LLM-modulo) and discuss how these enable complex reasoning and multi-step tasks, tempered by concerns about reliability, safety, and bias. The paper emphasizes that emergent abilities arise from scale and data composition, explores mechanisms behind CoT/PoT, and highlights responsible deployment through external tooling and rigorous evaluation. Overall, it provides a structured map of LLM capabilities and limits to guide future research, application, and governance in increasingly complex environments.

Abstract

The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural language processing. These models now exhibit remarkable performance across various language-related tasks, such as text generation, question answering, translation, and summarization, often rivaling human-like comprehension. More intriguingly, LLMs have demonstrated emergent abilities extending beyond their core functions, showing proficiency in tasks like commonsense reasoning, code generation, and arithmetic. This survey paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities. Emphasizing models like GPT and LLaMA, we analyze the impact of exponential data and computational growth on LLM performance, while also addressing the trade-offs associated with scaling. We also examine LLM applications across sectors, such as healthcare, finance, education, and law, highlighting their adaptability and potential to solve domain-specific challenges. Central to this work are the questions of how LLMs generalize across diverse tasks, exhibit planning and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. In particular, we provide some insights into the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within LLMs, focusing on how pre-training data influences their emergence. Additionally, we investigate LLM-modulo frameworks that integrate external systems, allowing LLMs to handle complex, dynamic tasks. By analyzing these factors, this paper aims to foster the ongoing discussion on the capabilities and limits of LLMs, promoting their responsible development and application in novel and increasingly complex environments.

Paper Structure (94 sections, 29 equations, 67 figures, 41 tables)

Introduction
Motivations
Goals of the paper
Content and organization
Large Language Models
Definition and Overview
Scaling Law
Prominent Model Families
BERT
T5
GPT Series
GPT-4
OpenAI o1
Llama
Llama 2
...and 79 more sections

Figures (67)

Figure 2: Left: scaling law. Model performance increases linearly as the model size increases exponentially. Right: emergent abilities show a phase change at a certain scale where the performance suddenly increases. Source: yaofu2023emergent.
Figure 3: A diagram showing the evolution of publicly available LLMs. Source: survey.
Figure 4: BERT Architecture: The bottom layer contains the embedding representations $E_1, E_2, \ldots E_N$, which encode input tokens and serve as the input to the transformer layers (Trm). Each transformer bidirectionally processes the input embeddings, and the final output is used for downstream tasks. Source: devlin2019bert.
Figure 5: A diagram of the T5 text-to-text framework. Every task -- including translation, question answering, and classification -- is cast as feeding the model text as input and training it to generate some target text. This approach allows the same model, loss function, hyperparameters, etc., to be used across diverse tasks. Source: raffel2023exploring.
Figure 6: o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples. Source: openai2024reasoning.
...and 62 more figures
