
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers

Nuo Chen, Ning Wu, Shining Liang, Ming Gong, Linjun Shou, Dongmei Zhang, Jia Li

TL;DR

This study probes LLaMA models across scales and layers using carefully designed multiple-choice (MC) tasks that test calculation, math reasoning, logical inference, truthfulness, and factual knowledge. It finds that internal knowledge and core computational abilities are largely invariant to model size, while larger models show significant gains in reasoning and truthfulness once certain size thresholds are surpassed. Layer-wise results show that upper layers dominate computation and factual knowledge, whereas lower layers retain some abstract reasoning and multilingual features, with multilingual capacity strongest in the early layers. The cross-lingual experiments (xMPS) further reveal language-specific dynamics across layers and model sizes, offering guidance for architectural and evaluation strategies beyond generation.

Abstract

This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging the model almost never automatically imparts additional knowledge or computational prowess. Instead, it can enhance reasoning abilities, especially in math problem solving, and help reduce hallucinations, but only beyond certain size thresholds; (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but still show logical thinking, multilingual, and recognition abilities, while the top layers house most of the computational power and real-world knowledge.
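
The multiple-choice probing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that, for each layer, per-token log-probabilities of every answer option have already been extracted from the model, and it uses length-normalized log-likelihood as the scoring rule, which is one common choice. The function names `option_score`, `pick_option`, and `layer_accuracy` are hypothetical.

```python
from typing import Dict, List, Sequence


def option_score(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood of one answer option (assumed scoring rule)."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_option(options_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the option with the highest score."""
    scores = [option_score(lp) for lp in options_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def layer_accuracy(
    per_layer_logprobs: Dict[int, List[List[List[float]]]],
    gold: Sequence[int],
) -> Dict[int, float]:
    """Accuracy per layer.

    per_layer_logprobs[layer][question][option] is a list of token
    log-probabilities (hypothetical input layout); gold[question] is the
    index of the correct option.
    """
    return {
        layer: sum(pick_option(q) == g for q, g in zip(questions, gold)) / len(gold)
        for layer, questions in per_layer_logprobs.items()
    }


# Toy data for two layers, two questions, two options each (made-up numbers).
toy = {
    0: [[[-2.0, -2.0], [-1.0, -1.0]], [[-0.5], [-3.0]]],
    31: [[[-0.1, -0.2], [-1.0, -1.0]], [[-0.5], [-3.0]]],
}
print(layer_accuracy(toy, gold=[0, 0]))
```

Comparing the resulting per-layer accuracies across model sizes corresponds to the "horizontal" view, and comparing them across layers of one model corresponds to the "vertical" view.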

Paper Structure

This paper contains 25 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overall comparison of LLaMA 2 7B to 70B on our probing tasks. A detailed introduction to each task is given in Section \ref{sec: probing}. Dashed lines represent the first layer of each model, while solid lines represent the last layer.
  • Figure 2: Overall comparison of LLaMA 2 7B to 70B on problems with different numbers of reasoning steps in our MPS-Rea probing tasks.
  • Figure 3: Overall comparison of LLaMA 2 7B to 70B on 5-6 bit calculations in our arithmetic probing tasks. More detailed results for 1-2 and 3-4 bit calculations are presented in Appendix \ref{sec:appendix}, Figure \ref{fig:all_cal}.
  • Figure 4: Overall comparison of LLaMA 2 7B and 70B on our probing tasks. Results for all layers of each model size are included in Appendix \ref{sec:appendix}, Tables \ref{table:all_results_7b}, \ref{table:all_results_13b}, and \ref{table:all_results_70b}.
  • Figure 5: Overall comparison of LLaMA 2 7B to 70B on our xMPS-Rea probing tasks. ES, FR, ZH, and TH refer to Spanish, French, Chinese, and Thai.
  • ...and 3 more figures