Exploring Group and Symmetry Principles in Large Language Models

Shima Imani, Hamid Palangi

TL;DR

This paper introduces a framework for evaluating Large Language Models (LLMs) that is grounded in group and symmetry principles, which have played a crucial role in fields such as physics and mathematics. It investigates the performance of LLMs on four group properties: closure, identity, inverse, and associativity.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across a wide range of applications; however, assessing their reasoning capabilities remains a significant challenge. In this paper, we introduce a framework grounded in group and symmetry principles, which have played a crucial role in fields such as physics and mathematics, and offer another way to evaluate the capabilities of LLMs. While the proposed framework is general, to showcase the benefits of employing these properties, we focus on arithmetic reasoning and investigate the performance of these models on four group properties: closure, identity, inverse, and associativity. Our findings reveal that the LLMs studied in this work struggle to preserve group properties across different test regimes. In the closure test, we observe biases towards specific outputs and an abrupt degradation in performance from 100% to 0% after a specific sequence length. The models also perform poorly in the identity test, which represents adding irrelevant information to the context, and show sensitivity when subjected to the inverse test, which examines the robustness of the model with respect to negation. In addition, we demonstrate that breaking down problems into smaller steps helps LLMs in the associativity test that we conducted. To support these tests, we have developed a synthetic dataset, which will be released.
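
For reference, the four properties named in the abstract are the standard group axioms. They can be stated for a set $G$ with a binary operation $\circ$ as follows; the symbols $G$, $\circ$, $e$, and $a^{-1}$ are notation chosen here for illustration and are not taken from the paper.

$$
\begin{aligned}
\text{Closure:} \quad & a \circ b \in G && \text{for all } a, b \in G \\
\text{Identity:} \quad & \text{there exists } e \in G \text{ with } e \circ a = a \circ e = a && \text{for all } a \in G \\
\text{Inverse:} \quad & \text{for each } a \in G \text{ there exists } a^{-1} \in G \text{ with } a \circ a^{-1} = a^{-1} \circ a = e \\
\text{Associativity:} \quad & (a \circ b) \circ c = a \circ (b \circ c) && \text{for all } a, b, c \in G
\end{aligned}
$$

One natural reading for the arithmetic setting is the integers under addition, where $e = 0$ and $a^{-1} = -a$. This reading is an assumption, since the abstract does not spell out the group used.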

Paper Structure (10 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Closure test: Average accuracy of GPT-4-32k and GPT-3.5 for sums of ones. The x-axis illustrates the varying lengths of expressions composed of summations of repeated ones. The y-axis denotes the accuracy of the two LLMs, GPT-4-32k and GPT-3.5. The color represents the average accuracy obtained from $10$ runs for each test.
  • Figure 2: Number of times GPT-4-32k outputs $100$ (blue) and $50$ (red) compared to ground truth for closure expressions. This visualization emphasizes the biases in the LLMs' responses and offers a deeper insight into their limitations when handling summation tasks.
  • Figure 3: Identity Test. The average accuracy of GPT-4-32k and GPT-3.5 when evaluating sums of ones with varying expression lengths and applying different symmetries. The x-axis represents the expression lengths, while the y-axis indicates the accuracy for GPT-4-32k and GPT-3.5 under various symmetry conditions. The color intensity signifies the average accuracy obtained from $10$ runs for each test.
  • Figure 4: Inverse Test. The average accuracy of GPT-4-32k and GPT-3.5 when evaluating sums of ones and their inverses for various lengths. The x-axis represents the expression lengths, while the y-axis indicates the accuracy for GPT-4-32k and GPT-3.5 under various inverse symmetry conditions. The color intensity signifies the average accuracy obtained from $10$ runs for each test.
  • Figure 5: Associativity Test. The average accuracy of GPT-4-32k and GPT-3.5 for the associativity test for test 1 (top) and test 2 (bottom). The x-axis represents the expression lengths, while the y-axis indicates the accuracy for GPT-4-32k and GPT-3.5. The color intensity signifies the average accuracy obtained from $10$ runs for each test.
  • ...and 1 more figures
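
The captions above describe tests built from sums of repeated ones at varying expression lengths. A minimal sketch of how such expressions could be generated, each paired with its ground-truth value, is given below. The function names, the use of `+ 0` for the identity test, the use of `- 1` for the inverse test, and the fixed-size parenthesised chunks for the associativity test are all assumptions made for illustration; the paper's actual prompts and dataset format are not shown in this excerpt.

```python
from typing import Tuple


def ones_sum(n: int) -> str:
    """Return the string '1 + 1 + ... + 1' containing n ones."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return " + ".join(["1"] * n)


def closure_case(n: int) -> Tuple[str, int]:
    """Sum of n ones; the ground truth is n."""
    return ones_sum(n), n


def identity_case(n: int, k: int) -> Tuple[str, int]:
    """Sum of n ones followed by k zeros (assumed identity element); still n."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return " + ".join(["1"] * n + ["0"] * k), n


def inverse_case(n: int, k: int) -> Tuple[str, int]:
    """Sum of n ones with k ones subtracted (assumed inverse); value is n - k."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return ones_sum(n) + " - 1" * k, n - k


def associativity_case(n: int, chunk: int) -> Tuple[str, int]:
    """Sum of n ones grouped into parenthesised chunks; regrouping keeps n."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    groups = [
        "(" + ones_sum(min(chunk, n - start)) + ")"
        for start in range(0, n, chunk)
    ]
    return " + ".join(groups), n


if __name__ == "__main__":
    for expr, truth in (
        closure_case(4),
        identity_case(3, 2),
        inverse_case(4, 1),
        associativity_case(5, 2),
    ):
        print(f"{expr} = {truth}")
```

Returning the expected value alongside each expression means a model's answer can be scored by exact match, without parsing the expression a second time.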