Analyzing the Role of Semantic Representations in the Era of Large Language Models

Zhijing Jin, Yuen Chen, Fernando Gonzalez, Jiarui Liu, Jiayi Zhang, Julian Michael, Bernhard Schölkopf, Mona Diab

TL;DR

This paper asks whether traditional semantic representations, exemplified by Abstract Meaning Representation (AMR), retain value in the era of large language models (LLMs) used with fixed weights. It introduces AmrCoT, an AMR-driven prompting method that prepends the AMR of the input to the text for zero-shot tasks, and evaluates it on five diverse NLP tasks with multiple GPT-family models. AMR yields only modest, task-dependent changes and hurts performance more often than it helps, though it does help on a subset of samples, especially semantically complex ones. Through case studies, large-scale feature analyses, and ablations (including gold vs. parser-generated AMR and checks of the step-by-step reasoning), the work identifies systematic weaknesses of AMR on multi-word expressions (MWEs) and named entities, and indicates that the raw text has more influence on current LLMs' predictions than the AMR does. The study highlights the need to improve how LLMs map symbolic representations such as AMR to outputs, and suggests future directions that include training LLMs specifically to use AMR and refining prompts to better exploit semantic structure.

Abstract

Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions, named entities, and in the final inference step where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. Our code: https://github.com/causalNLP/amr_llm.
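
To make the prompting setup concrete, the following is a minimal sketch of how an AMR-augmented zero-shot prompt could be assembled, with the AMR placed before the raw text as the TL;DR describes. The function name `build_amr_prompt`, the section labels, and the instruction wording are assumptions for illustration only; they are not the prompt used in the paper, which is available in the authors' repository.

```python
def build_amr_prompt(task_instruction: str, text: str, amr: str) -> str:
    """Assemble a zero-shot prompt that shows the AMR graph before the raw text.

    The layout (instruction, AMR, text, answer cue) is a hypothetical template,
    not the authors' exact wording.
    """
    return (
        f"{task_instruction.strip()}\n\n"
        f"AMR:\n{amr.strip()}\n\n"
        f"Text:\n{text.strip()}\n\n"
        "Answer:"
    )


# Illustrative input: a textbook-style AMR for a short sentence.
example_text = "The boy wants to go."
example_amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"

prompt = build_amr_prompt(
    "Decide whether the sentence expresses a desire. Reply Yes or No.",
    example_text,
    example_amr,
)
print(prompt)
```

A baseline prompt for comparison would simply omit the `AMR:` section, so that any difference in model output can be attributed to the added graph.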

Paper Structure (48 sections, 5 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: The role of representation power in different fields. Analogous to Arabic numerals for math, AMR is designed to efficiently and explicitly represent the semantic features of text. Existing work using AMR is concerned with trainable models, whereas we investigate the use of AMR in the modern practical setup of pre-trained LLMs.
  • Figure 2: Performance of Base (in purple) and AmrCoT (in red) on 5 datasets across 5 model versions: text-davinci-001, text-davinci-002, text-davinci-003, GPT-3.5, and GPT-4.
  • Figure 3: An example showing the failure of AMR for paraphrase detection when the original sentence involves an MWE. This example is from our GoldSlang-ComposedAMR dataset.
  • Figure 4: Ablation studies of AMR and text representations in the prompt on the AMR-NER dataset using GPT-4. Starting from the AmrCoT prompt with the complete text and AMR, we randomly drop a given fraction of the tokens in the text or the AMR and measure the effect on task performance.
  • Figure 5: Ablation studies of AMR and text representations in the prompt on 1,000 random samples of the WMT dataset using GPT-4. Starting from the AmrCoT prompt with the complete text and AMR, we randomly drop a given fraction of the tokens in the text or the AMR and measure the effect on task performance.
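
The token-dropout ablation described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are taken to be whitespace-separated, the number of dropped tokens is the rounded product of length and rate, and the helper name `drop_tokens` is hypothetical. The paper's own tokenization and sampling details may differ.

```python
import random


def drop_tokens(sequence: str, drop_rate: float, seed: int = 0) -> str:
    """Randomly remove a fraction of whitespace-separated tokens.

    Surviving tokens keep their original order. Whitespace tokenization and
    the rounding rule are assumptions made for this sketch.
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be between 0 and 1")
    tokens = sequence.split()
    rng = random.Random(seed)  # local generator keeps the ablation reproducible
    n_drop = round(len(tokens) * drop_rate)
    dropped = set(rng.sample(range(len(tokens)), n_drop))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in dropped)


# Ablate the text and the AMR independently at several rates.
text = "The boy wants to go ."
amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
for rate in (0.0, 0.25, 0.5, 1.0):
    print(rate, "| text:", drop_tokens(text, rate), "| AMR:", drop_tokens(amr, rate))
```

Running the same task prompt with progressively ablated text while the AMR stays intact, and then the reverse, gives one performance curve per representation; comparing how quickly each curve degrades shows which representation the model relies on more.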