MASSIVE Multilingual Abstract Meaning Representation: A Dataset and Baselines for Hallucination Detection

Michael Regan, Shira Wein, George Baker, Emilio Monti

TL;DR

MASSIVE-AMR introduces the largest multilingual AMR QA dataset, pairing more than 84,000 text-to-graph annotations for 1,685 information-seeking utterances across 50+ languages, enabling evaluation of multilingual AMR parsing, SPARQL generation, and SPARQL-hallucination detection for KBQA. The authors detail corpus creation with localized entities, annotation guidelines, and inter-annotator agreement, and demonstrate the utility of AMR/SPARQL in detecting hallucinations and calibrating QA systems. Through in-context learning and fine-tuning experiments on AMR and SPARQL parsing across language subsets, the work shows SPARQL parsing can achieve high executability while AMR parsing lags behind engineered baselines, highlighting persistent challenges for LLM-based structured parsing. The authors also find that hallucination detection with a joint AMR-SPARQL model, in both its easy and hard variants, is difficult for current models, including GPT-4, underscoring the need for robust evaluation metrics and larger, more diverse multilingual data. By releasing MASSIVE-AMR, the paper provides a valuable resource for advancing multilingual structured QA, model interpretability, and methods to mitigate hallucinations in knowledge-base querying.

Abstract

Abstract Meaning Representation (AMR) is a semantic formalism that captures the core meaning of an utterance. There has been substantial work developing AMR corpora in English and more recently across languages, though the limited size of existing datasets and the cost of collecting more annotations are prohibitive. With both engineering and scientific questions in mind, we introduce MASSIVE-AMR, a dataset with more than 84,000 text-to-graph annotations, currently the largest and most diverse of its kind: AMR graphs for 1,685 information-seeking utterances mapped to 50+ typologically diverse languages. We describe how we built our resource and its unique features before reporting on experiments using large language models for multilingual AMR and SPARQL parsing as well as applying AMRs for hallucination detection in the context of knowledge base question answering, with results shedding light on persistent issues using LLMs for structured parsing.

Paper Structure (31 sections, 1 equation, 3 figures, 13 tables)

Figures (3)

  • Figure 1: As a proxy for QA correctness, we test a joint AMR-SPARQL model, controlling for semantic relations (in bold). Given an utterance like Who created Iron Man?, a model outputs an N-best list of candidates of mixed representation types. When the relation creator is allowed (top), we expect the model to rank SPARQL higher than AMR. If we change the ontology, the AMR may rank higher (middle), suggesting an ambiguity exists (creator$\approx$author). Models also produce non-existent relations (bottom), detected via ranking or a look-up operation.
  • Figure 2: Example prompt for SPARQL parsing with generation completion and associated features. Our controlled setting for hallucination detection is then reduced to verifying that all relations in a parsed query are in the given list; the model outputs this verification along with the parsed sparql_query. For reasons of space, we show only 3 (of 140) relations from the allowed_relation_list (second system message in prompt).
  • Figure 3: Examples of SPARQL parsing using GPT-3.5 showing hallucinations and hallucination detection.
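
The look-up check described in the captions for Figures 1 and 2 (verifying that every relation in a parsed query appears in the allowed list) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes relations are written as Wikidata-style prefixed names such as wdt:P170, which the captions do not confirm. The function names, the example list, and the example queries are hypothetical.

```python
import re

# Assumption: relations appear as Wikidata-style prefixed names (e.g. "wdt:P170").
# The paper's actual relation format may differ.
RELATION_PATTERN = re.compile(r"\bwdt:P\d+")


def extract_relations(sparql_query: str) -> set[str]:
    """Return the set of relation identifiers found in a SPARQL query string."""
    return set(RELATION_PATTERN.findall(sparql_query))


def find_disallowed_relations(sparql_query: str, allowed_relations: list[str]) -> set[str]:
    """Return relations used in the query that are not in the allowed list.

    An empty result means the query passes the look-up check; a non-empty
    result flags candidate hallucinated relations.
    """
    return extract_relations(sparql_query) - set(allowed_relations)


# Hypothetical allowed list (the figure mentions 140 relations; three shown here).
allowed_relation_list = ["wdt:P170", "wdt:P50", "wdt:P57"]

# Hypothetical parsed queries for "Who created Iron Man?"
valid_query = 'SELECT ?x WHERE { ?item rdfs:label "Iron Man"@en . ?item wdt:P170 ?x . }'
suspect_query = 'SELECT ?x WHERE { ?item rdfs:label "Iron Man"@en . ?item wdt:P9999 ?x . }'

print(find_disallowed_relations(valid_query, allowed_relation_list))
print(find_disallowed_relations(suspect_query, allowed_relation_list))
```

A set difference keeps the check independent of how often or in what order a relation appears in the query. A regular expression is only a rough stand-in for a real SPARQL parser, so this sketch would miss relations written as full IRIs.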