Towards Robustness of Text-to-Visualization Translation against Lexical and Phrasal Variability

Jinwei Lu, Yuanfeng Song, Haodi Zhang, Chen Zhang, Raymond Chi-Wing Wong

TL;DR

This work evaluates the robustness of text-to-visualization (text-to-vis) models to lexical and phrasal variations by introducing nvBench-Rob, a dataset that perturbs natural language questions (NLQs) and data schemas across three test sets. State-of-the-art models show substantial performance drops on these test sets, indicating inadequate resilience to input variations. To address this, the authors propose GRED, a Retrieval-Augmented Generation framework with three components (NLQ-Retrieval Generator, DVQ-Retrieval Retuner, and Annotation-based Debugger) that handle NLQ variants, programming-style differences, and schema variations, respectively. In the reported experiments, GRED is more robust than the baselines on nvBench-Rob, including a 32% accuracy gain over the state-of-the-art RGVisNet, which points toward more reliable text-to-vis systems in real-world settings.

Abstract

Text-to-Vis is an emerging task in natural language processing (NLP) that aims to automatically generate data visualizations from natural language questions (NLQs). Despite recent progress, existing text-to-vis models often rely heavily on lexical matching between words in the questions and tokens in data schemas. This overreliance on lexical matching can reduce model robustness against input variations. In this study, we thoroughly examine the robustness of current text-to-vis models, an area that has not previously been explored. In particular, we construct the first robustness dataset, nvBench-Rob, which contains diverse lexical and phrasal variations based on the original text-to-vis benchmark nvBench. We then find that the performance of existing text-to-vis models drops dramatically on this new dataset, implying that these methods exhibit inadequate robustness overall. Finally, we propose a novel framework based on the Retrieval-Augmented Generation (RAG) technique, named GRED, specifically designed to address input perturbations in both NLQs and data schemas. The framework consists of three parts: NLQ-Retrieval Generator, Visualization Query-Retrieval Retuner, and Annotation-based Debugger, which tackle the challenges posed by natural language variants, programming style differences, and data schema variants, respectively. Extensive experimental evaluations show that, compared to the state-of-the-art model RGVisNet in the Text-to-Vis field, GRED performs better in terms of model robustness, with a 32% increase in accuracy on the proposed nvBench-Rob dataset.
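
To make the lexical-matching concern concrete, the following minimal Python sketch links question words to schema columns by exact token overlap and shows how the links disappear once either side is reworded. It is an illustration only, not the matching mechanism of any particular model; the function `lexical_schema_links`, the questions, and the column names are hypothetical and are not taken from nvBench or nvBench-Rob.

```python
import re
from typing import List, Set


def lexical_schema_links(nlq: str, columns: List[str]) -> Set[str]:
    """Return the columns whose name tokens all appear verbatim in the question."""
    question_tokens = set(re.findall(r"[a-z0-9]+", nlq.lower()))
    linked = set()
    for column in columns:
        # Split names such as "hire_date" into ["hire", "date"].
        column_tokens = re.findall(r"[a-z0-9]+", column.lower())
        if column_tokens and all(tok in question_tokens for tok in column_tokens):
            linked.add(column)
    return linked


# Hypothetical question and schema (illustrative assumptions).
original_nlq = "Show the average salary for each department in a bar chart"
original_schema = ["department", "salary", "hire_date"]

# A reworded question and a renamed schema that keep the same meaning.
variant_nlq = "Plot the mean pay per division as a bar chart"
variant_schema = ["dept_name", "wage", "hire_date"]

print(lexical_schema_links(original_nlq, original_schema))  # both columns are linked
print(lexical_schema_links(variant_nlq, original_schema))   # no links after rewording
print(lexical_schema_links(original_nlq, variant_schema))   # no links after renaming
```

A model that depends on this kind of surface overlap has nothing to anchor on in the last two cases, which is the failure mode the robustness test sets are meant to expose.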

Paper Structure

This paper contains 36 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: (a) Text-to-vis converts natural language questions (NLQs) into data visualizations (DVs). Current approaches rely heavily on explicit matching between words in the NLQs and the table schema. (b) The robustness of existing text-to-vis methods is limited. When small variations appear in the NLQs and table schemas, the text-to-vis model fails to generate correct outputs (marked with a red '$\times$').
  • Figure 2: Statistics of the nvBench-Rob Dataset
  • Figure 3: The performance of existing text-to-vis models drops dramatically on the nvBench-Rob datasets.
  • Figure 4: The working pipeline of our proposed GRED method, which includes three steps: (a) Input the NLQ into the Retriever to obtain the top-$K$ (DB, NLQ, Schemas) instances, then input these instances along with the NLQ and Schemas into the NLQ-Retrieval Generator to get DVQ_Gen; (b) Input DVQ_Gen into the Retriever to obtain the top-$K$ DVQs, referred to as Reference DVQs, then input the Reference DVQs along with DVQ_Gen into the DVQ-Retrieval Retuner to get DVQ_Rtn; (c) Input DVQ_Rtn and the annotated databases corresponding to Schemas into the Annotation-based Debugger to obtain the final result DVQ_Dbg.
  • Figure 5: Robustness Analysis Cases
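
The three-step pipeline described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The retriever here is a simple Jaccard token-overlap ranker standing in for whatever retriever the paper uses, the language model is passed in as a plain callable, and the prompt wording, the `Example` fields, the function names, and the sample data are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One stored instance (field names are hypothetical)."""
    nlq: str
    schema: str
    dvq: str


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query: str, library: List[Example],
             key: Callable[[Example], str], k: int) -> List[Example]:
    """Top-k examples by Jaccard token overlap (a stand-in retriever)."""
    query_tokens = _tokens(query)

    def score(example: Example) -> float:
        candidate_tokens = _tokens(key(example))
        union = query_tokens | candidate_tokens
        return len(query_tokens & candidate_tokens) / len(union) if union else 0.0

    return sorted(library, key=score, reverse=True)[:k]


def gred_pipeline(nlq: str, schema: str, annotations: str,
                  library: List[Example], llm: Callable[[str], str],
                  k: int = 2) -> str:
    # (a) NLQ-Retrieval Generator: retrieve similar questions, then generate.
    nlq_refs = retrieve(nlq, library, key=lambda ex: ex.nlq, k=k)
    demos = "\n".join(
        f"NLQ: {ex.nlq}\nSchema: {ex.schema}\nDVQ: {ex.dvq}" for ex in nlq_refs
    )
    generated_dvq = llm(f"[generate]\n{demos}\nNLQ: {nlq}\nSchema: {schema}\nDVQ:")

    # (b) DVQ-Retrieval Retuner: retrieve similar queries as style references.
    dvq_refs = retrieve(generated_dvq, library, key=lambda ex: ex.dvq, k=k)
    references = "\n".join(ex.dvq for ex in dvq_refs)
    retuned_dvq = llm(
        f"[retune]\nReference DVQs:\n{references}\nDVQ: {generated_dvq}\nRetuned DVQ:"
    )

    # (c) Annotation-based Debugger: check the query against schema annotations.
    return llm(
        f"[debug]\nAnnotated schema:\n{annotations}\nDVQ: {retuned_dvq}\nDebugged DVQ:"
    )


# Hypothetical library and a stub model so the sketch runs without any service.
library = [
    Example("Show average salary by department as a bar chart",
            "employees(department, salary)",
            "Visualize BAR SELECT department , AVG(salary) FROM employees GROUP BY department"),
    Example("Draw a pie chart of the number of flights per airline",
            "flights(airline, flight_id)",
            "Visualize PIE SELECT airline , COUNT(*) FROM flights GROUP BY airline"),
]


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call; always returns the same query."""
    return "Visualize BAR SELECT dept_name , AVG(wage) FROM staff GROUP BY dept_name"


print(gred_pipeline("Plot mean salary for every department",
                    "staff(dept_name, wage)",
                    "dept_name: department of the employee; wage: salary",
                    library, stub_llm, k=1))
```

Passing the model in as a callable keeps the three stages testable in isolation: each stage is one retrieval (where applicable) followed by one prompt, and the output of each stage is the input to the next.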