Injecting linguistic knowledge into BERT for Dialogue State Tracking

Xiaohan Feng; Xixin Wu; Helen Meng

TL;DR

The paper addresses the opacity and data requirements of transformer-based Dialogue State Tracking (DST). It uses an unsupervised Convex Polytopic Model (CPM) to extract linguistic features and injects them into a BERT-based DST encoder through a lightweight attention modulation. The CPM features, which include composition coefficients $a_{ij}$ and vertex coordinates, are interpretable and ground the model's decisions geometrically and semantically. On MultiWOZ, CPM-augmented TripPy achieves statistically significant gains in Joint Goal Accuracy (JGA) on versions 2.3 and 2.4. Ablations confirm the combined benefit of CPM coefficients and attention and show that higher CPM dimensionality improves performance. The approach is CPU-efficient (about 5 minutes to run CPM on the training data) and supports interpretability analyses (Integrated Gradients and vertex-based attribution), which makes it practical for building robust and transparent DST systems.
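To make the role of the composition coefficients concrete, the sketch below shows how a point inside a convex polytope can be written as a weighted combination of the polytope's vertices. It assumes that $a_{ij}$ is the weight of vertex $j$ for utterance $i$, and that the weights are non-negative and sum to 1, which is the standard definition of a convex combination. The vertex coordinates, the coefficient values, and the function names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import List, Sequence


def is_convex_weights(coeffs: Sequence[float], tol: float = 1e-9) -> bool:
    """Return True when the weights are non-negative and sum to 1."""
    return all(c >= -tol for c in coeffs) and abs(sum(coeffs) - 1.0) <= tol


def compose_point(
    coeffs: Sequence[float], vertices: Sequence[Sequence[float]]
) -> List[float]:
    """Return sum_j coeffs[j] * vertices[j], a point inside the polytope."""
    if len(coeffs) != len(vertices):
        raise ValueError("need exactly one coefficient per vertex")
    if not is_convex_weights(coeffs):
        raise ValueError("coefficients must be non-negative and sum to 1")
    dim = len(vertices[0])
    return [sum(c * v[d] for c, v in zip(coeffs, vertices)) for d in range(dim)]


# Hypothetical 3-D polytope with four vertices (coordinates are illustrative).
VERTICES = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical coefficients a_i1..a_i4 for a single utterance i.
coeffs = [0.5, 0.25, 0.25, 0.0]
point = compose_point(coeffs, VERTICES)
print(point)
```

Under this reading, each coefficient states how strongly an utterance leans toward one vertex, which is what would let a vertex be inspected for the linguistic pattern it represents.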

Abstract

Dialogue State Tracking (DST) models often employ intricate neural network architectures, necessitating substantial training data, and their inference process lacks transparency. This paper proposes a method that extracts linguistic knowledge via an unsupervised framework and subsequently utilizes this knowledge to augment BERT's performance and interpretability in DST tasks. The knowledge extraction procedure is computationally economical and does not require annotations or additional training data. The injection of the extracted knowledge can be achieved by the addition of simple neural modules. We employ the Convex Polytopic Model (CPM) as a feature extraction tool for DST tasks and illustrate that the acquired features correlate with syntactic and semantic patterns in the dialogues. This correlation facilitates a comprehensive understanding of the linguistic features influencing the DST model's decision-making process. We benchmark this framework on various DST tasks and observe a notable improvement in accuracy.

Paper Structure (16 sections, 6 equations, 5 figures, 6 tables)

Introduction
Background
Model
Task: Dialogue State Tracking
Base model: TripPy
Knowledge extraction: CPM
Modification to TripPy encoder
Experiments
Datasets
Evaluation
Implementation
Results
Discussion
Effect of Convex Polytope Dimensionality
Influence of CPM features on BERT
...and 1 more sections

Figures (5)

Figure 1: Flowchart of our proposed pipeline.
Figure 2: The 3-D MVS-type polytope. Vertices (red and in bold) are labelled V1-V4. The scattered dots denote projected utterance points.
Figure 3: Normalized change in Integrated Gradients (IG) attribution between an input sequence and the slot action prediction, calculated as the IG on CPM-assisted TripPy minus the IG on vanilla TripPy. Only a few slots are displayed due to size constraints. A higher IG value indicates a more positive attribution.
Figure 4: Normalized change in Integrated Gradients (IG) attribution between an input sequence and the span prediction for slot restaurant-pricerange on CPM-assisted TripPy, compared to vanilla TripPy. A higher IG value indicates a more positive attribution.
Figure 5: Occurrence of each vertex in the list of most important vertices for individual slots, normalized by the total occurrence of each slot. A higher occurrence percentage indicates greater significance.
