Words or Characters? Fine-grained Gating for Reading Comprehension

Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W. Cohen, Ruslan Salakhutdinov

TL;DR

This paper addresses how to combine word-level and character-level token representations for reading comprehension. It introduces a fine-grained gating mechanism conditioned on token properties and extends the idea to model document-query interactions via a gated attention framework. The approach improves results across datasets, achieving state-of-the-art results on the Children's Book Test and strong performance on SQuAD and Who Did What, while also showing gains on Twitter tagging. Visualizations of the learned gates show intuitive behavior: rare or morphologically rich tokens lean on character-level information, while frequent tokens rely more on word-level representations.

Abstract

Previous work combines word-level and character-level representations using concatenation or scalar weighting, which is suboptimal for high-level tasks like reading comprehension. We present a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words. We also extend the idea of fine-grained gating to modeling the interaction between questions and paragraphs for reading comprehension. Experiments show that our approach can improve the performance on reading comprehension tasks, achieving new state-of-the-art results on the Children's Book Test dataset. To demonstrate the generality of our gating mechanism, we also show improved results on a social media tag prediction task.
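The word-character gate described above can be illustrated with a minimal sketch. It assumes the gate is a sigmoid of a linear map over a token feature vector (for example, concatenated NER, POS, and frequency indicators) and that a high gate value favors the character-level vector, which matches the figure captions. The function names, argument names, and toy dimensions are hypothetical and are not taken from the authors' code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fine_grained_gate(char_vec, word_vec, features, W_g, b_g):
    """Mix character- and word-level vectors with a per-dimension gate.

    Assumed form (a sketch, not the authors' implementation):
        g = sigmoid(W_g @ features + b_g)
        h = g * char_vec + (1 - g) * word_vec
    W_g has one row per output dimension and one column per feature.
    """
    if not (len(char_vec) == len(word_vec) == len(W_g) == len(b_g)):
        raise ValueError("char_vec, word_vec, W_g rows, and b_g must share one length")
    # One gate value per dimension, computed from the token's features.
    gate = [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(W_g, b_g)
    ]
    # Element-wise interpolation between the two representations.
    mixed = [g * c + (1.0 - g) * w for g, c, w in zip(gate, char_vec, word_vec)]
    return gate, mixed


# Toy example: 2-dimensional representations, 2 binary token features.
char_vec = [1.0, 2.0]
word_vec = [3.0, 4.0]
features = [1.0, 0.0]               # hypothetical one-hot feature, e.g. "is a named entity"
W_zero = [[0.0, 0.0], [0.0, 0.0]]   # zero weights give a gate of 0.5 everywhere
gate, mixed = fine_grained_gate(char_vec, word_vec, features, W_zero, [0.0, 0.0])
print(gate, mixed)
```

Because the gate is a vector rather than a single scalar, each dimension of the token representation can draw on the character-level or word-level source to a different degree, which is what distinguishes this scheme from the scalar weighting the abstract criticizes.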

Paper Structure

This paper contains 13 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Word-character fine-grained gating. The two lookup tables are shared. "NER", "POS", "frequency" refer to named entity tags, part-of-speech tags, document frequency features.
  • Figure 2: Paragraph-question fine-grained gating.
  • Figure 3: Visualization of the weight matrix $\mathbf{W}_g$. Weights for each feature are averaged. Red means high and yellow means low. High weight values favor character-level representations, and low weight values favor word-level representations. "Organization", "Person", "Location", and "O" are named entity tags; "DOCLEN-n" are document frequency features (larger $n$ means higher frequency, $n$ from 0 to 4); others are POS tags.
  • Figure 4: Visualization of gate values in the text. Red means high and yellow means low. High gate values favor character-level representations, and low gate values favor word-level representations.