ListOps: A Diagnostic Dataset for Latent Tree Learning

Nikita Nangia, Samuel R. Bowman

TL;DR

This work introduces ListOps, a toy dataset designed to diagnose the parsing ability of latent tree learning models by enforcing a single correct parsing strategy. It compares a standard sequence model (LSTM) and a tree-based model (TreeLSTM) with latent tree approaches (RL-SPINN, ST-Gumbel), finding that the latent tree methods fail to learn reliable parses and often underperform the sequential baseline, even though the task rewards correct parsing. The paper demonstrates a clear performance gap between sequential RNNs and TreeRNNs that are given the correct parses, and positions ListOps as a diagnostic tool to drive development of latent tree models that actually learn to parse. The findings emphasize the need for principled parsing mechanisms to improve latent tree learning and downstream sentence representations.

Abstract

Latent tree learning models learn to parse a sentence without syntactic supervision, and use that parse to build the sentence representation. Existing work on such models has shown that, while they perform well on tasks like sentence classification, they do not learn grammars that conform to any plausible semantic or syntactic formalism (Williams et al., 2018a). Studying the parsing ability of such models in natural language can be challenging due to the inherent complexities of natural language, like having several valid parses for a single sentence. In this paper we introduce ListOps, a toy dataset created to study the parsing ability of latent tree models. ListOps sequences are in the style of prefix arithmetic. The dataset is designed to have a single correct parsing strategy that a system needs to learn to succeed at the task. We show that the current leading latent tree models are unable to learn to parse and succeed at ListOps. These models achieve accuracies worse than purely sequential RNNs.
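To make the "prefix arithmetic" style concrete, here is a minimal sketch of an evaluator for such sequences. It assumes the four operators described in the full paper (MAX, MIN, MED, and sum modulo 10) applied to single-digit integers. The token spellings ("[MAX", "[SM", "]"), the rounding rule for MED on even-length lists, and the helper name evaluate are assumptions made for illustration, not the authors' released code.

```python
import statistics

# Operator set as described in the ListOps paper; the exact token
# spellings and the MED rounding rule are assumptions of this sketch.
OPS = {
    "[MAX": max,
    "[MIN": min,
    "[MED": lambda xs: int(statistics.median(xs)),  # truncates on even-length lists (assumed)
    "[SM": lambda xs: sum(xs) % 10,                 # sum modulo 10
}


def evaluate(sequence: str) -> int:
    """Evaluate a whitespace-tokenized ListOps sequence using a stack."""
    stack = []
    for token in sequence.split():
        if token in OPS:
            # An operator token opens a new list.
            stack.append(token)
        elif token == "]":
            # Closing bracket: collect arguments back to the nearest operator.
            args = []
            while stack and not isinstance(stack[-1], str):
                args.append(stack.pop())
            if not stack or not args:
                raise ValueError("unbalanced or empty list")
            op = stack.pop()
            # Arguments were popped in reverse order; restore the original order.
            stack.append(OPS[op](args[::-1]))
        else:
            stack.append(int(token))
    if len(stack) != 1 or isinstance(stack[0], str):
        raise ValueError("malformed sequence")
    return stack[0]


if __name__ == "__main__":
    print(evaluate("[MAX 2 9 [MIN 4 7 ] 0 ]"))
```

The nested list must be reduced before the outer operator can be applied, which is why a model that composes tokens in the bracketed order has an advantage on this task: here the inner MIN yields 4, and the outer MAX over 2, 9, 4, 0 yields 9.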

Paper Structure

This paper contains 16 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example of a parsed ListOps sequence. The parse is left-branching within each list, and each constituent is either a partial list, an integer, or the final closing bracket.
  • Figure 2: Left: Parses from the RL-SPINN model. Right: Parses from the ST-Gumbel model. For the first set of examples in the top row, both models predict the wrong value (truth: 6, pred: 5). In the second row, RL-SPINN predicts the correct value (truth: 7) while ST-Gumbel does not (pred: 2). In the third row, RL-SPINN predicts the correct value (truth: 6) and generates the same parse as the ground-truth tree; ST-Gumbel predicts the wrong value (pred: 5).
  • Figure 3: Distribution of average tree depth in the ListOps training dataset.
  • Figure 4: Model accuracy on ListOps test set by size of training dataset.