Incremental and Data-Efficient Concept Formation to Support Masked Word Prediction

Xin Lian, Nishant Baglodi, Christopher J. MacLellan

TL;DR

Cobweb4L combines the information-theoretic variant of category utility with a new performance mechanism that leverages multiple concepts to generate predictions. It significantly outperforms prior Cobweb performance mechanisms, which use only a single node to generate predictions.

Abstract

This paper introduces Cobweb4L, a novel approach for efficient language model learning that supports masked word prediction. The approach builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts. Each concept stores the frequencies of words that appear in instances tagged with that concept label. The system uses an attribute-value representation to encode words and their surrounding context into instances. Cobweb4L uses the information-theoretic variant of category utility and a new performance mechanism that leverages multiple concepts to generate predictions. We demonstrate that with these extensions it significantly outperforms prior Cobweb performance mechanisms that use only a single node to generate predictions. Further, we demonstrate that Cobweb4L learns rapidly and achieves performance comparable or even superior to that of Word2Vec. Next, we show that Cobweb4L and Word2Vec outperform BERT on the same task with less training data. Finally, we discuss future work to make our conclusions more robust and inclusive.

Paper Structure (12 sections, 6 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Illustration of Cobweb's learning and prediction processes. Panel (a) shows how Cobweb incorporates a new instance into its tree, sorting it from the root down to a leaf node. Panel (b) shows how Cobweb classifies an instance with an unobserved attribute (number) by sorting it from the root to a leaf node, using a process similar to the one used during learning. To predict the unobserved attribute, Cobweb uses a single node along the categorization path; the two most prevalent choices are the leaf node and the basic-level node.
  • Figure 2: An example of a Cobweb/4L instance derived from the text "The curate and the others thanked him and added their entreaties." Suppose we want to predict the anchor word thank(ed) with a context window of 3. The anchor word and the context words before it (and the others) and after it (him and added) are stored in the instance. Each context word is stored with a weighted count that reflects its position relative to the anchor: the closer a word is to the anchor, the more weight it receives.
  • Figure 3: How Cobweb/4L learns one new instance after having learned 6 instances. The process is the same as in Cobweb, except that the information-theoretic category utility is used to evaluate the operations at each traversed concept node.
  • Figure 4: How Cobweb/4L classifies an instance and predicts the word for its unobserved attribute. Starting from the root, Cobweb/4L repeatedly selects the concept node with the greatest collocation score on the search frontier (green nodes). After finding the best node $c^*$, it adds $c^*$ to the collection of expanded nodes (red nodes), extends the search frontier to the children of $c^*$ (which turn green), and continues with the next best concept node until the number of expanded nodes reaches the maximum, here $N_{max}=3$. Cobweb/4L then computes the predicted probability of a word for the unobserved attribute by combining the predicted probabilities from all expanded nodes, weighted by their collocation scores.
  • Figure 5: Cobweb/4L's prediction accuracy on the MSR Sentence Completion Challenge (Zweig and Burges, 2011) data after training on approximately one-third of the Sherlock Holmes stories, for varying values of the maximum number of expanded nodes $N_{max}$ used in prediction.
  • ...and 1 more figures
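
The instance encoding described for Figure 2 and the multi-node prediction described for Figure 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `encode_instance`, `Node`, and `predict` are hypothetical, the 1/distance weighting is an assumption (the caption only says that closer words receive more weight), and the collocation score is assumed to be already computed for each node, since its definition is not given in the captions. Normalizing the scores so that the mixture weights sum to one is likewise an assumption.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass, field


def encode_instance(tokens, anchor_idx, window=3):
    """Build an attribute-value instance: the anchor word plus weighted context counts."""
    context = defaultdict(float)
    for distance in range(1, window + 1):
        for idx in (anchor_idx - distance, anchor_idx + distance):
            if 0 <= idx < len(tokens):
                # ASSUMPTION: weight decays as 1/distance from the anchor.
                context[tokens[idx]] += 1.0 / distance
    return {"anchor": tokens[anchor_idx], "context": dict(context)}


@dataclass
class Node:
    """Hypothetical concept node for illustration only."""
    word_probs: dict   # P(word | concept) for the unobserved (anchor) attribute
    score: float       # collocation score, ASSUMED precomputed for the query
    children: list = field(default_factory=list)


def predict(root, n_max=3):
    """Best-first expansion of up to n_max nodes, then a score-weighted mixture."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties.
    frontier = [(-root.score, 0, root)]
    counter = 1
    expanded = []
    while frontier and len(expanded) < n_max:
        _, _, best = heapq.heappop(frontier)
        expanded.append(best)
        for child in best.children:
            heapq.heappush(frontier, (-child.score, counter, child))
            counter += 1
    # ASSUMPTION: scores are positive and normalized to sum to one.
    total = sum(node.score for node in expanded)
    combined = defaultdict(float)
    for node in expanded:
        for word, prob in node.word_probs.items():
            combined[word] += (node.score / total) * prob
    return dict(combined)


# Usage: encode the Figure 2 sentence, then predict over a toy tree.
tokens = "the curate and the others thanked him and added their entreaties".split()
instance = encode_instance(tokens, anchor_idx=5, window=3)

leaf = Node({"thanked": 1.0}, score=0.2)
left = Node({"thanked": 0.9, "asked": 0.1}, score=0.6, children=[leaf])
right = Node({"thanked": 0.2, "asked": 0.8}, score=0.1)
root = Node({"thanked": 0.5, "asked": 0.5}, score=0.2, children=[left, right])
prediction = predict(root, n_max=3)
```

With $N_{max}=3$ on this toy tree, the sketch expands the root, then its higher-scoring child, then that child's leaf, so the low-scoring sibling never contributes to the prediction. Setting `n_max=1` uses the root alone, which is a single-node prediction but not necessarily the leaf or basic-level node used by earlier Cobweb mechanisms.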