DNCs Require More Planning Steps

Yara Shamshoum; Nitzan Hodos; Yuval Sieradzki; Assaf Schuster

TL;DR

This work investigates how computational time and external memory constraints influence the generalization of differentiable algorithmic solvers, using the Differentiable Neural Computer (DNC) as a testbed. It introduces a planning-budget framework with $p(n)$ and memory-size considerations, revealing that a fixed small budget hinders generalization, while adaptive planning and memory strategies improve performance across tasks such as Shortest Path, MinCut, Convex Hull, and Associative Recall. Key contributions include memory-extension techniques coupled with temperature-based reweighting, adaptive memory $m(n)$, and a stochastic planning budget during training, all supported by empirical results showing phase transitions in $A_n(p)$ and improved stability. The findings provide general guidelines for designing resource-aware algorithmic solvers, with potential implications for more advanced models and LLMs that must balance time and memory in real-world settings.

Abstract

Many recent works use machine learning models to solve various complex algorithmic problems. However, these models attempt to reach a solution without considering the problem's required computational complexity, which can be detrimental to their ability to solve it correctly. In this work we investigate the effect of computational time and memory on the generalization of implicit algorithmic solvers. To do so, we focus on the Differentiable Neural Computer (DNC), a general problem solver that also lets us reason directly about its usage of time and memory. We argue that the number of planning steps the model is allowed to take, which we call the "planning budget", is a constraint that can cause the model to generalize poorly and hurt its ability to fully utilize its external memory. We evaluate our method on Graph Shortest Path, Convex Hull, Graph MinCut and Associative Recall, and show how the planning budget can drastically change the behavior of the learned algorithm, in terms of learned time complexity, training time, stability and generalization to inputs larger than those seen during training.
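To make the notion of a planning budget concrete, the following is a minimal sketch of how the length of the planning phase could depend on the input size $n$. It is an illustration, not the authors' implementation: the phase ordering follows the DNC forward pass described in the Figure 1 caption, the budgets $p(n)=10$ and $p(n)=n$ are the ones named in the figure captions, and the helper names (`constant_budget`, `linear_budget`, `phase_schedule`) and the `answer_steps` parameter are hypothetical.

```python
from typing import Callable, List


def constant_budget(steps: int) -> Callable[[int], int]:
    """Budget that ignores the input size, e.g. the baseline p(n) = 10."""
    return lambda n: steps


def linear_budget(n: int) -> int:
    """Adaptive budget that grows with the input size, p(n) = n."""
    return n


def phase_schedule(
    n_inputs: int,
    budget: Callable[[int], int],
    answer_steps: int = 1,  # assumption: the answer length is task dependent
) -> List[str]:
    """Label every time step of one forward pass with its phase.

    Description steps feed the input items (e.g. graph edges), one query
    step follows, then budget(n_inputs) planning steps with no external
    input, and finally the answer steps.
    """
    planning_steps = budget(n_inputs)
    return (
        ["describe"] * n_inputs
        + ["query"]
        + ["plan"] * planning_steps
        + ["answer"] * answer_steps
    )


if __name__ == "__main__":
    for n in (10, 75, 150):
        fixed = phase_schedule(n, constant_budget(10)).count("plan")
        adaptive = phase_schedule(n, linear_budget).count("plan")
        print(f"n={n}: constant budget -> {fixed} planning steps, "
              f"p(n)=n -> {adaptive} planning steps")
```

Under a constant budget the number of planning steps stays the same however large the input becomes, whereas under $p(n)=n$ it scales with the input. That difference is the constraint the abstract argues can limit generalization.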

Paper Structure (27 sections, 4 equations, 14 figures, 3 tables)

Introduction
Our Contributions
Related Work
Memory Augmented Neural Networks
Adaptive Computation Time
DNC Recap
Method
Motivation
Generalization with DNC
Experiments
Training
Memory Extension for Generalization
Planning Budget Affects Generalization
Empirically Determined Planning Budget
Stochastic Planning Budget During Training
...and 12 more sections

Figures (14)

Figure 1: An Example of a DNC Forward Pass on an Input of the Shortest Path Task. The DNC maintains read (orange) and write (blue) distributions over a memory with $N$ cells. The process begins with the description phase, where the model receives the input, in this case graph edges, and writes them to memory. Then, in the query phase, the model is given the source and target nodes $(s,t)$, which are written to memory as well. Next, during the planning phase, the model does not receive any new external input, but can access and update its memory. Finally, in the answer phase, the model outputs the edges that form the calculated shortest path. Decoding the read distribution during the planning phase can provide insight into how the model traverses the graph to find the shortest path. By using the write distribution from the description phase, we can infer where each edge is saved in memory. This allows us to plot the read distribution over these locations during the planning phase on the graph itself, visualizing how the model locates its target.
Figure 2: Effect of Different Memory Extension Techniques on Generalization - Evaluated on Graph Shortest Path task with $p(n)=n$. Graphs seen during training have at most 75 edges, marked in red. The memory size used for training is 200 cells, marked in black. A performance drop occurs around the original memory size of $m=200$ when attempting to generalize without memory extension. Extending the memory five times to $m=1000$ results in near-zero accuracy on all input sizes. Introducing our reweighting technique with $\tau = 0.65$ enables generalization to much larger inputs. Finally, using an adaptive memory during inference allows generalization while maintaining high accuracy on smaller inputs too.
Figure 3: Effect of Memory Reweighting on the Strength Scalar $\beta$ - Evaluated on Graph Shortest Path with $p(n)=n$. In DNC, read and write operations are smooth and as a result add noise to the memory, an effect that is more prominent when the memory is extended. As $\beta$ attempts to calibrate the smoothness of the similarity scores between the key and the memory cells, we expect that the same $\beta$ value will be optimal when using an input that is 5 times larger within a memory that is 5 times larger, as the same ratio of noise values gets into the similarity score. When applying our technique to (a) a small input and (b) a large input, the temperature reweighting recalibrates $\beta$ to be optimal for the memory used during training and the noise ratio determined by the input sizes seen during training. Hence, large inputs within the extended memory will gain performance as this ratio is matched, while for small inputs this will cause degradation in performance. We also notice how the temperature reweighting drastically reduces the standard deviation of $\beta$, which is expected as it sharpens the distribution, making it more certain.
Figure 4: Effect of Planning Budget on Generalization and Training Efficiency - Generalization of various planning budgets on (a) Shortest Path Task and (b) Convex Hull Task demonstrates improvement with some larger budgets as well as the adaptive budget. The largest training sample is marked in red. In both tasks, the model's generalization improves over the baseline of the previously used planning budget $p(n)=10$. Subfigure (c) illustrates the estimated number of FLOPs for each planning budget on the Shortest Path task. The accuracy is evaluated on inputs twice the size of the largest training sample. Notably, the training of the model with the adaptive budget proves to be as efficient as the model trained with the smallest constant budget, while outperforming the model with the largest planning budget. A performance drop for very small inputs in (a) and (b) is observed, which we attribute to the use of curriculum learning during training. Toward the end of training, the curriculum shifts its focus to larger samples, potentially causing the model to forget the easier ones.
Figure 5: Effect of Planning Budget on Generalization for Associative Recall Task - The baseline of $p=10$ generalizes well, and the model does not benefit from the larger planning budgets. This aligns with our expectations, considering the simplicity of the task, which can be efficiently solved online.
...and 9 more figures
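
The temperature-based reweighting mentioned in the Figure 2 and Figure 3 captions can be illustrated with a short sketch. The captions do not give the exact formula, so this sketch assumes that reweighting divides the content-addressing logits (the strength $\beta$ times the cosine similarity between the key and each memory cell) by a temperature $\tau$ before the softmax. The function names `cosine` and `content_weights` and the toy memory are hypothetical. Only the value $\tau = 0.65$ and the memory sizes 200 and 1000 come from the Figure 2 caption.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity, with a small epsilon guarding against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def content_weights(
    memory: Sequence[Sequence[float]],
    key: Sequence[float],
    beta: float,
    tau: float = 1.0,  # assumption: tau < 1 sharpens, tau = 1 is no reweighting
) -> List[float]:
    """Softmax over beta * cosine(key, cell) / tau for every memory cell."""
    logits = [beta * cosine(cell, key) / tau for cell in memory]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_memory(size: int) -> List[List[float]]:
    """One cell matching the key [1, 0], all other cells orthogonal to it."""
    return [[1.0, 0.0]] + [[0.0, 1.0]] * (size - 1)


if __name__ == "__main__":
    key = [1.0, 0.0]
    for size, tau in ((200, 1.0), (1000, 1.0), (1000, 0.65)):
        weight = content_weights(toy_memory(size), key, beta=5.0, tau=tau)[0]
        print(f"memory={size}, tau={tau}: weight on matching cell = {weight:.3f}")
```

In this toy setting, enlarging the memory spreads more probability mass over non-matching cells, and a temperature below 1 concentrates it back on the matching cell. This is meant only to convey the intuition behind the captions, not to reproduce the paper's results.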
