Unsupervised Learning of Graph from Recipes
Aissatou Diallo, Antonis Bikakis, Luke Dickens, Anthony Hunter, Rob Miller
TL;DR
This work tackles unsupervised procedural understanding by converting cooking recipes into graphs that encode actions, ingredients, and locations to enable reasoning about sequences. It introduces a self-supervised pipeline with a text-to-graph component (Entity Identifier and Graph Structure Encoder) and a graph-to-text component (Transformer-based Decoder), trained via decoding graphs back into text and optimizing a joint loss $\mathcal{L}_{tot}= \mathcal{L}_{gse} + \mathcal{L}_{gen} + \lambda\|A\|_1$. A key innovation is the continuous relaxation of adjacency via a Sinkhorn-based sparsification, producing sparse, discrete-like graphs while learning node embeddings from a cooking-domain prior. The approach also includes a Recurrent Graph Embedding to capture temporal progression, enabling the model to build graphs incrementally as the recipe unfolds. Empirical results on Now You're Cooking and the English Flow Corpus demonstrate strong entity identification and competitive text↔graph performance, highlighting the potential of unsupervised graph learning for procedural knowledge extraction and reasoning in automated agents.
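As a rough illustration of the two quantities named above, the following NumPy sketch shows a generic Sinkhorn normalization of a score matrix and the joint loss $\mathcal{L}_{tot}= \mathcal{L}_{gse} + \mathcal{L}_{gen} + \lambda\|A\|_1$. It is not the authors' implementation. The function names, the `temperature` and `n_iters` parameters, and the reading of $\|A\|_1$ as the entrywise sum of absolute values are all assumptions made for this example.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis, keeping dimensions.
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Relax a square score matrix towards a doubly stochastic matrix.

    Rows and columns are normalized alternately in log space. A lower
    (hypothetical) temperature pushes the result towards a sparser,
    more permutation-like matrix.
    """
    log_a = np.asarray(scores, dtype=float) / temperature
    for _ in range(n_iters):
        log_a = log_a - _logsumexp(log_a, axis=1)  # rows sum to 1
        log_a = log_a - _logsumexp(log_a, axis=0)  # columns sum to 1
    return np.exp(log_a)


def total_loss(loss_gse, loss_gen, adjacency, lam):
    """Joint loss L_gse + L_gen + lam * ||A||_1.

    Assumption: ||A||_1 is taken entrywise (sum of absolute values).
    """
    return loss_gse + loss_gen + lam * np.abs(adjacency).sum()


# Example: relax random scores for a graph with four nodes.
rng = np.random.default_rng(0)
A = sinkhorn(rng.normal(size=(4, 4)), n_iters=50, temperature=0.5)
loss = total_loss(loss_gse=1.0, loss_gen=2.0, adjacency=A, lam=0.1)
```

In this sketch the L1 penalty and the temperature both encourage sparsity: the penalty shrinks edge weights during training, while a lower temperature sharpens the normalized matrix.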
Abstract
Cooking recipes are one of the most readily available kinds of procedural text. They consist of natural language instructions that can be challenging to interpret. In this paper, we propose a model that identifies the relevant information in a recipe and generates a graph representing its sequence of actions. In contrast with other approaches, ours is unsupervised. We iteratively learn the graph structure and the parameters of a $\mathsf{GNN}$ that encodes the text (text-to-graph) one sequence at a time, with supervision provided by decoding the graph back into text (graph-to-text) and comparing the generated text to the input. We evaluate the approach by comparing the identified entities with annotated datasets, measuring the difference between the input and output texts, and comparing our generated graphs with those produced by state-of-the-art methods.
