Learning Action Conditions from Instructional Manuals for Instruction Understanding

Te-Lin Wu; Caiqi Zhang; Qingyuan Hu; Alex Spangher; Nanyun Peng

TL;DR

The paper defines and tackles action-condition inference in real-world instructional manuals, highlighting the need to extract preconditions and postconditions to support autonomous and assistive task execution. It builds a densely annotated evaluation dataset from WikiHow and Instructables, and proposes a weakly supervised learning approach that combines linguistic heuristics (entity tracing, keywords, temporal reasoning) with two transformer-based model variants (non-contextualized and contextualized). The study shows that leveraging full instruction context yields substantial improvements over context-free baselines, and that the designed heuristics plus self-training provide additional gains in low-resource settings, though human performance remains higher by a sizable margin. These findings advance procedural text understanding and offer practical resources and directions for end-to-end action-condition extraction and integration of external knowledge to improve real-world instruction comprehension.

Abstract

The ability to infer pre- and postconditions of an action is vital for comprehending complex instructions, and is essential for applications such as autonomous instruction-guided agents and assistive AI that helps humans perform physical tasks. In this work, we propose a task dubbed action condition inference and collect a high-quality, human-annotated dataset of preconditions and postconditions of actions in instructional manuals. We propose a weakly supervised approach to automatically construct large-scale training instances from online instructional manuals, and curate a densely human-annotated and validated dataset to study how well current NLP models can infer action-condition dependencies in the instruction texts. We design two types of models that differ in whether contextualized and global information is leveraged, as well as various combinations of heuristics to construct the weak supervision. Our experimental results show a >20% F1-score improvement when considering the entire instruction context and a >6% F1-score benefit from the proposed heuristics.

Paper Structure (42 sections, 4 figures, 15 tables)

This paper contains 42 sections, 4 figures, 15 tables.

Introduction
Terminologies and Problem Definition
Datasets and Human Annotations
Annotations and Task Specifications
Training With Weak Supervision
Linking Heuristics
Keywords
Key Entity Tracing
Linking Algorithm
Incorporating Temporal Relations
Labelling The Linkages
Models
Non-Contextualized Model
Contextualized Model
Learning
...and 27 more sections

Figures (4)

Figure 1: The Action Condition Inference Task: We propose a task that probes models' ability to infer both preconditions and postconditions of an action from instructional manuals. It has wide applications to e.g. assistive AI and task-solving robots. $^{*}$This instruction is simplified for illustration.
Figure 2: Terminologies: (Left) shows a few exemplar actionables with their associated preconditions and postconditions. Notice that an actionable can have multiple pre- or postconditions, and they can span different instruction steps (for simplicity we do not show an exhaustive set of text segments, and the actual instruction contexts are much longer). (Right) SRL is used to postulate the text segments (actionables and conditions). We show a sample SRL extraction corresponding to one of the dependency linkages on the left. The SRL ARG labels also provide useful information for designing our heuristics (Section \ref{sec:data_heus}).
Figure 3: Model architectures: (a) Non-contextualized model: The model only considers a pair of given text segments. (b) Contextualized model: The model takes the whole instruction paragraphs (i.e. contexts) and wraps each text segment with our special tokens (<a>), where each segment representation is obtained by averaging its token representations. The ordered, concatenated segment representations are then fed into an MLP to make the final predictions.
Figure 4: MTurk Annotation User Interface: (a) We ask workers to follow the indicated instruction. All the blue text bars at the top of the page are expandable. Workers can click to expand them for detailed instructions on the annotation task. (b) The annotation task is designed for intuitive click/select-then-link usage, followed by a few additional questions such as confidence level and feedback (this example is taken from the WikiHow dataset). The text segments highlighted in grey are postulated by the SRL; a segment turns yellow when it is selected or hovered over with the cursor. Notice that, for better illustration, the directions of the links in our paper are opposite to those in the annotation process.
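
The segment pooling described in the Figure 3 caption for the contextualized model can be illustrated with a minimal sketch. It is not the authors' implementation: the transformer encoder is omitted, the token vectors are hand-written stand-ins for encoder outputs, and the names `mean_pool`, `pair_features`, and `mlp_scores`, the span convention (start inclusive, end exclusive), the single hidden layer, and the weights are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Span = Tuple[int, int]  # token positions: start inclusive, end exclusive (assumed convention)


def mean_pool(token_vecs: Sequence[Vector], span: Span) -> Vector:
    """Average the token vectors inside one text segment."""
    start, end = span
    if not 0 <= start < end <= len(token_vecs):
        raise ValueError(f"invalid span {span}")
    dim = len(token_vecs[0])
    count = end - start
    return [sum(token_vecs[t][d] for t in range(start, end)) / count
            for d in range(dim)]


def pair_features(token_vecs: Sequence[Vector], first: Span, second: Span) -> Vector:
    """Concatenate the two pooled segment vectors in the given order."""
    return mean_pool(token_vecs, first) + mean_pool(token_vecs, second)


def mlp_scores(features: Vector,
               w1: Sequence[Vector], b1: Vector,
               w2: Sequence[Vector], b2: Vector) -> Vector:
    """One hidden ReLU layer followed by a linear output layer (hypothetical sizes)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


# Stand-in "encoder outputs" for a four-token context with 2-dimensional vectors.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
features = pair_features(tokens, (0, 2), (2, 4))

# Toy weights, chosen only so the arithmetic is easy to follow.
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 1.0], [0.0, 1.0]]
b2 = [0.5, 0.0]
scores = mlp_scores(features, w1, b1, w2, b2)
```

Because the concatenation is ordered, swapping the two spans gives a different feature vector, which matters when the relation between an actionable and a condition is directional. In a real setup the token vectors would come from encoding the whole instruction context with the special tokens inserted around each segment, and the output size would match whatever label set is being predicted.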
