DOLOMITES: Domain-Specific Long-Form Methodical Tasks

Chaitanya Malaviya, Priyanka Agrawal, Kuzman Ganchev, Pranesh Srinivasan, Fantine Huot, Jonathan Berant, Mark Yatskar, Dipanjan Das, Mirella Lapata, Chris Alberti

Abstract

Experts in various fields routinely perform methodical writing tasks to plan, organize, and report their work. From a clinician writing a differential diagnosis for a patient to a teacher writing a lesson plan for students, these tasks are pervasive and require the methodical generation of structured long-form output for a given input. We develop a typology of methodical tasks structured in the form of a task objective, procedure, input, and output, and introduce DoLoMiTes, a novel benchmark with specifications for 519 such tasks elicited from hundreds of experts across 25 fields. Our benchmark further contains specific instantiations of methodical tasks with concrete input and output examples (1,857 in total), which we obtain by collecting expert revisions of up to 10 model-generated examples of each task. We use these examples to evaluate contemporary language models, highlighting that automating methodical tasks is a challenging long-form generation problem, as it requires performing complex inferences while drawing upon the given context as well as domain knowledge.

Paper Structure (51 sections, 12 figures, 17 tables)

Figures (12)

  • Figure 1: Dolomites contains descriptions of 519 methodical tasks elicited from domain experts across various fields. We instantiate these tasks with examples that contain plausible inputs and outputs, formulating a challenging long-form generation problem that requires domain expertise and structured problem-solving.
  • Figure 2: A sample of methodical tasks from law, biology and medicine in Dolomites. Each task in Dolomites follows a standard template, containing a task objective, task procedure, additional notes about the task, and finally, input sections that are usually expected for the task, and output sections that need to be produced as part of the task. These tasks are instantiated with examples that represent plausible inputs and outputs for the task (see the section on example collection).
  • Figure 3: We validated the methodical tasks in the Dolomites task collection by consulting an independent group of 3 experts from the field to which each task belongs. Here we show the Likert distributions of their ratings across various axes of importance. The question associated with each axis is listed in the section on task validation.
  • Figure 4: Here, we outline the method for constructing examples of tasks in Dolomites. Using the task objective for a task, we first generate more specific queries to search for relevant web documents, where we constrain our search to authoritative domain names for the task. Using a set of retrieved evidence passages and the complete task description, we then generate an example of the task that fits the task structure using a language model. This example is then post-edited by the same expert who provided the task (further described in the subsection on post-editing).
  • Figure 5: Expert judgements of original examples along three dimensions: task structure followed (whether the example includes all the input and output sections from the task description), level of detail (whether the example shows a detailed and concrete sample of the task) and factual correctness (whether the example is factually correct).
  • ...and 7 more figures
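
The task template described in Figure 2 (objective, procedure, additional notes, input sections, output sections) and the "task structure followed" check from Figure 5 can be illustrated with a small sketch. This is not the authors' code or data format: the class names, field names, the `missing_sections` helper, and the section titles in the usage example are all hypothetical, chosen only to mirror the wording of the captions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MethodicalTask:
    """Hypothetical container mirroring the task template in Figure 2."""
    objective: str
    procedure: str
    additional_notes: str = ""
    input_sections: List[str] = field(default_factory=list)
    output_sections: List[str] = field(default_factory=list)


@dataclass
class TaskExample:
    """Hypothetical instantiation of a task: section title -> section text."""
    inputs: Dict[str, str]
    outputs: Dict[str, str]


def missing_sections(task: MethodicalTask, example: TaskExample) -> Tuple[List[str], List[str]]:
    """Return the input and output sections that the example lacks or leaves empty."""
    missing_in = [s for s in task.input_sections if not example.inputs.get(s, "").strip()]
    missing_out = [s for s in task.output_sections if not example.outputs.get(s, "").strip()]
    return missing_in, missing_out


# Illustrative values only; the section titles are assumptions, not taken from the benchmark.
task = MethodicalTask(
    objective="Write a differential diagnosis for a patient.",
    procedure="Review the presentation, list candidate conditions, and rank them.",
    input_sections=["Patient history", "Symptoms"],
    output_sections=["Candidate diagnoses", "Recommended tests"],
)
example = TaskExample(
    inputs={"Patient history": "Placeholder history.", "Symptoms": "Placeholder symptoms."},
    outputs={"Candidate diagnoses": "Placeholder list."},
)
missing = missing_sections(task, example)
print(missing)
```

Here the example omits one output section, so a structure check of this kind would flag it before any judgement of detail or factual correctness is made.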