iWISDM: Assessing instruction following in multimodal models at scale

Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan

TL;DR

iWISDM addresses the need to evaluate instruction-following in multimodal tasks by introducing a scalable environment that procedurally generates vision-language tasks from compositional graphs. The methodology combines a 3-phase task-graph design with AutoTask parameterization to create a vast, temporally compositional task space. Three benchmarks at Low, Medium, and High complexity reveal a consistent gap between state-of-the-art LMMs and human performance in following multi-image instructions. The work provides a framework and starting point for continual learning and compositional generalization evaluation, with potential extensions like new operators and a public leaderboard.

Abstract

The ability to perform complex tasks from detailed instructions is key to many remarkable achievements of our species. As humans, we are capable of performing not only a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment, engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions and that of humans. The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.

Paper Structure (26 sections, 1 equation, 14 figures, 1 table)

This paper contains 26 sections, 1 equation, 14 figures, 1 table.

Figures (14)

  • Figure 1: Illustration of a coffee-making task as an example of a real-world, compositionally constructed task. a) Cartoon depiction of a kitchen with typical objects. The dialogue boxes highlight the relevant properties of the grinder, coffee bag, and boiler; b) Pseudocode detailing the coffee-making task, encompassing three subtasks: obtaining coffee powder, brewing the coffee, and returning the coffee; c) Graphical representation of the three subtasks as computational graphs. Blue text highlights the properties associated with each operator; d) An abstraction of the real-life coffee-making task as a task in iWISDM. Frame sequences are generated using the task graphs illustrated in c; e) Graphic depiction of object properties actively in use (top) and subtasks (bottom). Coloured bars in the upper panel signify the persistence of each object property during the execution of the coffee-making task. The lower panel depicts the interrelations between subtasks and various temporal compositional operations.
  • Figure 2: Illustration of various temporal compositional structures used in iWISDM and their application in instantiating classic decision-making tasks. a) Four distinct temporal composition operations: Queue, where one task is completed before starting the next; Overlap, where two or more tasks share common information; Interleave, where information related to different tasks is acquired in an interwoven order; and Condition, where the execution of a subsequent task depends on the outcome of the preceding one. An example compositional task consisting of multiple temporal relationships is shown. Frames are coloured differently to highlight distinct tasks, with multicoloured frames indicating shared information across tasks; b) Three classic cognitive tasks are exemplified: Delayed Match to Sample, 2-Back, and Contextual Decision Making.
  • Figure 3: Description of operators and task graph examples in the core build of iWISDM. a) Two main types of operators are considered. Functional Operators: Select operators retrieve stimuli according to specified attributes such as time and location; Switch operators accept boolean values and direct the task logic to one of two different subtasks; GetAttribute operators take stimuli as input and output the corresponding attribute values of those stimuli. Boolean Operators comprise Exist, And, Or, IsSame, and NotSame. These operators take boolean values as inputs and produce another boolean value based on their boolean logic. b-d) Using the operators defined in panel a, we create tasks from pre-specified rules. Panels b, c, and d display the task graphs for the three typical cognitive tasks illustrated in Figure 2b, demonstrating how these operators are interconnected to define the structures of these specific tasks. For instance, the task instructions for panels b and d are, respectively: "category of object 1 equals category of object 2?" and "if category of object 1 equals category of object 3, then category of object 2 equals category of object 3, else category of object 2 equals category of object 4?". The task instruction for panel c is similar to that of panel b but repeated across time.
  • Figure 4: a) Accuracy of LMMs and humans on iWISDM benchmarks of varying complexity and prompt type. b-e) Average accuracy of applicable models on single-frame, low, medium, and high complexity tasks for each feature type. Chance level represents the baseline accuracy of randomly guessing one of the applicable actions.
  • Figure A1: Task graph initialization and modification in iWISDM. This figure demonstrates the process iWISDM follows to initialize and modify a set of task graphs, using two examples: Graph 1 and Graph 2. a) Backward Initialization: Step 1: Identify all operators linked to explicit responses (e.g. the IsSame operator in Graph 1 and the Get operators in Graph 2) or implicit responses (where agents decide without a direct response, typically preceding the Switch operator, like the Exist operator in Graph 2), and assign responses to each of these operators, manually balancing output actions. Step 2: Propagate properties to downstream operators. For example, if the response to IsSame is True, then the children GetCategory operators must have the same output. The Select operator usually receives one property from upstream operators, such as the category Desk in Graph 2's bottom Select. In this example, "when" and "where" properties are randomly sampled and marked with asterisks. To illustrate instruction generation, the operator partial strings are shown as "str: ", and the blanks are filled by its children operators, following the direction of the arrows. During temporal composition, however, backward initialization can create conflicts between graphs, which the Forward Modification phase resolves. b) Forward Modification: Step 1: For each graph, gather properties from each Select operator and compare those with matching "when" to identify conflicts. For instance, a conflict is found in the "what" property at frame 4: Graph 1's Select assigns "Desk" while Graph 2's assigns "Plane". Conflicts can also occur with the "where" property. We modify the Select in new task graphs to minimize upstream modifications. Step 2: Adjust the "what" property in Graph 2 to "Desk", aligning with the Select operator in Graph 1. Step 3: Post-modification, it may be necessary to update actions or properties of upstream operators. In this scenario, tracing back to the GetLocation operator in Graph 2 shows that no changes are required for the "where" property, as it remains consistent with the Select operator's assignment.
  • ...and 9 more figures
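
The operator composition described in the Figure 3 caption can be illustrated with a small Python sketch of the Delayed Match to Sample instruction "category of object 1 equals category of object 2?". This is not the iWISDM API: the class names mirror the operator names in the caption, but their constructors, the `Stimulus` type, and the list-of-frames trial representation are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stimulus:
    """One object shown in a frame (hypothetical structure, not iWISDM's)."""
    category: str
    location: str


class Select:
    """Retrieve the stimulus shown at a given frame index (the "when" property)."""

    def __init__(self, when: int):
        self.when = when

    def __call__(self, frames: list[Optional[Stimulus]]) -> Stimulus:
        stimulus = frames[self.when]
        if stimulus is None:
            raise ValueError(f"frame {self.when} is a delay frame with no stimulus")
        return stimulus


class GetCategory:
    """GetAttribute-style operator: output the category of the child's stimulus."""

    def __init__(self, child: Select):
        self.child = child

    def __call__(self, frames: list[Optional[Stimulus]]) -> str:
        return self.child(frames).category


class IsSame:
    """Boolean operator: True when both children produce the same value."""

    def __init__(self, left: GetCategory, right: GetCategory):
        self.left = left
        self.right = right

    def __call__(self, frames: list[Optional[Stimulus]]) -> bool:
        return self.left(frames) == self.right(frames)


def delayed_match_to_sample(sample_frame: int, test_frame: int) -> IsSame:
    """Build the graph IsSame(GetCategory(Select), GetCategory(Select))."""
    return IsSame(GetCategory(Select(sample_frame)), GetCategory(Select(test_frame)))


# Assumed trial layout: sample frame, one delay frame (None), test frame.
task = delayed_match_to_sample(sample_frame=0, test_frame=2)
match_trial = [Stimulus("desk", "top left"), None, Stimulus("desk", "bottom right")]
nonmatch_trial = [Stimulus("desk", "top left"), None, Stimulus("plane", "top left")]

print(task(match_trial))     # True
print(task(nonmatch_trial))  # False
```

Evaluating the root operator walks the graph from the response node down to the Select nodes, which is why the same three operator types can be rewired into other tasks simply by changing which frames the Select nodes point to.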