In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Understanding

Moucheng Xu; Evangelos Chatzaroulas; Luc McCutcheon; Abdul Ahad; Hamzah Azeem; Janusz Marecki; Ammar Anwar

TL;DR

In-context learning helps video-language models generate more temporally accurate SOPs, and the proposed in-context ensemble learning consistently improves their SOP generation.

Abstract

A Standard Operating Procedure (SOP) is a low-level, step-by-step written guide for a business software workflow. SOP generation is a crucial step towards automating end-to-end software workflows, but creating SOPs manually can be time-consuming. Recent advances in large video-language models offer the potential to automate SOP generation by analyzing recordings of human demonstrations. However, current large video-language models struggle with zero-shot SOP generation. In this work, we first explore in-context learning with video-language models for SOP generation. We then propose an exploration-focused strategy called In-Context Ensemble Learning, which aggregates pseudo labels of multiple possible SOP paths. The proposed in-context ensemble learning also enables the models to learn beyond their context window limit with an implicit consistency regularisation. We report that in-context learning helps video-language models generate more temporally accurate SOPs, and that the proposed in-context ensemble learning consistently enhances the capabilities of video-language models in SOP generation.
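
The abstract's description of aggregating pseudo labels can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ice_predict`, `predict`, `aggregate`, and `batch_size` are hypothetical, the model call and the aggregator are passed in as plain callables, and the three-step flow (zero-shot pseudo labels, one in-context pass per batch, aggregation) is an assumed reading of the abstract. Splitting the examples into batches is what lets the total number of examples exceed what a single context window could hold.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration only: a video is an opaque identifier,
# and an SOP is an ordered list of step descriptions.
Video = str
SOP = List[str]
Example = Tuple[Video, SOP]


def chunk(items: Sequence[Example], size: int) -> List[List[Example]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def ice_predict(
    test_video: Video,
    train_videos: Sequence[Video],
    predict: Callable[[Video, List[Example]], SOP],
    aggregate: Callable[[List[SOP]], SOP],
    batch_size: int,
) -> SOP:
    """Assumed in-context ensemble flow; `predict` and `aggregate` are stand-ins."""
    # Step 1 (assumption): label each training video zero-shot, i.e. with no
    # in-context examples, to obtain pseudo labels.
    pseudo_labelled = [(video, predict(video, [])) for video in train_videos]

    # Step 2 (assumption): run one in-context pass per batch, so that each
    # prompt only has to hold `batch_size` examples.
    candidates = [
        predict(test_video, batch)
        for batch in chunk(pseudo_labelled, batch_size)
    ]

    # Step 3 (assumption): merge the candidate SOPs into a single SOP.
    return aggregate(candidates)
```

In practice `predict` would wrap a call to a video-language model and `aggregate` could be another model call or a rule-based merge; both are left abstract here because the abstract does not specify them.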

Paper Structure (9 sections, 11 figures, 1 table)

Introduction
In-Context Ensemble (ICE) Learning
Related Work
Experiments
Results
Analysis
Conclusion
A successful case
A failed case

Figures (11)

Figure 1: The proposed Multimodal In-Context Ensemble (ICE) learning with pseudo-labels. ICL: in-context learning. SOP: standard operating procedure, detailing chronological step-by-step actions in the video.
Figure 2: Cumulative histograms of test results for ICE with GPT-4o-mini. The red vertical lines mark the 50th percentile of test cases.
Figure 3: Kernel density estimate plots of the number of lines in SOPs. ICE with GPT-4o-mini.
Figure 4: Histogram of the ratio between the number of lines in the ground-truth SOP and the number of frames for each test case.
Figure 5: Violin plots of precision binned by gold SOP length.
...and 6 more figures
