Human I/O: Towards a Unified Approach to Detecting Situational Impairments

Xingyu Bruce Liu; Jiahao Nick Li; David Kim; Xiang 'Anthony' Chen; Ruofei Du

TL;DR

The paper addresses the detection of situationally induced impairments and disabilities (SIIDs) by reframing them as the availability of human input/output channels. It proposes Human I/O, a unified pipeline that combines egocentric video and audio, multimodal sensing, and large language model reasoning to predict the availability of the vision, hearing, vocal, and hands channels on a four-level scale. Evaluation on 60 in-the-wild Ego4D clips yields a mean absolute error (MAE) of 0.22 and an accuracy of 82%, with per-channel performance and latency analyses; a 10-participant user study further shows reduced workload and improved experience when SIID-adaptive behaviors are enabled. The work discusses design implications, open-source tooling, and future directions toward multi-device collaboration and richer sensing. Overall, Human I/O offers an extensible framework for detecting SIIDs and informs the development of adaptive UI strategies across daily-life contexts.

Abstract

Situationally Induced Impairments and Disabilities (SIIDs) can significantly hinder user experience in contexts such as poor lighting, noise, and multi-tasking. While prior research has introduced algorithms and systems to address these impairments, they predominantly cater to specific tasks or environments and fail to accommodate the diverse and dynamic nature of SIIDs. We introduce Human I/O, a unified approach to detecting a wide range of SIIDs by gauging the availability of human input/output channels. Leveraging egocentric vision, multimodal sensing, and reasoning with large language models, Human I/O achieves a mean absolute error of 0.22 and an accuracy of 82% in availability prediction across 60 in-the-wild egocentric video recordings in 32 different scenarios. Furthermore, while the core focus of our work is on the detection of SIIDs rather than the creation of adaptive user interfaces, we showcase the efficacy of our prototype via a user study with 10 participants. Findings suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs, paving the way for more adaptive and accessible interactive systems in the future.
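
To make the two reported metrics concrete, the following is a minimal sketch of how a mean absolute error and an accuracy could be computed over four-level availability labels. It assumes the levels are encoded as integers 0 to 3 and that the error is the absolute difference between level indices; the encoding, the function names, and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Average absolute difference between predicted and true level indices."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of predictions that match the true level exactly."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and the same length")
    return sum(p == t for p, t in zip(pred, truth)) / len(pred)


# Hypothetical toy labels (assumed encoding: 0 = most available, 3 = least).
truth = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
pred = [0, 1, 2, 2, 0, 1, 3, 3, 0, 0]

mae = mean_absolute_error(pred, truth)
acc = accuracy(pred, truth)
print(f"MAE={mae:.2f}, accuracy={acc:.0%}")
```

On an ordinal scale the two metrics complement each other: accuracy counts only exact matches, while MAE also reflects how far off a wrong prediction is.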

Paper Structure (58 sections, 2 equations, 10 figures, 2 tables)

Introduction
Related Work
Situationally Aware Computing
Egocentric Vision
Reasoning Capabilities of Large Language Models
Activity and Environmental Sensing
SIIDs as the Availability of Human I/O Channels
Formative Study
Procedure
Findings
Methods to Predict Channel Availability
Scope & Limitations
Levels of Channel Availability
Design Implications
Human I/O System
...and 43 more sections

Figures (10)

Figure 1: Human input/output channels with channels most commonly used in human-computer interaction highlighted in black. We designed and implemented Human I/O based on these channels.
Figure 2: An example brainstorming whiteboard from a participant.
Figure 3: The Human I/O pipeline comprises three components: (1) a camera and microphone capturing the user's egocentric video and audio stream; (2) video and audio data processing using computer vision, NLP, and audio analysis to obtain contextual information, including the user's activity, environment, and direct sensing; and (3) sending contextual information to a large language model with chain-of-thought prompting techniques, predicting channel availability, and incorporating a smoothing algorithm for enhanced system stability.
Figure 4: Examples of using GPT-3 (text-curie-001) to refine raw image caption results from BLIP-2 to get more accurate descriptions of the current activity and environment.
Figure 5: An illustration of our prompt structure leveraging chain-of-thought (CoT, highlighted) to enable LLMs to predict channel availability from the context.
...and 5 more figures
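
The Figure 3 caption mentions a smoothing algorithm that stabilizes per-frame availability predictions, but this page does not describe it. As one plausible illustration only, the sketch below applies a sliding-window majority vote to a stream of integer levels (0 to 3). The technique, the window size, the tie-breaking rule, and the name `smooth_levels` are all assumptions and should not be read as the authors' method.

```python
from collections import Counter, deque
from typing import Deque, Iterable, List


def smooth_levels(levels: Iterable[int], window: int = 5) -> List[int]:
    """Sliding-window majority vote over a stream of availability levels.

    Assumed technique for illustration. Ties are resolved in favor of the
    most recent level among the tied candidates.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[int] = deque(maxlen=window)  # keeps only the last `window` levels
    smoothed: List[int] = []
    for level in levels:
        recent.append(level)
        counts = Counter(recent)
        top = max(counts.values())
        # Walk backwards so the newest tied level wins.
        for candidate in reversed(recent):
            if counts[candidate] == top:
                smoothed.append(candidate)
                break
    return smoothed


# A single-frame spike to level 3 is suppressed with a window of 3.
print(smooth_levels([0, 0, 3, 0, 0], window=3))
```

A larger window suppresses more flicker at the cost of reacting more slowly to genuine changes in the user's situation.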
