User interfaces, interaction design, accessibility, and social computing
LLM-powered computer-use agents (CUAs) are shifting users from direct manipulation to supervisory coordination. Existing oversight mechanisms, however, have largely been studied as isolated interface features, making broader oversight strategies difficult to compare. We conceptualize CUA oversight as a structural coordination problem defined by delegation structure and engagement level, and use this lens to compare four oversight strategies in a mixed-methods study with 48 participants in a live web environment. Our results show that oversight strategy more reliably shaped users' exposure to problematic actions than their ability to correct them once visible. Plan-based strategies were associated with lower rates of agent problematic-action occurrence, but not equally strong gains in runtime intervention success once such actions became visible. On subjective measures, no single strategy was uniformly best, and the clearest context-sensitive differences appeared in trust. Qualitative findings further suggest that intervention depended not only on what controls users retained, but on whether risky moments became legible as requiring judgment during execution. These findings suggest that effective CUA oversight is not achieved by maximizing human involvement alone. Instead, it depends on how supervision is structured to surface decision-critical moments and support their recognition in time for meaningful intervention.
Wearable devices increasingly support stress detection, while LLMs enable conversational mental health support. However, designing systems that meaningfully connect wearable-triggered stress events with generative dialogue remains underexplored, particularly from a design perspective. We present EmBot, a functional mobile application that combines wearable-triggered stress detection with LLM-based conversational support for daily stress management. We used EmBot as a design probe in semi-structured interviews with 15 mental health experts to examine their perspectives and surface early design tensions and considerations that arise from wearable-triggered conversational support, informing the future design of such systems for daily stress management and mental health support.
Board games have shown promise as educational tools, but their use in engaging learners with the complex, long-term trade-offs of forest management remains strikingly underdeveloped. Addressing this gap, we investigate how forest growth simulation data can inform decision-making through information visualization and gameplay mechanics. We designed a serious game, SIMA-Play, that enables players to make informed forest management decisions under dynamic environmental and market conditions, simulating forest growth over time and comparing player performance across economic and sustainability outcomes. By visualizing feedback on players' choices at the end of the game, SIMA-Play supports systems thinking and makes the trade-offs in forestry practices easier to understand and discuss. The study concludes with a research roadmap that outlines future experiments, longitudinal studies, and digital versions of SIMA-Play to assess its long-term effects on learning and engagement.
Public attitudes toward artificial intelligence (AI) and driving safety are typically studied in isolation using variable-centered methods that assume population homogeneity, yet risk perception theory predicts that these evaluations covary within individuals as expressions of underlying worldviews. This study identifies latent profiles of AI risk perception among U.S. adults and tests whether these profiles are differentially associated with community driving safety concerns. Latent class analysis was applied to nine AI risk-perception indicators from a nationally representative survey (Pew Research Center American Trends Panel Wave 152, n = 5,255); Bolck-Croon-Hagenaars corrected distal outcome analysis tested class differences on nine driving-safety outcomes, and survey-weighted multinomial logistic regression identified demographic and ideological predictors of class membership. Four classes emerged: Moderate Skeptics (17.5%), Concerned Pragmatists (42.8%), AI Ambivalent (10.6%), and Extreme Alarm (29.1%), with all nine driving-safety outcomes significantly differentiated across classes. Higher AI concern mapped monotonically onto greater perceived driving-hazard severity; the exception, comparative evaluation of AI versus human driving, was driven by trust rather than concern level. The cross-domain covariation provides person-level evidence for the worldview-based risk structuring posited by Cultural Theory of Risk and yields a four-class segmentation framework for AV communication that links AI risk orientation to transportation safety attitudes.
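A minimal sketch of the two quantitative steps named above, under illustrative assumptions: synthetic data stands in for the Pew ATP Wave 152 indicators, a Gaussian mixture stands in for a categorical latent class model when enumerating classes by BIC, and a survey-weighted multinomial logistic regression predicts class membership; the BCH-corrected distal outcome step is not reproduced.

```python
# Illustrative class-enumeration and class-membership-prediction sketch.
# X: nine AI risk-perception indicators; Z: demographic/ideological predictors;
# w: survey weights. All data here are synthetic placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.integers(1, 5, size=(n, 9)).astype(float)   # nine ordinal risk-perception indicators
Z = rng.normal(size=(n, 4))                          # demographic/ideological predictors
w = rng.uniform(0.5, 2.0, size=n)                    # survey weights

# 1) Enumerate candidate class solutions and keep the one with the lowest BIC.
fits = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(2, 7)}
best_k = min(fits, key=lambda k: fits[k].bic(X))
classes = fits[best_k].predict(X)

# 2) Survey-weighted multinomial logistic regression on class membership.
clf = LogisticRegression(max_iter=1000)
clf.fit(Z, classes, sample_weight=w)
print(f"selected {best_k} classes; coefficient matrix shape {clf.coef_.shape}")
```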
Surface electromyography (sEMG) sensors are widely used in human-computer interaction, yet the failure of a single sensor can compromise system usability. We propose a methodological framework for implementing a fail-safe mechanism in multi-sensor sEMG systems. Using arm sEMG recordings of rock-paper-scissors gestures, we extracted hand-crafted features and quantified class separability via the maximum Fisher discriminant ratio (FDR). A multi-layer perceptron classifier validated the approach, yielding results consistent with prior findings and physiological evidence. Systematic sensor ablations and FDR analysis produced a ranking of crucial versus replaceable sensors. This ranking informs robust device design, sensor redundancy, and reliability in clinical and practical applications.
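A minimal sketch of the FDR-based sensor ranking described above, under illustrative assumptions: `features` holds hand-crafted features per sensor, `labels` the three gesture classes, and the FDR for one feature and one class pair is taken as (mu_a - mu_b)^2 / (var_a + var_b); data and array shapes are hypothetical.

```python
# Rank sensors by the best class separability any of their features achieves.
import numpy as np
from itertools import combinations

def max_fdr(feat_1d, labels):
    # Maximum Fisher discriminant ratio over all class pairs for one feature.
    scores = []
    for a, b in combinations(np.unique(labels), 2):
        xa, xb = feat_1d[labels == a], feat_1d[labels == b]
        scores.append((xa.mean() - xb.mean()) ** 2 / (xa.var() + xb.var() + 1e-12))
    return max(scores)

def sensor_importance(features, labels):
    # features: (n_samples, n_sensors, n_features_per_sensor)
    return {
        s: max(max_fdr(features[:, s, f], labels) for f in range(features.shape[2]))
        for s in range(features.shape[1])
    }

# Example with synthetic data: 300 trials, 8 sensors, 4 features each, 3 gestures.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 8, 4))
labels = rng.integers(0, 3, size=300)
ranking = sorted(sensor_importance(features, labels).items(), key=lambda kv: -kv[1])
print("sensors ranked from most to least crucial:", [s for s, _ in ranking])
```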
Large language models (LLMs) produce systematically misleading outputs, from hallucinated citations to strategic deception of evaluators, yet these phenomena are studied by separate communities with incompatible terminology. We propose a unified taxonomy organized along three complementary dimensions: degree of goal-directedness (behavioral to strategic deception), object of deception, and mechanism (fabrication, omission, or pragmatic distortion). Applying this taxonomy to 50 existing benchmarks reveals that every benchmark tests fabrication while pragmatic distortion, attribution, and capability self-knowledge remain critically under-covered, and strategic deception benchmarks are nascent. We offer concrete recommendations for developers and regulators, including a minimal reporting template for positioning future work within our framework.
Large language models are increasingly deployed as autonomous agents in multi-agent settings where they communicate intentions and take consequential actions with limited human oversight. A critical safety question is whether agents that publicly commit to actions break those promises when they can privately deviate, and what the consequences are for both themselves and the collective. We study deception as a deviation from a publicly announced action in one-shot normal-form games, classifying each deviation by its effect on individual payoff and collective welfare into four categories: win-win, selfish, altruistic, and sabotaging. By exhaustively enumerating announcement profiles across six canonical games, nine frontier models, and varying group sizes, we identify all opportunities for each deviation type and measure how often agents exploit them. Across all settings, agents deviate from promises in approximately 56.6% of scenarios, but the character of deception varies substantially across models even at similar overall rates. Most critically, for the majority of models, promise-breaking occurs without any verbalized awareness that a promise is being broken.
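A minimal sketch of the four-way deviation taxonomy described above, under the assumed reading that each deviation is classified by comparing the deviating agent's own payoff and the group's total welfare against the counterfactual where it keeps its announced action; the handling of ties is an assumption for illustration.

```python
# Classify a promise-breaking deviation by its individual and collective effects.
def classify_deviation(own_payoff_deviate, own_payoff_keep,
                       welfare_deviate, welfare_keep):
    gains_self = own_payoff_deviate > own_payoff_keep
    gains_group = welfare_deviate > welfare_keep
    if gains_self and gains_group:
        return "win-win"
    if gains_self and not gains_group:
        return "selfish"
    if not gains_self and gains_group:
        return "altruistic"
    return "sabotaging"

# Example: breaking the promise raises the agent's payoff but lowers total welfare.
print(classify_deviation(own_payoff_deviate=5, own_payoff_keep=3,
                         welfare_deviate=6, welfare_keep=8))  # -> "selfish"
```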
Road traffic crashes claim approximately 1.19 million lives annually worldwide, and human error accounts for the vast majority, yet the autonomous vehicle acceptance literature models adoption almost exclusively through technology-centered pull factors such as perceived usefulness and trust. This study examines a moderated mediation model in which perceived community driving-safety concern (PCSC) predicts evaluations of AI versus human driving capability, mediated by Generalized AI Orientation and moderated by personal driving frequency. Weighted structural equation modeling is applied to a nationally representative U.S. probability sample from Pew Research Center's American Trends Panel Wave 152, using Weighted Least Squares Mean and Variance Adjusted (WLSMV)-estimated confirmatory factor analysis on ordinal indicators, bias-corrected bootstrap inference, and seven robustness checks including Imai sensitivity analysis, E-value confounding thresholds, and propensity score matching. Results reveal a dual-pathway mechanism constituting an inconsistent mediation: PCSC exerts a small positive direct effect on AI driving evaluation, consistent with a domain-specific push interpretation, while simultaneously suppressing Generalized AI Orientation, which is itself a strong positive predictor of AI driving evaluation. Conditional indirect effects are negative and statistically significant at low, mean, and high levels of driving frequency. These findings establish a risk-spillover mechanism whereby community driving-safety concern promotes domain-specific AI endorsement yet suppresses domain-general AI enthusiasm, yielding a near-zero net total effect.
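A minimal sketch of the conditional indirect effect described above, assuming a standard first-stage moderated mediation (PCSC -> Generalized AI Orientation, moderated by driving frequency -> AI driving evaluation) estimated by least squares with a percentile bootstrap; variable names and data are synthetic placeholders, and the paper's WLSMV/CFA machinery is not reproduced.

```python
# Bootstrap a conditional indirect effect (a1 + a3*W) * b1 at a chosen moderator level.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
pcsc = rng.normal(size=n)
freq = rng.normal(size=n)
orientation = -0.3 * pcsc + 0.05 * pcsc * freq + rng.normal(size=n)   # suppression path
evaluation = 0.1 * pcsc + 0.5 * orientation + rng.normal(size=n)

def conditional_indirect(idx, w_level):
    # a-path model with interaction, then b-path model, fit on a bootstrap resample.
    Xa = np.column_stack([np.ones(len(idx)), pcsc[idx], freq[idx], pcsc[idx] * freq[idx]])
    a = np.linalg.lstsq(Xa, orientation[idx], rcond=None)[0]
    Xb = np.column_stack([np.ones(len(idx)), orientation[idx], pcsc[idx], freq[idx]])
    b = np.linalg.lstsq(Xb, evaluation[idx], rcond=None)[0]
    return (a[1] + a[3] * w_level) * b[1]

boots = [conditional_indirect(rng.integers(0, n, n), w_level=1.0) for _ in range(500)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"conditional indirect effect at high driving frequency: 95% CI [{lo:.3f}, {hi:.3f}]")
```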
Generative AI (GenAI) combined with Extended Reality (XR) offers potential for K-12 education, yet classroom adoption remains limited by the high technical barrier of XR content authoring. Moreover, the probabilistic nature of GenAI introduces risks of hallucination that may have severe consequences in K-12 education settings. In this work, we present a multi-agent XR authoring framework. Our prototype system coordinates four specialized agents: a Pedagogical Agent outlining grade-appropriate content specifications with learning objectives; an Execution Agent assembling 3D assets and XR content; a Safeguard Agent validating generated content against five safety criteria; and a Tutor Agent embedding educational notes and quiz questions within the scene. Our teacher-facing system combines pedagogical intent, safety validation, and educational enrichment. It does not require technical expertise and targets commodity devices.
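A minimal sketch of the four-agent pipeline described above, with hypothetical placeholder functions standing in for the LLM-backed agents; the names, return types, and single placeholder safety check are illustrative, not the prototype's actual API.

```python
# Sequential hand-off: Pedagogical -> Execution -> Safeguard -> Tutor.
from dataclasses import dataclass, field

@dataclass
class XRScene:
    spec: dict
    assets: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    approved: bool = False

def pedagogical_agent(topic, grade):
    # Grade-appropriate content specification with learning objectives.
    return {"topic": topic, "grade": grade, "objectives": ["identify", "compare"]}

def execution_agent(spec):
    # Assemble 3D assets and XR content for the specification.
    return XRScene(spec=spec, assets=[f"3d:{spec['topic']}"])

def safeguard_agent(scene):
    # Validate generated content; only approved scenes reach students.
    scene.approved = all(check(scene) for check in SAFETY_CHECKS)
    return scene

def tutor_agent(scene):
    # Embed educational notes and quiz questions in the scene.
    scene.notes.append({"quiz": f"What did you learn about {scene.spec['topic']}?"})
    return scene

SAFETY_CHECKS = [lambda s: bool(s.assets)]   # placeholder for the five safety criteria

scene = tutor_agent(safeguard_agent(execution_agent(pedagogical_agent("volcanoes", grade=5))))
print(scene.approved, scene.notes)
```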
Large language models (LLMs) are bringing richer dialogue and social behavior into games, but they also expose a control problem that existing game interfaces do not directly address: how should LLM characters participate in live multiplayer interaction while remaining executable in the shared game world, socially coherent with other active characters, and steerable by players when needed? We frame this problem as bounded autonomy, a control architecture for live multiplayer games that organizes LLM character control around three interfaces: agent-agent interaction, agent-world action execution, and player-agent steering. We instantiate bounded autonomy with probabilistic reply-chain decay, an embedding-based action grounding pipeline with fallback, and whisper, a lightweight soft-steering technique that lets players influence a character's next move without fully overriding autonomy. We deploy this architecture in a live multiplayer social game and study its behavior through analyses of interaction stability, grounding quality, whisper intervention success, and formative interviews. Our results show how bounded autonomy makes LLM character interaction workable in practice, frames controllability as a distinct runtime control problem for LLM characters in live multiplayer games, and provides a concrete exemplar for future games built around this interaction paradigm.
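A minimal sketch of two of the mechanisms named above, under illustrative assumptions: (1) reply-chain decay, where the probability that a character replies again falls with chain depth, and (2) embedding-based action grounding with a fallback action when no executable game action is similar enough. The decay form, threshold, and toy embeddings are placeholders, not the deployed system's values.

```python
import random
import numpy as np

def should_reply(chain_depth, base_p=0.9, decay=0.6):
    # Probability of continuing the reply chain decays geometrically with depth.
    return random.random() < base_p * (decay ** chain_depth)

def ground_action(intent_vec, action_vecs, action_names, fallback="idle", threshold=0.75):
    # Map a free-text intent embedding to the most similar executable game action,
    # falling back to a safe default when nothing is similar enough.
    sims = action_vecs @ intent_vec / (
        np.linalg.norm(action_vecs, axis=1) * np.linalg.norm(intent_vec) + 1e-12)
    best = int(np.argmax(sims))
    return action_names[best] if sims[best] >= threshold else fallback

# Toy example with one-hot "embeddings" for three executable actions.
actions = ["open_door", "greet_player", "sit_down"]
action_vecs = np.eye(3)
print(ground_action(np.array([0.1, 0.95, 0.0]), action_vecs, actions))  # -> "greet_player"
print(should_reply(chain_depth=3))  # likely False: deep chains decay toward silence
```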
Empathy has been discussed as a relevant human capability in software engineering, particularly in activities that require understanding users, stakeholders, and the societal implications of technological systems. This relevance becomes more pronounced in the context of artificial intelligence, where software increasingly participates in decisions that affect diverse individuals and communities. However, limited guidance exists on how empathy can be integrated into technical software engineering education in ways that connect with the development of AI-enabled systems. This study investigates teaching practices that educators use to incorporate empathy into software engineering courses. Using qualitative analysis of educator-reported practices, we identified five categories through which empathy is operationalized within technical coursework: societal framing of AI systems, fairness and accessibility considerations in design and evaluation, representation of diverse users, stakeholder role awareness and responsibility, and structured reflection and feedback during development processes. The findings indicate that empathy can be embedded within core development activities rather than taught as a separate topic, enabling students to reason about bias, accessibility, accountability, and the societal consequences of AI technologies. These results contribute a structured view of how empathy-oriented practices can be incorporated into software engineering education to support the preparation of students who will develop AI-enabled systems.
Community Health Workers (CHWs) play a critical role in delivering primary healthcare services in low-resource settings, yet sustaining their training and performance remains a persistent challenge. Prior research has explored digital and game-based approaches for CHW training. However, limited work has synthesized longitudinal design insights into generalizable guidelines for interactive health interventions. Building on a four-year design-based research program involving multiple game-based refresher training systems, including quiz-based mobile apps, physical and augmented reality games, card-based games, and location-based games, we examine which design guidelines support sustained engagement, learning transfer, and contextual appropriateness in CHW training. We conducted a mixed-methods analysis across deployments with Accredited Social Health Activists and Anganwadi Workers in India, including interviews, field observations, and usage logs. Through thematic synthesis, we derive eight design guidelines addressing contextual realism, adaptive learning, hybrid interaction, social motivation, explainability, professional identity, and ethical considerations. Our findings contribute actionable design knowledge for researchers and practitioners developing interactive health interventions in low-resource healthcare contexts.
Digital health technologies are increasingly used to improve healthcare access and delivery worldwide. However, many healthcare applications are designed for environments with stable infrastructure, high digital literacy, and strong institutional support. These assumptions often do not hold in low-resource contexts where healthcare delivery often depends on community health workers, caregivers, and informal care networks. Designing effective healthcare applications for such environments requires attention to infrastructural constraints, cultural contexts, language diversity, and usability challenges. This Birds of a Feather session aims to bring together researchers, designers, and practitioners interested in healthcare application design in low-resource contexts. The session will provide an informal forum for discussing challenges encountered in the design and deployment of digital health technologies in underserved settings, sharing field experiences, and identifying opportunities for collaboration within the Interactive Health (IH) community.
Thumb gestures provide an effective and unobtrusive input modality for wearable and always-available human-machine interaction. Wrist-worn surface electromyography (sEMG) has emerged as a promising approach for compact and wearable human-machine interfaces. However, compared to forearm sEMG, the impact of electrode configuration on wrist-based decoding performance remains understudied. We systematically investigated electrode configuration strategies for wrist-based thumb-movement recognition using high-density (HD) and low-density (LD) sEMG measurement systems. We considered factors such as muscle region, reference scheme, channel count, and electrode spatial density. Experimental results show that 1) extensor-side electrodes outperform flexor-side electrodes (HD: 0.871 vs. 0.821; LD: 0.769 vs. 0.705); 2) monopolar recordings consistently outperform bipolar configurations (15-channel HD monopolar vs. LD bipolar: 0.885 vs. 0.823); and 3) increasing channel count enhances performance but exhibits diminishing returns. We further show that electrode spatial distribution introduces a trade-off between spatial coverage and compactness. The findings suggest that the effectiveness of wrist-worn sEMG systems depends less on deploying a large number of electrodes over a broad sensing area and more on optimizing electrode placement and the referencing scheme. This work provides practical guidelines for developing efficient wrist-worn sEMG-based gesture recognition systems.
AI agents - i.e. AI systems that autonomously plan, invoke external tools, and execute multi-step action chains with reduced human involvement - are being deployed at scale across enterprise functions ranging from customer service and recruitment to clinical decision support and critical infrastructure management. The EU AI Act (Regulation 2024/1689) regulates these systems through a risk-based framework, but it does not operate in isolation: providers face simultaneous obligations under the GDPR, the Cyber Resilience Act, the Digital Services Act, the Data Act, the Data Governance Act, sector-specific legislation, the NIS2 Directive, and the revised Product Liability Directive. This paper provides the first systematic regulatory mapping for AI agent providers integrating (a) draft harmonised standards under Standardisation Request M/613 to CEN/CENELEC JTC 21 as of January 2026, (b) the GPAI Code of Practice published in July 2025, (c) the CRA harmonised standards programme under Mandate M/606 accepted in April 2025, and (d) the Digital Omnibus proposals of November 2025. We present a practical taxonomy of nine agent deployment categories mapping concrete actions to regulatory triggers, identify agent-specific compliance challenges in cybersecurity, human oversight, transparency across multi-party action chains, and runtime behavioral drift. We propose a twelve-step compliance architecture and a regulatory trigger mapping connecting agent actions to applicable legislation. We conclude that high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the AI Act's essential requirements, and that the provider's foundational compliance task is an exhaustive inventory of the agent's external actions, data flows, connected systems, and affected persons.
What makes a public talk resonate with large audiences? While prior research has emphasized speaker delivery or topic novelty, we reasoned that a core driver of engagement is linguistic clarity. This aligns with theories of processing fluency and cognitive load, which posit that audiences reward speakers who present complex ideas accessibly. We leveraged artificial intelligence to analyze 1,239 TED Talk transcripts (2006--2013), supplemented by a later-phase longitudinal sample. Each transcript was evaluated across 50 independent large language model runs on two dimensions, clarity of explanation and structural organization, and linked to YouTube engagement metrics (likes and views). Clarity emerged as the strongest predictor of audience responses ($\beta = .339$ for likes; $\beta = .314$ for views), contributing substantial incremental variance ($\Delta R^{2} \approx .095$) beyond duration, topic, and scientific status. The full model explained 29\% of variance in likes and 22.5\% in views. This effect was domain-general, remaining invariant across content categories and between scientific and non-scientific talks. Notably, clarity outperformed traditional readability metrics, indicating that discourse coherence predicts engagement more powerfully than surface-level linguistic simplicity. Longitudinal analyses further revealed standardization within TED, characterized by increasing clarity and reduced variability over time. Theoretically, these results support processing fluency accounts: clearer communication reduces cognitive friction and elicits more positive evaluative responses. Practically, transcript-based clarity represents a scalable and trainable strategy for improving public discourse. By demonstrating that language models can reliably capture latent communicative qualities, this study paves the way for feedback systems in education, science communication, and public speaking.
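A minimal sketch of the incremental-variance check described above: fit a controls-only regression, add the clarity score (averaged over repeated LLM ratings), and report the change in R^2. The data, control set, and outcome transformation are illustrative placeholders, not the paper's dataset.

```python
# Hierarchical regression: Delta R^2 for clarity beyond duration/topic/status controls.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1200
controls = rng.normal(size=(n, 3))            # e.g. duration, topic, scientific status
clarity = rng.normal(size=n)                  # mean of 50 LLM clarity ratings per transcript
log_likes = 0.3 * clarity + controls @ np.array([0.2, 0.1, -0.1]) + rng.normal(size=n)

r2_controls = LinearRegression().fit(controls, log_likes).score(controls, log_likes)
full_X = np.column_stack([controls, clarity])
r2_full = LinearRegression().fit(full_X, log_likes).score(full_X, log_likes)
print(f"Delta R^2 for clarity beyond controls: {r2_full - r2_controls:.3f}")
```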
Policy researchers need scalable ways to surface public views, yet they often rely on interviews, listening sessions, and surveys analyzed thematically, which are slow, expensive, and limited in scale and diversity. LLMs offer new possibilities for thematic analysis of unstructured text, yet we know little about how LLM-assisted workflows perform for policy research. Building on a workflow for LLM-assisted thematic analysis of online forums, we conduct a study with 11 policy researchers, who use an early prototype and see it as a quick, rough-and-ready input to their research. We then extend and scale the workflow to analyze millions of Reddit posts and 1,058 chatbot-led interview transcripts on a policy-relevant topic, treating these sources as rich and scalable data for policy discourse. We compare the synthesized themes to those from authoritative policy reports, identify points of alignment and divergence, and discuss what this implies for policy researchers adopting LLM-assisted workflows.
The deployment of Large Language Models (LLMs) has ignited concerns about technological unemployment. Existing task-based evaluations predominantly measure theoretical "exposure" to AI capabilities, ignoring critical frictions of real-world commercial adoption: liability, compliance, and physical safety. We argue occupations are not eradicated instantaneously, but gradually encroached upon via atomic actions. We introduce a Tech-Risk Dual-Factor Model to re-evaluate this. By deconstructing 923 occupations into 2,087 Detailed Work Activities (DWAs), we utilize a multi-agent LLM ensemble to score both technical feasibility and business risk. Through variance-based Human-in-the-Loop (HITL) validation with an expert panel, we demonstrate a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability. Applying a strictly algorithmic baseline via mathematical bottleneck aggregation, we calculate Relative Occupational Automation Indices ($OAI$) for the U.S. labor market. Our findings challenge the traditional Routine-Biased Technological Change (RBTC) hypothesis. Non-routine cognitive roles highly dependent on symbolic manipulation (e.g., Data Scientists) face unprecedented exposure ($OAI \approx 0.70$). Conversely, unstructured physical trades and high-stakes caretaking roles exhibit absolute resilience, quantifying a profound "Cognitive Risk Asymmetry." We hypothesize the emergent necessity of a "Compliance Premium," indicating wage resilience increasingly tied to risk-absorption capacity. We frame these findings as a cross-sectional diagnostic of systemic vulnerability, establishing a foundation for subsequent Computable General Equilibrium (CGE) econometric modeling involving dynamic wage elasticity and structural labor reallocation.
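A minimal sketch of a bottleneck-style aggregation from activity-level scores to an occupation-level automation index, in the spirit of the $OAI$ described above. The way technical feasibility and business risk are combined, and the use of the minimum as the bottleneck, are assumptions for illustration rather than the paper's exact formula.

```python
# Aggregate DWA-level (feasibility, risk) scores into one occupation-level index.
def occupation_automation_index(dwa_scores):
    # dwa_scores: list of (technical_feasibility, business_risk) in [0, 1] per work activity.
    # An activity is treated as automatable only to the extent it is feasible AND low-risk;
    # the occupation is bottlenecked by its least automatable activity.
    per_activity = [feas * (1.0 - risk) for feas, risk in dwa_scores]
    return min(per_activity)

# Example: one high-risk activity caps the whole occupation's index.
print(f"{occupation_automation_index([(0.9, 0.1), (0.8, 0.2), (0.7, 0.9)]):.2f}")  # -> 0.07
```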
Affordances, originating in psychology, describe how an object's design influences the physical and cognitive actions users may take. Past work applied affordance theory to visualization to explain how design decisions can impact the cognitive actions of visualization readers. In this work, we demonstrate that affordances can complement effectiveness rankings by further explaining the root causes behind visualizations' task performance. To do so, we conduct a case study on static normal probability density function plots, identifying their current affordances. Next, we identify the optimal affordances for a common probability-comparison task and develop a novel affordance-driven visualization, the Croissant Chart, to support them. We empirically validate the design's effectiveness through a preregistered study (n = 808), demonstrating how affordances can inform predictable changes in task performance. Our findings underscore the potential for affordance-based approaches to enhance visualization effectiveness and inform future design decisions.
As LLMs are deployed in high-stakes settings, users must judge the correctness of individual responses, often relying on model-generated justifications such as reasoning chains or explanations. Yet, no standard measure exists for whether these justifications help users distinguish correct answers from incorrect ones. We formalize this idea as error verifiability and propose $v_{\text{bal}}$, a balanced metric that measures whether justifications enable raters to accurately assess answer correctness, validated against human raters who show high agreement. We find that neither common approaches, such as post-training and model scaling, nor more targeted, previously recommended interventions improve verifiability. We introduce two methods that succeed at improving verifiability: reflect-and-rephrase (RR) for mathematical reasoning and oracle-rephrase (OR) for factual QA, both of which improve verifiability by incorporating domain-appropriate external information. Together, our results establish error verifiability as a distinct dimension of response quality that does not emerge from accuracy improvements alone and requires dedicated, domain-aware methods to address.
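A minimal sketch of a balanced verifiability score in the spirit of $v_{\text{bal}}$ described above, assumed here to be the balanced accuracy of rater verdicts ("looks correct" / "looks incorrect") against the answers' true correctness; the paper's exact formulation may differ.

```python
# Balanced accuracy of rater judgments about answer correctness.
def v_bal(true_correct, rater_says_correct):
    tp = sum(t and r for t, r in zip(true_correct, rater_says_correct))
    tn = sum((not t) and (not r) for t, r in zip(true_correct, rater_says_correct))
    pos = sum(true_correct)
    neg = len(true_correct) - pos
    tpr = tp / pos if pos else 0.0          # correct answers judged correct
    tnr = tn / neg if neg else 0.0          # incorrect answers flagged as incorrect
    return 0.5 * (tpr + tnr)

# Example: raters accept every answer, so verifiability sits at chance level.
print(v_bal([True, True, False, False], [True, True, True, True]))  # -> 0.5
```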