Practical and Rich User Digitization

Karan Ahuja

TL;DR

This work defines user digitization as building digital representations of people across activity, pose, and behavior, and situates it in a richness-practicality design space. It demonstrates that higher-order digitization, including full-body pose and gaze estimation, can be realized on ubiquitous devices by leveraging sensor fusion, privacy-aware sensing, and synthetic data, thereby preserving practicality. The thesis presents concrete systems (GymCam for unconstrained activity tracking, Pose-on-the-Go for mobile full-body pose, BodySLAM for multi-user digitization, and IMUPoser for passive, device-agnostic pose estimation), along with extensive evaluations and datasets. A core contribution is showing how lower-fidelity sensors can achieve high utility through intelligent processing and multimodal cues, enabling longitudinal health monitoring, XR telepresence, and smarter assistants while addressing privacy concerns. The results indicate promising accuracy and real-time feasibility, with open-source data and models to accelerate future work in practical, rich digitization for everyday devices.

Abstract

A long-standing vision in computer science has been to evolve computing devices into proactive assistants that enhance our productivity, health and wellness, and many other facets of our lives. User digitization is crucial in achieving this vision as it allows computers to intimately understand their users, capturing activity, pose, routine, and behavior. Today's consumer devices, like smartphones and smartwatches, provide a glimpse of this potential, offering coarse digital representations of users with metrics such as step count, heart rate, and a handful of human activities like running and biking. Even these very low-dimensional representations are already bringing value to millions of people's lives, but there is significant potential for improvement. On the other end, professional, high-fidelity, comprehensive user digitization systems exist. For example, motion capture suits and multi-camera rigs digitize our full body and appearance, and scanning machines such as MRI capture our detailed anatomy. However, these carry significant user practicality burdens, such as financial, privacy, ergonomic, aesthetic, and instrumentation considerations, that preclude consumer use. In general, the higher the fidelity of capture, the lower the user practicality. Most conventional approaches strike a balance between user practicality and digitization fidelity. My research aims to break this trend, developing sensing systems that increase user digitization fidelity to create new and powerful computing experiences while retaining or even improving user practicality and accessibility, allowing such technologies to have a societal impact. Armed with such knowledge, our future devices could offer longitudinal health tracking, more productive work environments, full-body avatars in extended reality, and embodied telepresence experiences, to name just a few domains.

Paper Structure (204 sections, 38 figures, 6 tables)

Introduction
User Digitization
Defining User Digitization
Spectrum of User Digitization
Digitization Richness
User Practicality
Relationship between Digitization Richness and User Practicality
Glossary
Organization of Thesis
Background
Human Activity Recognition
Camera-Based Human Activity Recognition
Audio-Based Human Activity Recognition
IMU-Based Human Activity Recognition
Multimodal Human Activity Recognition
...and 189 more sections

Figures (38)

Figure 1: Spectrum of user digitization richness going from lower-order user digitization (on the left) to higher-order digitization (on the right).
Figure 2: A design space plotting digitization richness vs. user practicality. Most approaches lie along the diagonal, where practicality decreases as richness increases. Ideally, we want systems that offer higher richness without commensurate practicality trade-offs.
Figure 3: Future smart homes and offices are envisioned to contain many “smart” devices able to respond to voice commands. However, without device-specific wakewords, multiple devices may try to respond to generic queries (left). Ideally, users would be able to face and speak to a device, more akin to human-human interaction (center and right). Thus, there is a need for Speaker Head-Pose estimation approaches, especially those that can run locally on self-contained devices, without having to install extra sensors in the environment or rely on multi-device interoperability, which does not appear to be forthcoming in the near future.
Figure 4: Video and audio from classroom cameras first flow into a scene parsing layer which captures the user's pose, before being featurized by a series of specialized modules.
Figure 5: EduSense Gaze and Classroom Topology. Left: Percentage of student gaze across various classroom foci (whiteboards, projector screens, lectern) at the end of a class session. Center: Heatmaps of student gaze across the same foci. Right: Heatmap of the instructor's gaze aggregated across a class session.
...and 33 more figures
