Foundation Models for CPS-IoT: Opportunities and Challenges

Ozan Baris, Yizhuo Chen, Gaofeng Dong, Liying Han, Tomoyoshi Kimura, Pengrui Quan, Ruijie Wang, Tianchen Wang, Tarek Abdelzaher, Mario Bergés, Paul Pu Liang, Mani Srivastava

TL;DR

This work analyzes the gap between current CPS-IoT practice and foundation-model-driven approaches, arguing that CPS-IoT demands domain-specific FM design that accounts for diverse sensors, real-time constraints, and physical-world interactions. It surveys the state of CPS-IoT FMs, presents preliminary experiments highlighting resource/quality trade-offs, spatial embodiment, historical context, and structural constraints, and offers a cohesive set of desiderata and ecosystem recommendations. The authors advocate for edge-oriented, sensor-aware, neurosymbolic, and knowledge-graph–augmented CPS-IoT FMs, plus community-scale benchmarks and μFMs to enable practical deployment. Collectively, this work outlines a roadmap for building CPS-IoT foundation tooling that can operate within stringent latency, privacy, and safety requirements while enabling broad task generalization and runtime adaptability.

Abstract

Methods from machine learning (ML) have transformed the implementation of Perception-Cognition-Communication-Action loops in Cyber-Physical Systems (CPS) and the Internet of Things (IoT), replacing mechanistic and basic statistical models with those derived from data. However, the first generation of ML approaches, which depend on supervised learning with annotated data to create task-specific models, faces significant limitations in scaling to the diverse sensor modalities, deployment configurations, application tasks, and operating dynamics characterizing real-world CPS-IoT systems. The success of task-agnostic foundation models (FMs), including multimodal large language models (LLMs), in addressing similar challenges across natural language, computer vision, and human speech has generated considerable enthusiasm for and exploration of FMs and LLMs as flexible building blocks in CPS-IoT analytics pipelines, promising to reduce the need for costly task-specific engineering. Nonetheless, a significant gap persists between the current capabilities of FMs and LLMs in the CPS-IoT domain and the requirements they must meet to be viable for CPS-IoT applications. In this paper, we analyze and characterize this gap through a thorough examination of the state of the art and our research, which extends beyond it in various dimensions. Based on the results of our analysis and research, we identify essential desiderata that CPS-IoT domain-specific FMs and LLMs must satisfy to bridge this gap. We also propose actions by CPS-IoT researchers to collaborate in developing key community resources necessary for establishing FMs and LLMs as foundational tools for the next generation of CPS-IoT systems.

Paper Structure

This paper contains 39 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Test settings. Top: TSFM. Bottom: LLM.
  • Figure 2: An architecture for spatial representation learning.
  • Figure 3: Performance comparison of multi-node classification (measured by accuracy $\uparrow$ and F1 score $\uparrow$) and tracking (measured by MSE $\downarrow$ and MAE $\downarrow$) tasks.
  • Figure 4: (a) Sanitary protocol violation in smart home health monitoring system. (b) Detecting coordinated terrorist attacks at different locations across the city using the surveillance system. (c) In a real-time complex event detection (CED) task, only the raw sensor streams and ground-truth complex event labels are provided.
  • Figure 5: Average F1 scores of models on complex events with different temporal spans.
  • ...and 1 more figure