IoT-LM: Large Multisensory Language Models for the Internet of Things

Shentong Mo, Russ Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang

TL;DR

IoT-LM tackles the challenge of learning from richly multisensory IoT data by grounding a pretrained large language model with a dedicated multisensory encoder and a novel multisensory multitask adapter. The authors introduce the 1.15-million-sample MultiIoT dataset spanning 12 modalities and 8 tasks, and demonstrate joint learning through multisensory pretraining and instruction tuning. The approach yields strong improvements across 8 IoT tasks, enables zero-shot and few-shot transfer, and exhibits favorable scaling properties, establishing a foundation for interactive, reasoning-enabled IoT systems. By releasing data, models, and training code, IoT-LM aims to accelerate practical development of sensor-grounded language reasoning for smart devices and cities.

Abstract

The Internet of Things (IoT) network integrating billions of smart physical devices embedded with sensors, software, and communication technologies is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, and audio to recognize the states of humans and physical objects. Machine learning presents a rich opportunity to automatically process IoT data at scale, enabling efficient inference for understanding human wellbeing, controlling physical devices, and interconnecting smart cities. To realize this potential, we introduce IoT-LM, an open-source large multisensory language model tailored for the IoT ecosystem. IoT-LM is enabled by two technical contributions: the first is MultiIoT, the most expansive unified IoT dataset to date, encompassing over 1.15 million samples from 12 modalities and 8 tasks prepared for multisensory pre-training and instruction-tuning. The second is a new multisensory multitask adapter layer to condition pre-trained large language models on multisensory IoT data. Not only does IoT-LM yield substantial improvements on 8 supervised IoT classification tasks, but it also demonstrates new interactive question-answering, reasoning, and dialog capabilities conditioned on IoT sensors. We release IoT-LM's data sources and new multisensory language modeling framework.

Paper Structure

This paper contains 31 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustration of the IoT-LM architecture, highlighting the integration of multisensory data through modality-specific encoders and the novel multisensory multitask adapter layer. We illustrate how different sensory inputs are processed, combined, and used to adapt a pre-trained language model for IoT applications so that it can handle and interpret complex, real-world sensor data efficiently.
  • Figure 2: Illustration of the multisensory multitask adapter layer we designed for IoT-LM. The adapter takes multiple sensor features extracted from dedicated input encoders, performs multimodal fusion into higher-order representations, and simultaneously transforms all fused multimodal features into the same representation space for an LLM to process.
  • Figure 3: Illustration of the IoT-LM instruction tuning paradigm, in which the model learns to perform specific tasks based on directive inputs. Training on a diverse range of input modalities and output tasks enables IoT-LM to process multiple IoT inputs and execute complex tasks.
  • Figure 4: Dialog for audio example. Our IoT-LM accurately predicts the activity corresponding to the input audio spectrogram, and gives a reasonable explanation for its prediction.
  • Figure 5: Dialog for IMU example. IoT-LM accurately predicts the activity corresponding to the input IMU data, and also gives a reasonable explanation for its prediction.
  • ...and 4 more figures
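
The data flow described in the Figure 2 caption (per-sensor features in, fusion, then a projection into the representation space of the LLM) can be illustrated with a small sketch. This is not the authors' implementation: the class name `ToyAdapter`, the per-modality linear projections, the mean-based fusion, and all dimensions are assumptions chosen only to make the flow concrete, and the real adapter may fuse modalities quite differently.

```python
import random


def init_weights(d_in, d_out, rng):
    """Small random weight matrix with d_in rows and d_out columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def linear(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


class ToyAdapter:
    """Hypothetical stand-in for a multisensory adapter (assumed design)."""

    def __init__(self, feature_dims, d_fused, d_llm, seed=0):
        rng = random.Random(seed)
        # One projection per modality into a shared fusion space (assumption).
        self.proj = {
            name: init_weights(dim, d_fused, rng)
            for name, dim in feature_dims.items()
        }
        # One shared projection from the fusion space to the LLM space.
        self.to_llm = init_weights(d_fused, d_llm, rng)

    def __call__(self, features):
        # Project each available modality into the shared fusion space.
        projected = [linear(vec, self.proj[name]) for name, vec in features.items()]
        # Fuse by element-wise averaging (assumption; a simple placeholder).
        fused = [sum(vals) / len(projected) for vals in zip(*projected)]
        # Map the fused vector into the LLM's embedding dimensionality.
        return linear(fused, self.to_llm)


# Example: two hypothetical encoder outputs (IMU and audio feature vectors).
adapter = ToyAdapter({"imu": 6, "audio": 8}, d_fused=4, d_llm=5)
token = adapter({"imu": [0.1] * 6, "audio": [0.2] * 8})
print(len(token))  # 5, the assumed LLM embedding size
```

Because the sketch only iterates over the modalities actually passed in, the same adapter also accepts a subset of sensors, for example `adapter({"imu": [0.1] * 6})`. In a real system the output vector would be passed to the language model as an input embedding alongside the text tokens.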