SmartLLM: Multidimensional Dataset Generation via LLM Simulation in Smart Home
Fortino, Giancarlo
2026-01-01
Abstract
Human activity prediction (HAP) is crucial for intelligent smart home services, yet it is often hindered by the scarcity of high-quality, multidimensional datasets. Existing datasets are typically fragmented: they capture either long-term activity sequences or short-term device interactions, but rarely both in a unified manner. Traditional data collection is costly and time-consuming, while conventional simulation techniques struggle to generate diverse and logically coherent behavior sequences. To address these limitations, we propose SmartLLM, a large language model (LLM)-based simulation framework for the automated generation of multidimensional smart home datasets. SmartLLM simulates agents with distinct profiles (e.g., old man, remote worker, and holiday maker) performing daily activities within configurable home environments, and generates temporally aligned sequences across the activity, device, and sensor dimensions. We generated two months of simulated data for three user profiles and validated its plausibility through activity distribution visualization, statistical perplexity analysis, and case studies. Multidimensional feature validation experiments further show that the multidimensional data significantly improves the accuracy of activity prediction models compared with single-dimensional features. This work addresses key bottlenecks in smart home data acquisition and provides a scalable, high-quality data foundation for advancing smart home algorithm research. The code is available at: https://github.com/HuankeZheng/SmartLLM
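The abstract mentions a statistical perplexity analysis of the generated activity sequences but does not say which language model is used to compute it. The following is a minimal sketch of the general idea only, assuming a bigram model with add-alpha smoothing. The function name `bigram_perplexity`, the `alpha` parameter, and the toy activity labels are hypothetical and are not taken from the SmartLLM code.

```python
import math
from collections import Counter


def bigram_perplexity(train, test, alpha=1.0):
    """Perplexity of `test` under an add-alpha smoothed bigram model fit on `train`.

    Both arguments are lists of activity sequences (lists of labels).
    ASSUMPTION: a bigram model is used purely for illustration; the source
    does not specify how its perplexity analysis is computed.
    """
    # Vocabulary covers every label seen in either set, so smoothing is well defined.
    vocab_size = len({label for seq in train + test for label in seq})

    context_counts = Counter()     # how often each label appears as the previous activity
    transition_counts = Counter()  # how often each (previous, current) pair occurs
    for seq in train:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            transition_counts[(prev, cur)] += 1

    log_prob = 0.0
    n_transitions = 0
    for seq in test:
        for prev, cur in zip(seq, seq[1:]):
            p = (transition_counts[(prev, cur)] + alpha) / (
                context_counts[prev] + alpha * vocab_size
            )
            log_prob += math.log(p)
            n_transitions += 1

    if n_transitions == 0:
        raise ValueError("test sequences contain no transitions")
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-log_prob / n_transitions)


# Toy data (hypothetical labels): two reference days and two candidate days.
train = [
    ["sleep", "cook", "eat", "work"],
    ["sleep", "cook", "eat", "relax"],
]
plausible = [["sleep", "cook", "eat"]]
shuffled = [["eat", "sleep", "cook"]]

ppl_plausible = bigram_perplexity(train, plausible)
ppl_shuffled = bigram_perplexity(train, shuffled)
```

Under this sketch, a sequence that follows the transition patterns of the reference data scores a lower perplexity than a reordered one, which is the sense in which perplexity can serve as a plausibility check for simulated behavior.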


