RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence

Zengyuan Lai1,2,*, Jiarui Yang1,*, Songpengcheng Xia1,*, Lizhou Lin1, Lan Sun1,
Renwen Wang2, Jianran Liu2, Qi Wu2, Ling Pei1,†

1Shanghai Jiao Tong University    2Bytedance Research
*Equal contributions    †Corresponding author

Millimeter-wave radar provides a privacy-preserving solution for human motion analysis, yet its sparse point clouds pose significant challenges for semantic understanding. We present RadarLLM, the first framework that leverages large language models (LLMs) for human motion understanding using millimeter-wave radar as the sensing modality. Our approach introduces two key innovations: (1) a motion-guided radar tokenizer based on our Aggregate VQ-VAE architecture that incorporates deformable body templates and masked trajectory modeling to encode spatiotemporal point clouds into compact semantic tokens, and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To address data scarcity, we introduce a physics-aware synthesis pipeline that generates realistic radar-text pairs from motion-text datasets. Extensive experiments demonstrate that RadarLLM achieves state-of-the-art performance across both synthetic and real-world benchmarks, enabling accurate translation of millimeter-wave signals to natural language descriptions. This capability facilitates comprehensive motion understanding in privacy-sensitive applications such as healthcare and smart homes. We will release the full implementation to support further research.
Pipeline

Overview of RadarLLM. We first encode radar point clouds into discrete tokens via a Motion-guided Radar Tokenizer. The Radar-aware Language Model then aligns these tokens with textual representations in a shared embedding space through joint optimization of unsupervised token reconstruction and supervised bidirectional radar-text translation. A minimal sketch of this joint objective follows below.
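To make the joint objective concrete, here is a minimal PyTorch sketch of a language model trained with next-token prediction over a shared radar/text vocabulary. All names (`RadarAwareLM`, `joint_loss`), sizes, and the backbone choice are illustrative assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadarAwareLM(nn.Module):
    """Causal LM over a joint vocabulary: text tokens followed by radar codes,
    so both modalities share one embedding space (assumed design)."""
    def __init__(self, text_vocab=32000, radar_codes=512, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(text_vocab + radar_codes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, text_vocab + radar_codes)

    def forward(self, ids):  # ids: (B, T) mixed radar/text token ids
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        h = self.backbone(self.embed(ids), mask=mask)
        return self.head(h)  # (B, T, joint-vocab) logits

def lm_loss(model, seq):
    """Next-token cross-entropy over a batch of mixed token sequences."""
    logits = model(seq[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           seq[:, 1:].reshape(-1))

# Joint objective: unsupervised modeling of radar token streams plus
# supervised translation in both directions (radar->text and text->radar).
def joint_loss(model, radar_only, radar_then_text, text_then_radar):
    return (lm_loss(model, radar_only)
            + lm_loss(model, radar_then_text)
            + lm_loss(model, text_then_radar))
```

Placing radar codes in the same embedding table as the text vocabulary is one straightforward way to realize a shared embedding space; the relative weighting of the three loss terms is left uniform here for simplicity.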

Radar Tokenizer

Architecture and training pipeline of the motion-guided radar tokenizer. Built upon our Aggregate VQ-VAE, the tokenizer compresses radar point cloud sequences into discrete semantic tokens through joint point cloud sequence reconstruction and motion embedding learning.
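For intuition, the sketch below shows a plain VQ-VAE quantization bottleneck in PyTorch of the kind such a tokenizer builds on. The deformable body templates and masked trajectory modeling that distinguish the Aggregate VQ-VAE are omitted, and all names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ bottleneck with straight-through gradients. The paper's
    Aggregate VQ-VAE adds body-template conditioning and masked trajectory
    modeling on top of a bottleneck of this kind (not modeled here)."""
    def __init__(self, num_codes=512, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):  # z: (B, T, dim) encoder latents
        dists = torch.cdist(z, self.codebook.weight)  # (B, T, num_codes)
        idx = dists.argmin(dim=-1)                    # nearest-code ids
        z_q = self.codebook(idx)                      # quantized latents
        # Codebook loss pulls codes toward latents; commitment loss the reverse.
        vq_loss = (F.mse_loss(z_q, z.detach())
                   + self.beta * F.mse_loss(z, z_q.detach()))
        z_q = z + (z_q - z).detach()  # straight-through estimator
        return z_q, idx, vq_loss      # idx is the discrete token stream
```

The `idx` stream is what a downstream language model would consume as radar tokens, while `z_q` feeds the decoder for point cloud sequence reconstruction.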

Virtual Data Generation

Virtual radar-text data generation pipeline. Starting from existing motion-text datasets, the Radar-Text dataset is constructed by simulating radar reflections from SMPL motion sequences with ray tracing and signal-processing techniques.
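As a rough illustration of the synthesis idea, the NumPy sketch below treats SMPL vertices at consecutive frames as scatterers and derives a per-point radial (Doppler) velocity from frame-to-frame motion. The actual pipeline's ray tracing, occlusion handling, and radar signal processing are not modeled; `simulate_radar_frame` and its parameters are hypothetical.

```python
import numpy as np

def simulate_radar_frame(verts_t, verts_prev, radar_pos,
                         dt=1 / 30, n_keep=64, noise_std=0.02, seed=0):
    """Emit a sparse (N, 4) point cloud of xyz plus radial velocity.
    verts_t, verts_prev: (V, 3) SMPL vertices at consecutive frames;
    radar_pos: (3,) sensor location; dt: frame interval in seconds."""
    rng = np.random.default_rng(seed)
    rays = verts_t - radar_pos                            # sensor-to-point vectors
    unit = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    vel = (verts_t - verts_prev) / dt                     # per-vertex velocity
    v_radial = np.sum(vel * unit, axis=1)                 # line-of-sight projection
    # Random subsampling stands in for ray-traced visibility and detection
    # thresholds; Gaussian jitter mimics range/angle quantization noise.
    keep = rng.choice(len(verts_t), size=min(n_keep, len(verts_t)), replace=False)
    pts = verts_t[keep] + rng.normal(0.0, noise_std, size=(len(keep), 3))
    return np.concatenate([pts, v_radial[keep, None]], axis=1)
```

Pairing each synthesized point cloud sequence with the text annotation of its source motion yields radar-text training pairs without collecting real radar data.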

Qualitative Results on Virtual Data

Qualitative Results on Real Data

Quantitative Results

Comparison with state-of-the-art methods on virtual and real datasets.

Citation

@article{lai2025radarllm,
  title   = {RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence},
  author  = {Lai, Zengyuan and Yang, Jiarui and Xia, Songpengcheng and Lin, Lizhou and Sun, Lan and Wang, Renwen and Liu, Jianran and Wu, Qi and Pei, Ling},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2025}
}

Thanks to Ziwei Shan for the website template