Generating physically realistic humanoid–object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often suffer from artifacts including implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. These key actions capture essential interaction dynamics and are interpolated into smooth motion trajectories, naturally supporting long-horizon generation. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding. Furthermore, we introduce a mutual fine-tuning strategy in which the generative model and the control policy iteratively refine each other, improving both motion realism and tracking robustness. Extensive experiments demonstrate that SimGenHOI generates realistic, diverse, and physically plausible humanoid–object interactions, achieving significantly higher tracking success rates in simulation and enabling long-horizon manipulation tasks.
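The abstract mentions that sparse key actions are interpolated into smooth, dense motion trajectories. The paper's exact interpolation scheme is not specified here; the sketch below uses simple per-dimension linear interpolation as an illustration (in practice, joint rotations would require SLERP rather than linear blending). The function name and signature are hypothetical.

```python
import numpy as np

def interpolate_key_actions(key_times, key_poses, fps=30):
    """Interpolate sparse key actions into a dense trajectory.

    key_times: (K,) timestamps of key actions, in seconds.
    key_poses: (K, D) pose vectors (e.g. joint positions). Rotational
               components would need SLERP instead of linear interpolation.
    Returns a (T, D) trajectory sampled at `fps` frames per second.
    """
    t_dense = np.arange(key_times[0], key_times[-1], 1.0 / fps)
    # Linearly interpolate each pose dimension between consecutive keys.
    traj = np.stack(
        [np.interp(t_dense, key_times, key_poses[:, d])
         for d in range(key_poses.shape[1])],
        axis=1,
    )
    return traj

# Three key actions over two seconds, a 2-D pose for illustration.
key_times = np.array([0.0, 1.0, 2.0])
key_poses = np.array([[0.0, 0.0], [1.0, 2.0], [0.0, 4.0]])
traj = interpolate_key_actions(key_times, key_poses, fps=10)
```

Because only the sparse keys need to be generated, the diffusion model's output length stays fixed while the interpolated trajectory can span arbitrarily long horizons.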
Our proposed framework uses a diffusion model for key action generation and reinforcement learning to train a contact-aware HOI policy in the Isaac Gym simulator. The framework, depicted in (c), consists of two main components: (a) HOI generation conditioned on the text prompt, object geometry, sparse object waypoints, and the initial humanoid pose; and (b) the contact-aware humanoid–object interaction (HOI) policy. \( \hat G \), \( q^{obj}_{t+1:t+n}\), and \( q_t^{obj}\) denote the object's geometry, its future trajectory, and its current state; \( \hat c_{t+1}\), \( s^p_t\), and \(\hat s^p_{t+1}\) denote the contact guidance, the humanoid's current state, and its next-state reference.
We compare our method against CHOIS [Li et al., ECCV 2024]. Our full model uses 1000 denoising steps, while Ours_200 is a variant with only 200 steps. Ours w/o weight is the variant with joint weighting removed during key-action extraction.
@article{lin2025simgenhoi,
title={SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning},
author={Lin, Yuhang and Xie, Yijia and Xie, Jiahong and Huang, Yuehao and Wang, Ruoyu and Lv, Jiajun and Ma, Yukai and Zuo, Xingxing},
journal={arXiv preprint arXiv:2508.14120},
year={2025}
}