SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction
via Generative Modeling and Reinforcement Learning


1Institute of Cyber-Systems and Control, Zhejiang University, 2Department of Robotics, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)

TL;DR: This work introduces SimGenHOI, a unified framework that improves both the quality of generated actions and the tracking success rate of humanoids interacting with objects in simulated environments.

Abstract

Generating physically realistic humanoid–object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often produce artifacts such as implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. These key actions capture the essential interaction dynamics and are interpolated into smooth motion trajectories, naturally supporting long-horizon generation. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding. Furthermore, we introduce a mutual fine-tuning strategy in which the generative model and the control policy iteratively refine each other, improving both motion realism and tracking robustness. Extensive experiments demonstrate that SimGenHOI generates realistic, diverse, and physically plausible humanoid–object interactions, achieving significantly higher tracking success rates in simulation and enabling long-horizon manipulation tasks.
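The abstract states that sparse key actions are interpolated into smooth, dense motion trajectories. The sketch below illustrates this idea with plain per-channel linear interpolation; the function name, shapes, and the choice of linear interpolation are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def interpolate_key_actions(key_poses: np.ndarray, key_times: np.ndarray,
                            fps: int = 30) -> np.ndarray:
    """Interpolate sparse key poses into a dense motion trajectory.

    key_poses: (K, D) array of D-dimensional poses at K key timesteps.
    key_times: (K,) monotonically increasing times in seconds.
    Returns an (N, D) trajectory sampled at `fps` frames per second.
    """
    t_dense = np.arange(key_times[0], key_times[-1], 1.0 / fps)
    # np.interp handles one scalar channel at a time, so interpolate
    # each pose dimension independently and stack the results.
    traj = np.stack([np.interp(t_dense, key_times, key_poses[:, d])
                     for d in range(key_poses.shape[1])], axis=1)
    return traj

# Toy example: three 2-D key poses spanning two seconds.
key_times = np.array([0.0, 1.0, 2.0])
key_poses = np.array([[0.0, 0.0], [1.0, 2.0], [0.0, 4.0]])
traj = interpolate_key_actions(key_poses, key_times, fps=10)
print(traj.shape)  # (20, 2)
```

In practice, orientation channels would need rotation-aware interpolation (e.g. slerp on quaternions) rather than linear blending, but the dense trajectory produced this way is what the control policy would then track frame by frame.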

Render in Blender


Pipeline

Our proposed framework uses a diffusion model for key action generation and reinforcement learning to train a contact-aware HOI policy in the Isaac Gym simulator. The framework, depicted in (c), consists of two main components: (a) HOI generation conditioned on a text prompt, object geometry, sparse object waypoints, and the initial humanoid pose; and (b) a contact-aware humanoid-object interaction (HOI) policy. Here \( \hat G \), \( q^{obj}_{t+1:t+n} \), and \( q_t^{obj} \) denote the object's geometry, future trajectory, and current state, while \( \hat c_{t+1} \), \( s^p_t \), and \( \hat s^p_{t+1} \) denote the contact guidance, the current humanoid state, and its next-state reference.
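To make the policy inputs named above concrete, the sketch below assembles them into a single flat observation vector, a common pattern for RL policies in Isaac Gym. All function names and dimensions here are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def build_policy_observation(s_p_t, s_p_ref, q_obj_t, q_obj_future,
                             geom_feat, contact_guidance):
    """Concatenate the pipeline's inputs into one observation vector.

    Shapes are illustrative assumptions:
    s_p_t, s_p_ref   : humanoid state s^p_t / reference s^p_{t+1}, (Dp,)
    q_obj_t          : current object state q^{obj}_t, (Do,)
    q_obj_future     : future object trajectory q^{obj}_{t+1:t+n}, (n, Do)
    geom_feat        : encoded object geometry \hat{G}, (Dg,)
    contact_guidance : contact guidance \hat{c}_{t+1}, (Dc,)
    """
    return np.concatenate([s_p_t, s_p_ref, q_obj_t,
                           q_obj_future.reshape(-1),
                           geom_feat, contact_guidance])

# Toy sizes: 69-D humanoid state, 7-D object pose, 4 future steps,
# 64-D geometry embedding, 8-D contact guidance.
obs = build_policy_observation(
    s_p_t=np.zeros(69), s_p_ref=np.zeros(69),
    q_obj_t=np.zeros(7), q_obj_future=np.zeros((4, 7)),
    geom_feat=np.zeros(64), contact_guidance=np.zeros(8))
print(obs.shape)  # (245,)
```

A flat concatenation like this keeps the policy network simple (an MLP can consume it directly); the actual architecture may instead process the geometry and trajectory conditions with separate encoders.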

Experimental Results

Lift the largetable, so only two legs are off the floor. Slide your feet and rotate the largetable as you slide.
Lift the monitor, move the monitor, and put down the monitor.
Lift the plasticbox, move the plasticbox, and put down the plasticbox.
Facing the back of the woodchair, lift the woodchair, move the woodchair, and then place the woodchair on the floor.
GT
CHOIS
Ours
Ours_200
Ours w/o weight

We compare our method against the existing CHOIS approach [Li et al., ECCV 2024]. Our full model uses 1000 denoising steps, while Ours_200 is a variant with only 200 steps. Ours w/o weight refers to the version where joint weighting is removed during key action extraction.

Video

BibTeX

@article{lin2025simgenhoi,
  title={SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning}, 
  author={Lin, Yuhang and Xie, Yijia and Xie, Jiahong and Huang, Yuehao and Wang, Ruoyu and Lv, Jiajun and Ma, Yukai and Zuo, Xingxing},
  journal={arXiv preprint arXiv:2508.14120},
  year={2025}
}