RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang
DAMO Academy, Alibaba
This repository is the official implementation of RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space.
Gallery
Here are several example videos generated by RealisMotion. Note that the GIFs shown here suffer some visual quality degradation. Please visit our project page for more than 100 video examples.
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
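For intuition, below is a minimal, illustrative PyTorch sketch of the conditioning scheme described above (background concatenated along the channel dimension, motion signals added, subject injected as extra tokens for full attention). Layer names, shapes, and the block structure are assumptions for illustration only, not the RealisMotion implementation.

```python
# Illustrative sketch (NOT the RealisMotion code) of the conditioning scheme:
# background -> channel concat, motion -> addition, subject -> extra tokens.
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int, latent_channels: int, num_heads: int = 8):
        super().__init__()
        self.patch_proj = nn.Linear(2 * latent_channels, dim)   # video + background (channel concat)
        self.motion_proj = nn.Linear(latent_channels, dim)      # rendered motion conditions
        self.subject_proj = nn.Linear(latent_channels, dim)     # reference subject
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video, background, motion, subject):
        # video, background, motion: (B, N, C) flattened spatio-temporal latents
        # subject: (B, Ns, C) latents of the reference subject
        x = self.patch_proj(torch.cat([video, background], dim=-1))   # channel concatenation
        x = x + self.motion_proj(motion)                              # motion control by addition
        tokens = torch.cat([x, self.subject_proj(subject)], dim=1)    # subject as extra tokens
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)                                   # full attention over all tokens
        return (tokens + out)[:, : x.shape[1]]                        # keep only the video tokens

# Dummy shapes just to show the data flow.
block = ConditionedDiTBlock(dim=64, latent_channels=16)
b, n, ns, c = 1, 128, 32, 16
y = block(torch.randn(b, n, c), torch.randn(b, n, c), torch.randn(b, n, c), torch.randn(b, ns, c))
print(y.shape)  # torch.Size([1, 128, 64])
```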
Quick Start
1. Setup Repository and Environment
git clone https://github.com/jingyunliang/RealisMotion.git
cd RealisMotion
conda create -n realismotion python=3.10
conda activate realismotion
pip install -r requirements.txt
# install FA3
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout 0dfb28174333d9eefb7c1dd4292690a8458d1e89 # Important: other FA3 commits may yield bad results on H20 GPUs
cd hopper
python setup.py install
cd ../../
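After installation, an optional sanity check can confirm that PyTorch sees the GPU and that the FA3 build imports. The `flash_attn_interface` module name is an assumption about the FA3 hopper build and may differ across commits.

```python
# Optional sanity check: confirm PyTorch sees a GPU and the FA3 build imports.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          "| compute capability:", torch.cuda.get_device_capability(0))

try:
    import flash_attn_interface  # noqa: F401  (assumed FA3 hopper interface name)
    print("FlashAttention-3 import: OK")
except ImportError as err:
    print("FlashAttention-3 import failed:", err)
```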
2. Download Checkpoints
We provide two versions for inference: a text-to-video (T2V) version (the same model as in the paper) and an image-to-video (I2V) version (to avoid duplicate work, we directly combine it with the concurrent work RealisDance-DiT).
Please download the checkpoints for the version (T2V or I2V) you plan to use. Use `HF_ENDPOINT=https://hf-mirror.com huggingface-cli xxxx` if you need to speed up downloading. An illustrative `huggingface_hub` example is sketched below.
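As a sketch only, the checkpoints can also be fetched with the `huggingface_hub` Python API; the repository IDs below are placeholders and must be replaced with the official checkpoint repos.

```python
# Illustrative download via the huggingface_hub Python API. The repo IDs are
# PLACEHOLDERS -- replace them with the official RealisMotion checkpoint repos.
# Note: HF_ENDPOINT=https://hf-mirror.com also applies to huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="YOUR_ORG/RealisMotion-T2V",        # placeholder
                  local_dir="checkpoints/realismotion_t2v")
# snapshot_download(repo_id="YOUR_ORG/RealisMotion-I2V",      # placeholder
#                   local_dir="checkpoints/realismotion_i2v")
```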
3. Quick Inference
- Inference with Single GPU
Run the single-GPU inference command for the version (T2V or I2V) you downloaded.
Note: add `--enable-teacache` to run inference with TeaCache for acceleration (optional; may cause quality degradation); add `--save-gpu-memory` to run inference with limited GPU memory (optional; much slower, and can be combined with TeaCache).
- Inference with Multiple GPUs (optional; can be combined with TeaCache)
Run the multi-GPU variant of the T2V or I2V inference command.
4. Custom Batch Inference
Both the T2V and I2V versions support batch inference over a user-prepared `YOUR_ROOT_PATH` directory containing the inputs for each sample.
5. Motion Editing and Guidance Condition Rendering
To edit the trajectory, orientation, and action of the human, please follow the steps below.
1. Setup Environment
First, please install GVHMR and DPVO following the GVHMR installation guide. For H20 GPUs, the nvcc flags in `third-party/DPVO/setup.py` should be changed to `['-O3', '-gencode', 'arch=compute_90,code=sm_90']`.
Then, please install DepthPro for focal length calibration as follows (optional).
git clone https://github.com/apple/ml-depth-pro
cd ml-depth-pro
pip install .
source get_pretrained_models.sh
cd ..
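For reference, focal-length calibration with DepthPro roughly follows the usage shown in the ml-depth-pro README; the API may change across versions, and the frame path below is a placeholder.

```python
# Optional: estimate the focal length (in pixels) of a background frame with DepthPro.
import depth_pro

model, transform = depth_pro.create_model_and_transforms()
model.eval()

image, _, f_px = depth_pro.load_rgb("inputs/example_video/YOUR_FRAME.jpg")  # placeholder path
prediction = model.infer(transform(image), f_px=f_px)

print("estimated focal length (px):", float(prediction["focallength_px"]))
```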
2. SMPL-X Estimation
We first estimate SMPL-X parameters for the input foreground subject, background, and motion videos/images.
cd RealisMotion
export PYTHONPATH="/path/to/GVHMR/hmr4d:$PYTHONPATH"  # replace with your local GVHMR path
# process foreground and background
python hmc/render_demo.py --video=inputs/example_video/internalaffairs.mp4 --output_root inputs/demo --track_id 1
# process motion
python hmc/render_demo.py --video=inputs/example_video/falldown.mp4 --output_root inputs/demo
By default, `--track_id` is set to 0 to track the first person. Use `-s` for a static background. If you only have an image, convert it to a video first as below.
ffmpeg -loop 1 -i inputs/example_video/YOUR_IMAGE.jpg -c:v libx264 -preset veryslow -crf 0 -t 1 -pix_fmt yuv420p -vf "fps=25,scale=trunc(iw/2)*2:trunc(ih/2)*2" inputs/example_video/YOUR_VIDEO.mp4 -y
3. Motion/Background/Subject Editing
To edit the motion, you need to specify the background path, the motion path, and the reference foreground path. We currently provide four examples for different use cases.
# example 1: affine transformation
export PYTHONPATH="/path/to/GVHMR/hmr4d:$PYTHONPATH"  # replace with your local GVHMR path
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/internalaffairs.mp4 \
--motion_path inputs/motion_bank/falldown \
--reference_path inputs/demo/internalaffairs \
--output_root inputs/demo \
--window_size 1 \
--repeat_smpl 0 50 1 \
--pause_at_begin 50 \
--pause_at_end 105 \
--edit_type affine_transform \
--affine_transform_args 0 0 0 -0.3
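As a conceptual illustration only, an affine edit of the SMPL-X root trajectory in world coordinates can be thought of as a rotation about the up-axis plus a 3D offset; the mapping from `--affine_transform_args` to these quantities is an assumption, not the script's actual argument semantics.

```python
# Conceptual sketch of an affine edit on a world-space root trajectory.
import numpy as np

def affine_edit_root(transl, yaw_deg=0.0, offset=(0.0, 0.0, 0.0)):
    """transl: (T, 3) per-frame global root translations in the world frame."""
    yaw = np.deg2rad(yaw_deg)
    rot = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                    [0.0,          1.0, 0.0],
                    [-np.sin(yaw), 0.0, np.cos(yaw)]])   # rotation about the (assumed) up-axis
    return transl @ rot.T + np.asarray(offset)

# e.g. shift a motion by -0.3 along one axis without rotating it
edited = affine_edit_root(np.zeros((10, 3)), yaw_deg=0.0, offset=(0.0, 0.0, -0.3))
```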
# example 2: move according to any trajectory
export PYTHONPATH="/path/to/GVHMR/hmr4d:$PYTHONPATH"  # replace with your local GVHMR path
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/justin.mp4 \
--motion_path inputs/motion_bank/tstageboy \
--reference_path inputs/demo/justin \
--output_root inputs/demo \
--window_size 25 \
--repeat_smpl 18 51 8 \
--edit_type edit_trajectory \
--edit_trajectory_args 815 500 952 1050 1192 502 \
-s
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/male.mp4 \
--motion_path inputs/motion_bank/tstageboy \
--reference_path inputs/demo/male \
--output_root inputs/demo \
--window_size 25 \
--repeat_smpl 18 51 12 \
--edit_type edit_trajectory \
--edit_trajectory_args 1107 48 1098 662 1347 486 1154 303 952 682 1136 832 1100 494 910 787 \
--append circle \
-s
# example 3: move according to a heart-shaped trajectory
export PYTHONPATH="/path/to/GVHMR/hmr4d:$PYTHONPATH"  # replace with your local GVHMR path
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/male.mp4 \
--motion_path inputs/motion_bank/tstageboy \
--reference_path inputs/demo/male \
--output_root inputs/demo \
--window_size 25 \
--repeat_smpl 18 51 12 \
--edit_type edit_trajectory_as_heart \
--edit_trajectory_as_heart_args 1 2 \
--append heart \
-s
# example 4: off-the-ground kickoff demo
export PYTHONPATH="/path/to/GVHMR/hmr4d:$PYTHONPATH"  # replace with your local GVHMR path
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/male.mp4 \
--motion_path inputs/demo/male \
--reference_path inputs/demo/male \
--output_root inputs/demo \
--window_size 25 \
--pause_at_begin 200 \
--edit_type edit_trajectory_kickoff \
--edit_trajectory_kickoff_args 546 515 277 510 186 752 311 929 282 601 45 687 \
--speed_ratio 20 \
--append kickoff \
-s
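The numeric pairs passed to `--edit_trajectory_args` in the examples above look like 2D image waypoints; as described in the abstract, such edited 2D trajectories are unprojected into 3D with focal-length calibration. The following is only a conceptual sketch of a pinhole ground-plane unprojection; the camera convention, intrinsics, and camera height are assumptions, not the project's implementation.

```python
# Conceptual pinhole unprojection of a 2D ground-contact pixel onto the ground
# plane. Assumed convention: camera at the origin, x right, y down, z forward;
# the ground plane sits at y = cam_height below the camera.
import numpy as np

def unproject_to_ground(u, v, f, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the ground plane."""
    d = np.array([(u - cx) / f, (v - cy) / f, 1.0])  # ray direction in camera coords
    if d[1] <= 0:
        raise ValueError("pixel lies above the horizon; no ground intersection")
    t = cam_height / d[1]
    return t * d  # 3D waypoint in camera coordinates

# Example with assumed intrinsics: 1920x1080 frame, calibrated focal length
# f = 1400 px, principal point at the image centre, camera 1.6 m above ground.
print(unproject_to_ground(952, 1050, f=1400.0, cx=960.0, cy=540.0, cam_height=1.6))
```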
For a kid, add `--kid 1.0`. One can use a float between 0 and 1 to interpolate between adult and kid.
The human mask and HaMeR (hand pose) results are optional, but providing them can improve video quality. To obtain the human mask, one can install MatAnyone locally or use the MatAnyone Online Demo; without it, we will extract a mask from the SMPL-X depth. To obtain the HaMeR results, please refer to HaMeR Preparation; without them, we will use the standard hand pose in SMPL-X.
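As a rough illustration of the fallback mask, a rendered SMPL-X depth map can be thresholded into a binary mask; the zero-depth background convention here is an assumption, not the project's exact procedure.

```python
# Minimal sketch: derive a human mask from a rendered SMPL-X depth map.
import numpy as np

def mask_from_smplx_depth(depth: np.ndarray) -> np.ndarray:
    """depth: (H, W) rendered SMPL-X depth; returns a 0/255 uint8 mask."""
    return (depth > 0).astype(np.uint8) * 255
```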
Disclaimer
This project is released for academic use. We disclaim responsibility for user-generated content.
Citation
@article{liang2025realismotion,
title={RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space},
author={Liang, Jingyun and Zhou, Jingkai and Li, Shikai and Cao, Chenjie and Sun, Lei and Qian, Yichen and Chen, Weihua and Wang, Fan},
journal={arXiv preprint arXiv:2508.08588},
year={2025}
}
Acknowledgement
We thank the authors of WHAM, 4D-Humans, and ViTPose-Pytorch for their great work, without which our project/code would not be possible.