We propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy carries rich physical information, including high-resolution 3D structures and dynamics. To compress such high-resolution occupancy effectively, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47M parameters. Additionally, we introduce Normalized Multi-View Attention in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
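To give an intuition for why a tri-plane latent is compact, the general idea (not the paper's learned VAE encoder) can be sketched as factorizing a dense (X, Y, Z) occupancy grid into three 2D planes. The pooling operator and grid sizes below are toy assumptions chosen only to illustrate the size reduction:

```python
import numpy as np

def triplane_project(occ):
    """Collapse a dense 3D occupancy grid of shape (X, Y, Z) into three
    2D planes by mean-pooling along each axis. This is a generic
    tri-plane factorization for illustration, not GenieDrive's encoder."""
    xy = occ.mean(axis=2)  # (X, Y) plane
    xz = occ.mean(axis=1)  # (X, Z) plane
    yz = occ.mean(axis=0)  # (Y, Z) plane
    return xy, xz, yz

# Toy grid: 200 x 200 cells in the ground plane, 16 height bins.
occ = (np.random.rand(200, 200, 16) > 0.9).astype(np.float32)
xy, xz, yz = triplane_project(occ)

dense = occ.size                    # values in the dense volume
tri = xy.size + xz.size + yz.size   # values in the three planes
print(f"tri-plane uses {tri / dense:.1%} of the dense volume")
```

A learned encoder would additionally compress each plane into feature channels, but even this naive projection shows why planes scale far better than volumes as resolution grows.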
Given the same initial driving scenario and different user driving controls (turn left, go straight, and turn right), our model generates diverse future driving videos that accurately reflect the effects of the controls.
We feed three driving trajectories (turn left, go straight, and turn right) into Vista, Epona, and our model to generate future driving videos for comparison. As shown in the videos below, all methods handle the go-straight control effectively, but only our GenieDrive generates physically plausible videos for turning left and right.
We can easily remove or insert objects in occupancy space and then generate driving videos conditioned on the edited occupancy. We visualize the editing process applied to driving videos; both removal and insertion take effect progressively over time.
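Editing in occupancy space amounts to overwriting voxel labels before the video model is conditioned on the result. A minimal sketch, using a hypothetical label map (real occupancy benchmarks define their own class IDs):

```python
import numpy as np

# Hypothetical semantic labels for illustration only.
FREE, CAR = 0, 4

def remove_object(occ, mask):
    """Return a copy of the occupancy grid with masked voxels set to FREE."""
    out = occ.copy()
    out[mask] = FREE
    return out

def insert_object(occ, mask, label):
    """Return a copy of the occupancy grid with masked voxels set to `label`."""
    out = occ.copy()
    out[mask] = label
    return out

# Toy 50 x 50 x 8 grid; insert a car-sized box, then remove it again.
occ = np.zeros((50, 50, 8), dtype=np.int64)
box = np.zeros_like(occ, dtype=bool)
box[10:14, 20:24, 0:2] = True

occ_inserted = insert_object(occ, box, CAR)
occ_removed = remove_object(occ_inserted, box)
```

Because edits are plain label assignments, they compose naturally across frames, which is what lets an edit take effect progressively over a 4D occupancy sequence.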
Our GenieDrive-L produces 81-frame multi-view driving videos, and by applying the rollout operation it can further generate 241-frame (~20 s) sequences, matching the longest video length in the nuScenes dataset.
The sim-to-real gap is largely caused by the unrealistic rendering quality of simulators. By contrast, synthetic occupancy shows no obvious discrepancy from real-world occupancy. We therefore take occupancy from the CARLA simulator and use our method to translate it into realistic multi-view driving videos.
@article{yang2025geniedrive,
author = {Yang, Zhenya and Liu, Zhe and Lu, Yuxiang and Hou, Liping and Miao, Chenxuan and Peng, Siyi and Feng, Bailan and Bai, Xiang and Zhao, Hengshuang},
title = {GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation},
journal = {arXiv preprint arXiv:2512.12751},
year = {2025},
}