GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Zhenya Yang1, Zhe Liu1, Yuxiang Lu1, Liping Hou2, Chenxuan Miao1, Siyi Peng2, Bailan Feng2, Xiang Bai3, Hengshuang Zhao1
1University of Hong Kong, 2Huawei Noah's Ark Lab, 3Huazhong University of Science and Technology
Teaser

GenieDrive generatively simulates future occupancy and corresponding multi-view videos given user driving controls and editing operations.

Abstract

We propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.

Method

Method Figure

Overall framework of GenieDrive. Our GenieDrive adopts a two-stage generation pipeline that first predicts future occupancy and then generates multi-view driving videos. In the occupancy generation stage, the current occupancy is encoded using a tri-plane VAE and processed by our Mutual Control Attention (MCA). The predicted occupancy is rendered into multi-view semantic maps, which are then fed into the DiT blocks enhanced by our Multi-View Attention (MVA) module to produce the final multi-view driving videos.
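The paper's tri-plane VAE is learned end-to-end; as a rough intuition for why a tri-plane latent is so much smaller than a dense voxel grid, the sketch below collapses a 3D occupancy grid onto its three axis-aligned planes with mean pooling. The grid resolution and the pooling stand-in are assumptions of this illustration, not the actual encoder.

```python
import numpy as np

def triplane_encode(occ):
    """Collapse a dense 3D occupancy grid into three axis-aligned planes.

    GenieDrive's VAE is a learned encoder; this mean-pooling stand-in
    only illustrates how tri-planes shrink the latent footprint.
    """
    xy = occ.mean(axis=2)  # pool over z -> (X, Y) plane
    xz = occ.mean(axis=1)  # pool over y -> (X, Z) plane
    yz = occ.mean(axis=0)  # pool over x -> (Y, Z) plane
    return xy, xz, yz

# hypothetical grid size, chosen for illustration only
occ = (np.random.rand(200, 200, 16) > 0.9).astype(np.float32)
planes = triplane_encode(occ)
dense = occ.size
tri = sum(p.size for p in planes)
print(f"dense voxels: {dense}, tri-plane entries: {tri} ({tri / dense:.1%})")
```

Even without any learned compression, the three planes hold well under a tenth of the dense grid's entries; the learned latent tri-plane additionally reduces channel and spatial resolution.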

Video Results

Physics-Aware Driving Video Generation

Given the same initial driving scenario and different user driving controls (turn left, go straight, and turn right), our model generates diverse future driving videos that accurately reflect the effects of the controls.

Comparison with Previous Driving World Models

We input three driving trajectories (turn left, go straight, and turn right) into Vista, Epona, and our model to generate future driving videos for comparison. As shown in the videos below, all methods handle the go-straight control effectively, but only our GenieDrive generates physically plausible videos for turn left and turn right.

Driving Video Editing

We can easily remove or insert objects in occupancy space and then generate driving videos conditioned on the edited occupancy. We visualize the editing process applied to driving videos: both removal and insertion take effect progressively over time.
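Editing in occupancy space amounts to overwriting voxel labels before the edited grid is re-rendered into semantic maps for video generation. A minimal sketch, assuming an integer-labeled voxel grid (0 = free) and axis-aligned box edits; the grid size and the class id used here are illustrative, not the dataset's label map:

```python
import numpy as np

# toy semantic occupancy grid: 0 = free, class ids > 0 = occupied
occ = np.zeros((40, 40, 8), dtype=np.int64)

def insert_box(occ, lo, hi, label):
    """Insert an axis-aligned box of voxels with a semantic class id."""
    out = occ.copy()
    out[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = label
    return out

def remove_box(occ, lo, hi):
    """Remove (free) all voxels inside an axis-aligned box."""
    out = occ.copy()
    out[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = 0
    return out

edited = insert_box(occ, (10, 10, 0), (14, 18, 3), label=4)  # label 4 is a hypothetical "vehicle" id
reverted = remove_box(edited, (10, 10, 0), (14, 18, 3))
print((edited > 0).sum(), (reverted > 0).sum())
```

Because the video model is conditioned on rendered occupancy rather than on pixels, such box-level edits propagate to all camera views consistently.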

Long Video Generation

Our GenieDrive-L produces 81-frame multi-view driving videos, and by applying the rollout operation it can further generate 241-frame (~20 s) sequences, the longest video length in the nuScenes dataset.
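The page does not spell out the rollout conditioning scheme, but the 81 → 241 frame counts are consistent with an autoregressive rollout that re-conditions each new chunk on the tail of the previous one: with one overlapping frame, each rollout step adds 80 new frames, so two steps extend 81 frames to 241. A minimal sketch under that assumption, with a stub in place of the actual sampler:

```python
def rollout(generate_chunk, init_cond, chunk_len=81, overlap=1, total=241):
    """Autoregressively extend a video: each chunk is conditioned on the
    last `overlap` frames of the previous chunk. `generate_chunk` stands
    in for the full GenieDrive-L sampler (an assumption of this sketch)."""
    frames = list(generate_chunk(init_cond, chunk_len))
    while len(frames) < total:
        cond = frames[-overlap:]
        chunk = generate_chunk(cond, chunk_len)
        frames.extend(chunk[overlap:])  # drop the re-generated overlap frames
    return frames[:total]

def fake_sampler(cond, n):
    # stand-in model: frame ids simply continue from the last conditioning frame
    return list(range(cond[-1], cond[-1] + n))

frames = rollout(fake_sampler, [0])
print(len(frames))  # 81 + 2 * 80 = 241
```

A larger overlap would trade generation length for smoother chunk boundaries; the sketch keeps it at one frame to match the arithmetic above.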

Sim-to-Real Generation

The sim-to-real gap is largely caused by the unrealistic rendering quality of the simulator. However, there is no obvious discrepancy between synthetic occupancy and real-world occupancy. Therefore, we leverage occupancy from the CARLA simulator and use our method to transfer the synthetic occupancy into realistic multi-view driving videos.

BibTeX

@article{yang2025geniedrive,
  author    = {Yang, Zhenya and Liu, Zhe and Lu, Yuxiang and Hou, Liping and Miao, Chenxuan and Peng, Siyi and Feng, Bailan and Bai, Xiang and Zhao, Hengshuang},
  title     = {GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation},
  journal   = {arXiv:2512.12751},
  year      = {2025},
}