
Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

¹Visual Geometry Group, University of Oxford, ²Naver Labs Europe
arXiv 2025

Geo4D repurposes a video diffusion model for monocular 4D reconstruction.

Abstract

We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.
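
To make the three predicted modalities concrete, here is a minimal shape sketch; the window length, resolution, and the 6-channel origin-plus-direction ray parameterization are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical shapes for the three geometric modalities; all names
# and sizes are illustrative assumptions, not the paper's settings.
import torch

T, H, W = 16, 288, 512                # assumed window length and resolution
point_map = torch.zeros(T, H, W, 3)   # per-pixel 3D point (x, y, z)
depth_map = torch.zeros(T, H, W)      # per-pixel depth
ray_map = torch.zeros(T, H, W, 6)     # per-pixel ray: origin (3) + direction (3)
```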

Video

Method

Overview of Geo4D. During training, the video condition is injected in two ways: locally, by concatenating the video's latent features with the diffused geometric features \( \mathbf{z}_t^{\mathbf{X}}, \mathbf{z}_t^{\mathbf{D}}, \mathbf{z}_t^{\mathbf{r}} \); and globally, via cross-attention in the denoising U-Net after CLIP encoding and a query transformer. During inference, the iteratively denoised latent features \( \hat{\mathbf{z}}_0^{\mathbf{X}}, \hat{\mathbf{z}}_0^{\mathbf{D}}, \hat{\mathbf{z}}_0^{\mathbf{r}} \) are decoded by the fine-tuned VAE decoder, followed by a multi-modal alignment optimization that yields a coherent 4D reconstruction.
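
As a rough illustration of the final alignment step, the sketch below jointly optimizes a per-window scale for the point maps and a per-window scale and shift for the depth maps, so that overlapping frames across windows agree and each depth map is consistent with the z-component of its point map. All names and the exact objective are our assumptions; the actual method may, for instance, also solve for a rigid transform per window and use the ray maps to recover camera poses.

```python
# Hypothetical sketch of multi-modal, multi-window alignment (not the
# authors' implementation). point_maps[w]: (T, H, W, 3) tensor for
# window w; depth_maps[w]: (T, H, W); overlaps: (i, j, fi, fj) tuples
# meaning frame fi of window i and frame fj of window j coincide.
import torch

def align_windows(point_maps, depth_maps, overlaps, iters=500, lr=1e-2):
    n = len(point_maps)
    log_s = torch.zeros(n, requires_grad=True)  # per-window point-map log-scale
    a = torch.ones(n, requires_grad=True)       # per-window depth scale
    b = torch.zeros(n, requires_grad=True)      # per-window depth shift
    opt = torch.optim.Adam([log_s, a, b], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        # cross-window consistency on shared frames
        for i, j, fi, fj in overlaps:
            pi = log_s[i].exp() * point_maps[i][fi]
            pj = log_s[j].exp() * point_maps[j][fj]
            loss = loss + (pi - pj).abs().mean()
        # cross-modal consistency: aligned depth should match point-map z
        for w in range(n):
            z = log_s[w].exp() * point_maps[w][..., 2]
            d = a[w] * depth_maps[w] + b[w]
            loss = loss + (z - d).abs().mean()
        loss.backward()
        opt.step()
    return log_s.detach().exp(), a.detach(), b.detach()
```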

Results

Interactive 4D Visualization

Left click: drag to rotate the view
Scroll wheel: scroll to zoom in/out
Right click: drag to move the view
W / S: move forward / backward
A / D: move left / right
Q / E: move up / down

Note: Results are downsampled 4× for efficient online rendering, and we do not mask out any points, to keep the comparison fair.

Comparison

Thanks to our group-wise inference scheme and the geometric prior inherited from the pretrained video diffusion model, our model produces consistent 4D geometry even under fast motion. For more comparisons, please visit the comparison page.

[Side-by-side comparison: MonST3R vs. Geo4D (ours).]
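
The group-wise inference mentioned above runs the diffusion model over overlapping sliding windows of frames. A minimal sketch of one way to construct such windows follows; the window length and stride are our guesses, not the paper's settings.

```python
# Hypothetical sliding-window construction for group-wise inference;
# win and stride values are illustrative, not the paper's settings.
def sliding_windows(num_frames: int, win: int = 16, stride: int = 8):
    starts = list(range(0, max(num_frames - win, 0) + 1, stride))
    if starts and starts[-1] + win < num_frames:  # make sure the tail is covered
        starts.append(num_frames - win)
    return [list(range(s, min(s + win, num_frames))) for s in starts]

# A 40-frame video yields windows starting at frames 0, 8, 16, and 24,
# each 16 frames long and overlapping its neighbor by 8 frames.
print([w[0] for w in sliding_windows(40)])  # [0, 8, 16, 24]
```

The overlapping frames are what a multi-window alignment, like the sketch above, can use to stitch per-window predictions into one consistent reconstruction.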

More Qualitative Results

Our method generalizes to a wide range of scenes with different dynamic objects and remains robust under varying camera and object motion. For more results, please visit the result page.

More Qualitative Results on Video Depth Estimation

Our method achieves state-of-the-art performance in video depth estimation and produces temporally consistent, highly detailed depth maps for diverse in-the-wild sequences.

Acknowledgment

Zeren Jiang was supported by a Clarendon Scholarship. This work was also supported by ERC s-UNION and EPSRC EP/Z001811/1 SYN3D.

We thank Junyi Zhang for discussing the MonST3R experiments with us.

We also thank Stanislaw Szymanowicz, Ruining Li, Runjia Li, Jianyuan Wang, Minghao Chen, Jinghao Zhou, Gabrijel Boduljak and Xingyi Yang for helpful suggestions and discussions.

BibTeX

@misc{Geo4D,
  title={Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction},
  author={Jiang, Zeren and Zheng, Chuanxia and Laina, Iro and Larlus, Diane and Vedaldi, Andrea},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}