Probabilistic Inverse Cameras:

Image to 3D via Multiview Geometry

CVPR 2025 (under review)

Abstract

We introduce a hierarchical probabilistic approach for going from a 2D image to multiview 3D: a diffusion “prior” models the unseen 3D geometry, which then conditions a diffusion “decoder” to generate novel views of the subject. We use a pointmap-based geometric representation in a multiview image format to coordinate the generation of multiple target views simultaneously. We facilitate correspondence between views by assuming fixed target camera poses relative to the source camera, and by constructing a predictable distribution of geometric features per target. Our modular, geometry-driven approach to novel-view synthesis (called “unPIC”) outperforms state-of-the-art baselines such as CAT3D and One-2-3-45 on held-out objects from Objaverse-XL, as well as on real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.
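To make the two-stage pipeline concrete, here is a minimal sketch of inference: geometry is first sampled from the diffusion prior, then novel views are decoded conditioned on that geometry. All names, array shapes, and the number of target cameras below are assumptions for illustration (the actual model is a pair of diffusion models; trivial stubs stand in for their samplers here), not the authors' API.

```python
# Minimal sketch of unPIC-style two-stage inference (assumed shapes/names).
import numpy as np

NUM_TARGETS = 8   # assumed: fixed set of target poses relative to the source
H = W = 64        # assumed resolution of the multiview image grid

def sample_geometry_prior(source_image: np.ndarray) -> np.ndarray:
    """Diffusion 'prior': samples a pointmap (per-pixel geometry) for each
    target view, laid out as a multiview image stack.
    Stub: returns random values in place of a real diffusion sampler."""
    return np.random.rand(NUM_TARGETS, H, W, 3)

def sample_view_decoder(source_image: np.ndarray,
                        pointmaps: np.ndarray) -> np.ndarray:
    """Diffusion 'decoder': generates RGB novel views conditioned on the
    source image and the sampled geometry. Stub implementation."""
    return np.random.rand(NUM_TARGETS, H, W, 3)

def image_to_multiview(source_image: np.ndarray) -> np.ndarray:
    # Stage 1: sample the unseen 3D geometry from the prior.
    pointmaps = sample_geometry_prior(source_image)
    # Stage 2: decode all target views at once, conditioned on that geometry.
    # Because target poses are fixed relative to the source camera, each
    # target slot sees a predictable distribution of geometric features.
    return sample_view_decoder(source_image, pointmaps)

views = image_to_multiview(np.random.rand(H, W, 3))
print(views.shape)  # (8, 64, 64, 3): one RGB frame per fixed target pose
```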

Google Scanned Objects (out of distribution)

Amazon Berkeley Objects (out of distribution)

Digital Twin Catalog (out of distribution)

Objaverse-XL holdouts

GQA real images (out of distribution)

Diversity of Outputs (from a single image)

All visualization inputs beyond the source image (e.g., alpha masks, the depth component of CROCS) are generated by our model.
The camera icons at the top-left of each video mark the viewpoint: red for the source frame, green for a particular target view.