unPIC: A Geometric Multiview Prior for Image to 3D Synthesis

(under review)

Abstract

We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" predicts the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation to coordinate the generation of multiple target views simultaneously. We construct a predictable distribution of geometric features per target view, enabling learnability across examples and generalization to arbitrary input images. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") outperforms competing baselines such as CAT3D, EscherNet, Free3D, and One-2-3-45 on held-out objects from Objaverse-XL, as well as on unseen real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.
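To make the hierarchy concrete, below is a minimal sketch of the prior-to-decoder sampling flow. The class names, method signatures, and zero-filled placeholder outputs are illustrative assumptions standing in for the actual diffusion models; only the two-stage structure (geometry first, then appearance conditioned on it) follows the abstract.

```python
import numpy as np

class GeometryPrior:
    """Diffusion "prior" (placeholder): predicts per-target-view geometric
    features (e.g., pointmaps) for the subject's unseen 3D structure."""
    def sample(self, source_image: np.ndarray, num_views: int) -> np.ndarray:
        h, w, _ = source_image.shape
        # A real model would run reverse diffusion here; we return zeros.
        return np.zeros((num_views, h, w, 3), dtype=np.float32)

class MultiviewDecoder:
    """Diffusion "decoder" (placeholder): generates the target RGB views,
    conditioned on the source image and the predicted geometry."""
    def sample(self, source_image: np.ndarray, geometry: np.ndarray) -> np.ndarray:
        num_views, h, w, _ = geometry.shape
        return np.zeros((num_views, h, w, 3), dtype=np.float32)

def image_to_multiview(source_image: np.ndarray, num_views: int = 8) -> np.ndarray:
    """Hierarchical sampling: geometry first, then appearance.

    The shared geometry conditions all target views, which is what lets
    the decoder generate the views jointly and consistently."""
    geometry = GeometryPrior().sample(source_image, num_views)
    return MultiviewDecoder().sample(source_image, geometry)

if __name__ == "__main__":
    src = np.zeros((256, 256, 3), dtype=np.float32)  # dummy source image
    views = image_to_multiview(src, num_views=8)
    print(views.shape)  # (8, 256, 256, 3)
```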

Google Scanned Objects (out of distribution)

Amazon Berkeley Objects (out of distribution)

Digital Twin Catalog (out of distribution)

Objaverse-XL Holdouts

GQA Real Images (out of distribution)

Diversity of Outputs (from a single image)

All visualization inputs beyond the source image (e.g., alpha masks, the depth component of CROCS) are generated by our model.
The camera icons at the top left of each video indicate the viewpoint: a red camera marks the source frame, and a green camera marks a particular target view.