SPAD : Spatially Aware Multiview Diffusers
Accepted to CVPR, 2024
Yash Kant
University of Toronto & Vector Institute
Ziyi Wu
University of Toronto & Vector Institute
Michael Vasilkovsky
Snap Research
Guocheng Qian
KAUST
Jian Ren
Snap Research
Riza Alp Guler
Snap Research
Bernard Ghanem
KAUST
Sergey Tulyakov*
Snap Research
Igor Gilitschenski*
University of Toronto & Vector Institute
Aliaksandr Siarohin*
Snap Research
Overview: SPAD
Given a text prompt, our method synthesizes 3D-consistent views of the same object. Our model can generate many images from arbitrary camera viewpoints while being trained on only four views. Here, we show eight views, sampled uniformly in azimuth at a fixed elevation and generated in a single forward pass.
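As a rough illustration of this viewpoint layout, the sketch below builds camera-to-world poses spaced uniformly in azimuth at a fixed elevation. It is not part of SPAD's released code; the `look_at_pose` helper, the radius, and the axis convention are illustrative assumptions.

```python
import numpy as np

def look_at_pose(azimuth_deg, elevation_deg, radius=2.0):
    """Camera-to-world pose looking at the origin from a sphere of given radius.

    Assumes a z-up world and an OpenGL-style camera (looks down its -z axis).
    """
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    cam_pos = radius * np.array([
        np.cos(el) * np.cos(az),
        np.cos(el) * np.sin(az),
        np.sin(el),
    ])
    forward = -cam_pos / np.linalg.norm(cam_pos)      # viewing direction (toward origin)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([right, up, -forward], axis=1)
    pose[:3, 3] = cam_pos
    return pose

# Eight views uniformly spaced in azimuth at a fixed elevation.
poses = [look_at_pose(az, elevation_deg=15.0)
         for az in np.linspace(0, 360, 8, endpoint=False)]
```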
Method: SPAD
Model pipeline.
(a) We fine-tune a pre-trained text-to-image diffusion model on multi-view renderings of 3D objects.
(b) Our model jointly denoises noisy multi-view images conditioned on text and relative camera poses.
To enable cross-view interaction, we apply 3D self-attention by concatenating all views, and enforce epipolar constraints on the attention map.
(c) We further add Plücker Embeddings to the attention layers as positional encodings to enhance camera control.
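A minimal PyTorch sketch of step (b) is given below: tokens from all views are concatenated into one sequence, and the attention map is masked so each token attends only where the epipolar constraint allows. The module and argument names (`EpipolarSelfAttention`, `epipolar_mask`) are illustrative assumptions, not SPAD's actual implementation.

```python
import torch
import torch.nn as nn

class EpipolarSelfAttention(nn.Module):
    """Sketch of 3D self-attention over concatenated views with an epipolar mask."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, epipolar_mask):
        # x: (B, V, N, C) per-view feature tokens.
        # epipolar_mask: (B, V*N, V*N) bool, True where attention is allowed
        # (own-view tokens plus tokens near the epipolar lines in other views).
        B, V, N, C = x.shape
        tokens = x.reshape(B, V * N, C)                  # concatenate all views
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        h = self.num_heads
        q, k, v = (t.reshape(B, V * N, h, C // h).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (C // h) ** -0.5
        attn = attn.masked_fill(~epipolar_mask[:, None], float("-inf"))  # epipolar constraint
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, V * N, C)
        return self.proj(out).reshape(B, V, N, C)
```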
Epipolar Attention (Left).
For each source point S on a feature map, we compute its epipolar lines on all other views.
S then attends only to features along these lines, plus all points in its own view (blue points).
Illustration of one block in SPAD (Right).
We add Plücker Embeddings to the feature maps in the self-attention layer by inflating the original QKV projection layers with zero-initialized projections.
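The sketch below illustrates both ingredients under simple assumptions: per-pixel Plücker ray coordinates (direction, origin × direction) for a pinhole camera, and a helper that widens a pretrained projection layer with zero-initialized columns for the extra channels so the model is unchanged at initialization. The function names and camera conventions are illustrative, not SPAD's code.

```python
import torch
import torch.nn as nn

def plucker_embedding(c2w, intrinsics, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    c2w: (4, 4) camera-to-world pose; intrinsics: (3, 3) K matrix.
    Returns an (H, W, 6) embedding.
    """
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    dirs_cam = torch.stack([(i - cx) / fx, (j - cy) / fy, torch.ones_like(i)], dim=-1)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs_world)
    return torch.cat([dirs_world, torch.cross(origin, dirs_world, dim=-1)], dim=-1)

def inflate_projection(old_proj: nn.Linear, extra_in: int = 6) -> nn.Linear:
    """Widen a pretrained Q/K/V projection to accept extra Plücker channels,
    zero-initializing the new input columns so initial behavior is preserved."""
    new_proj = nn.Linear(old_proj.in_features + extra_in, old_proj.out_features,
                         bias=old_proj.bias is not None)
    with torch.no_grad():
        new_proj.weight.zero_()
        new_proj.weight[:, :old_proj.in_features] = old_proj.weight
        if old_proj.bias is not None:
            new_proj.bias.copy_(old_proj.bias)
    return new_proj
```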
Text-to-3D Generation: Multi-view Triplane Generator
Text-to-3D Generation: Multiview SDS
Thanks to our 3D-consistent multi-view generation, we can leverage multi-view Score Distillation Sampling (SDS) for 3D asset generation. We integrate SPAD into threestudio and follow the training setup of MVDream to train a NeRF.
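As a hedged sketch of what one multi-view SDS step could look like, the snippet below noises the NeRF's rendered views, asks a multi-view denoiser for a joint noise prediction, and turns the residual into a surrogate loss. The `diffusion_model` and `scheduler` arguments are placeholders for a SPAD-like denoiser and its noise schedule, not a specific threestudio API.

```python
import torch

def multiview_sds_loss(renderings, diffusion_model, text_emb, cam_poses, scheduler):
    """One multi-view SDS step (illustrative sketch).

    renderings: (V, C, H, W) NeRF renders of V views with requires_grad=True.
    """
    t = torch.randint(20, 980, (1,), device=renderings.device)   # random diffusion timestep
    noise = torch.randn_like(renderings)
    alpha_bar = scheduler.alphas_cumprod[t].view(1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * renderings + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():
        # Jointly denoise all views, conditioned on text and relative camera poses.
        eps_pred = diffusion_model(noisy, t, text_emb, cam_poses)

    w = 1 - alpha_bar                                            # common SDS weighting
    grad = w * (eps_pred - noise)
    # Surrogate loss whose gradient w.r.t. the renderings equals `grad`.
    return (grad.detach() * renderings).sum()
```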
Quantitative Result: Novel View Synthesis
To evaluate the 3D consistency of our method, we adapt SPAD to the image-conditioned novel view synthesis task. We test on 1,000 unseen Objaverse objects and all objects from the Google Scanned Objects (GSO) dataset.
Qualitative Result: Comparison with MVDream
Qualitative Result: Close View Generation
We demonstrate smooth transitions between views by generating close viewpoints, each varying by 10 degrees in azimuth.
Qualitative Result: Ablation Study
We ablate our method on various design choices to demonstrate their importance.
Plücker Embeddings help prevent the generation of flipped views. Without them, the model sometimes predicts image regions that are rotated by 180° (highlighted by the red circles) due to the ambiguity of epipolar lines.
The website template was borrowed from Michaël Gharbi.