SPAD: Spatially Aware Multiview Diffusers
Accepted to CVPR 2024


Overview: SPAD



Given a text prompt, our method synthesizes 3D-consistent views of the same object. The model can generate images from arbitrary camera viewpoints despite being trained on only four views. Here, we show eight views sampled uniformly at a fixed elevation, all generated in a single forward pass.
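For concreteness, here is a minimal sketch of how such fixed-elevation viewpoints can be parameterized. The `look_at_pose` helper, the radius, and the OpenGL-style axis convention are our assumptions, not the paper's code:

```python
import numpy as np

def look_at_pose(azimuth_deg, elevation_deg, radius=2.0):
    """Camera-to-world matrix for a camera on a sphere, looking at the origin.
    (Hypothetical helper; SPAD's actual camera convention may differ.)"""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    eye = radius * np.array([np.cos(el) * np.cos(az),
                             np.cos(el) * np.sin(az),
                             np.sin(el)])
    forward = -eye / np.linalg.norm(eye)                 # look toward the origin
    right = np.cross(forward, [0.0, 0.0, 1.0])
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([right, up, -forward], axis=1)  # OpenGL camera axes
    pose[:3, 3] = eye
    return pose

# Eight views sampled uniformly in azimuth at a fixed elevation, as in the figure.
poses = [look_at_pose(az, elevation_deg=15.0) for az in np.arange(0.0, 360.0, 45.0)]
```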

Method: SPAD

[Figure: model pipeline]

Model pipeline. (a) We fine-tune a pre-trained text-to-image diffusion model on multi-view renderings of 3D objects.
(b) Our model jointly denoises noisy multi-view images conditioned on text and relative camera poses. To enable cross-view interaction, we apply 3D self-attention by concatenating all views, and enforce epipolar constraints on the attention map.
(c) We further add Plücker Embeddings to the attention layers as positional encodings to enhance camera control.
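The epipolar constraint in (b) amounts to masking the attention map computed over all concatenated views. Below is a minimal sketch under that reading; the construction of the mask itself (projecting each query pixel's ray into the other views) is assumed and omitted here:

```python
import torch

def epipolar_3d_self_attention(q, k, v, epipolar_mask):
    """3D self-attention over the tokens of all V views concatenated along
    the sequence axis, restricted by a per-query epipolar mask.

    q, k, v:        (B, V*N, D) projected tokens from all views.
    epipolar_mask:  (B, V*N, V*N) bool; True where a query may attend, i.e.
                    every token of its own view plus tokens near its epipolar
                    line in the other views (so no row is fully masked).
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale           # (B, V*N, V*N) scores
    attn = attn.masked_fill(~epipolar_mask, float("-inf"))
    attn = attn.softmax(dim=-1)
    return attn @ v                                    # (B, V*N, D)
```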


[Figure: epipolar attention and SPAD block sub-modules]

Epipolar Attention (Left). For each source point S on a feature map, we compute its epipolar lines in all other views. S attends only to features along these lines, plus all points in its own view (blue points).
Illustration of one block in SPAD (Right). We add Plücker Embeddings to the feature maps in the self-attention layer by inflating the original QKV projection layers with zero-initialized projections.
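A sketch of these two ingredients: Plücker coordinates for per-pixel rays, and a zero-initialized inflation of a pretrained QKV projection. Zero-initializing the new input columns means the added ray conditioning has no effect at the start of fine-tuning, so the model begins exactly at the pretrained behavior. Function names and the 6-channel layout are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def plucker_embedding(origins, dirs):
    """Plücker coordinates (d, o x d) of per-pixel camera rays, shape (..., 6)."""
    d = F.normalize(dirs, dim=-1)
    return torch.cat([d, torch.cross(origins, d, dim=-1)], dim=-1)

def inflate_projection(proj: nn.Linear, extra_dim: int = 6) -> nn.Linear:
    """Widen a pretrained Q/K/V projection to accept [features | Plücker],
    zero-initializing the new columns so the output is unchanged at init."""
    new = nn.Linear(proj.in_features + extra_dim, proj.out_features,
                    bias=proj.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :proj.in_features].copy_(proj.weight)
        if proj.bias is not None:
            new.bias.copy_(proj.bias)
    return new
```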



Text-to-3D Generation: Multi-view Triplane Generator

[Figure: multi-view triplane NeRF generation]

Similar to Instant3D, we train a multi-view-to-NeRF generator. We feed four multi-view generations from SPAD (shown in the bottom half) to the generator to create 3D assets. The entire generation takes ~10 seconds.
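The generator's 3D representation is a triplane NeRF. As a minimal sketch of the standard triplane feature lookup (the plane projections and per-plane concatenation are our assumptions, not necessarily the paper's exact design):

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, xyz):
    """Sample triplane features for 3D points.

    planes: (3, C, H, W) feature maps for the XY, XZ, and YZ planes.
    xyz:    (N, 3) query points normalized to [-1, 1]^3.
    Returns (N, 3*C) features (per-plane bilinear samples, concatenated),
    which would then be decoded by a small NeRF MLP into density and color.
    """
    projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                               # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid, align_corners=True)  # (1, C, N, 1)
        feats.append(f.squeeze(0).squeeze(-1).t())                # (N, C)
    return torch.cat(feats, dim=-1)
```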

Text-to-3D Generation: Multiview SDS

Thanks to our 3D-consistent multi-view generation, we can leverage multi-view Score Distillation Sampling (SDS) for 3D asset generation. We integrate SPAD into threestudio and follow the training setup of MVDream to train a NeRF.
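As a sketch of what one multi-view SDS step might look like: the weighting and the reparameterized MSE form follow the common SDS recipe, and `mv_unet` with its signature is hypothetical, not SPAD's actual interface:

```python
import torch
import torch.nn.functional as F

def multiview_sds_loss(mv_unet, x0, t, cond, alphas_cumprod):
    """One SDS step on a batch of NeRF renderings x0: (V, 3, H, W).

    Because mv_unet denoises all V views jointly (with epipolar attention),
    the distilled gradient is consistent across viewpoints.
    """
    noise = torch.randn_like(x0)
    a_t = alphas_cumprod[t]
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise   # forward-diffuse renders
    with torch.no_grad():
        eps_pred = mv_unet(x_t, t, cond)               # joint multi-view denoising
    w = 1.0 - a_t                                      # common SDS weighting
    grad = w * (eps_pred - noise)
    # Reparameterized SDS: MSE to a detached target has gradient `grad` w.r.t. x0.
    target = (x0 - grad).detach()
    return 0.5 * F.mse_loss(x0, target, reduction="sum")
```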


Quantitative Result: Novel View Synthesis

To evaluate the 3D consistency of our method, we adapt SPAD to the image-conditioned novel view synthesis task. We test on 1,000 unseen Objaverse objects and on all objects from the Google Scanned Objects (GSO) dataset.

[Figure: novel view synthesis results]

SPAD faithfully preserves structural and perceptual details. Our method achieves competitive PSNR and SSIM, and sets a new state of the art on LPIPS.
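For reference, a sketch of how these three metrics are typically computed, using scikit-image and the lpips package; the LPIPS backbone choice and the exact evaluation protocol here are assumptions, not the paper's stated setup:

```python
import lpips                      # pip install lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="vgg")  # backbone choice is an assumption

def nvs_metrics(pred, gt):
    """pred, gt: (H, W, 3) float32 numpy arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2 - 1
    lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()  # LPIPS expects [-1, 1]
    return psnr, ssim, lp
```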


Qualitative Result: Comparison with MVDream

[Figure: qualitative comparison with MVDream]

Qualitative Result: Close View Generation

We demonstrate smooth transitions between views by generating close viewpoints, each separated by 10° in azimuth.

[Figure: close views, 10° apart in azimuth]

Qualitative Result: Ablation Study

We ablate our method on various design choices to demonstrate their importance.

[Figure: ablation study]

Epipolar Attention promotes better camera control in SPAD. Directly applying 3D self-attention to all views leads to content copying between generated images, as highlighted by the blue circles.
Plücker Embeddings help prevent flipped views. Without them, the model sometimes predicts image regions rotated by 180°, as highlighted by the red circles, due to the inherent ambiguity of epipolar lines.


