Pippo: High-Resolution Multi-View Humans from a Single Image
Under Submission
- Yash Kant 1,2,3
- Ethan Weber 1,4
- Jin Kyu Kim 1
- Rawal Khirodkar 1
- Su Zhaoen 1
- Julieta Martinez 1
- * Igor Gilitschenski 2,3
- * Shunsuke Saito 1
- * Timur Bagautdinov 1
- 1 Meta Reality Labs
- 2 University of Toronto
- 3 Vector Institute
- 4 UC Berkeley
Pippo generates 1K-resolution, multi-view, studio-quality images from a single photo in one forward pass.
Pippo takes a full-body or face-only photo as input and blends the input well with novel generated content.
TL;DR
- pre-trained on 3B human images
- post-trained on 2.5K studio captures
- with pixel-aligned control via ControlMLP
- generates over 5x as many views at inference as during training, via Attention Biasing
- an improved 3D evaluation metric: Re-projection Error
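The page names the re-projection error metric without spelling out its computation. In the generic multi-view geometry sense, it is the pixel distance between where a 3D point projects into a view and where that point was observed. The sketch below illustrates only that generic quantity. It assumes a pinhole camera with a single focal length and points already expressed in camera coordinates. How Pippo obtains the 3D points and the 2D correspondences is not stated here, and the names `project` and `reprojection_error` are hypothetical.

```python
import math

def project(point, focal, center):
    """Project a 3D point (camera coordinates) to pixels with a pinhole model."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal * x / z + center[0], focal * y / z + center[1])

def reprojection_error(points_3d, observed_2d, focal, center):
    """Mean Euclidean pixel distance between projections and observations."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points_3d, observed_2d):
        u, v = project(point, focal, center)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(points_3d)

# Toy data (assumed values): the first point is observed exactly where it
# projects, the second observation is off by (3, 4) pixels.
points = [(0.0, 0.0, 2.0), (0.5, -0.25, 4.0)]
observed = [(256.0, 256.0), (321.5, 228.75)]
error = reprojection_error(points, observed, focal=500.0, center=(256.0, 256.0))
print(error)  # 2.5
```

A lower value means the generated views agree better with a single underlying 3D structure.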
Method and Training: Pippo
![pipeline](img/assets/pippo-pipeline.jpg)
Model pipeline. (Left) We use data from a studio capture to train our multi-view diffusion model (Right). We condition on a full reference photo and a cropped face, as well as the target-view cameras and a 2D-projected spatial anchor indicating head position and orientation. The spatial anchor is only used for training, and is fixed to an arbitrary position during inference.
Contributions: ControlMLP and Attention Biasing
![sub-modules](img/assets/sub-modules.png)
Diffusion Transformer with ControlMLP.
We use a DiT modulated by a ControlMLP, a lightweight ControlNet-style module. The ControlMLP block injects pixel-aligned
conditions, such as Plücker rays and the spatial anchor, into the DiT.
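As a rough illustration of what a pixel-aligned Plücker ray condition looks like, the sketch below computes the 6D Plücker coordinates of the ray through one pixel. It is not Pippo's implementation. It assumes a pinhole camera whose axes are aligned with the world axes (identity rotation), and it uses the (direction, moment) ordering, which varies between codebases. All function names are hypothetical.

```python
import math

def pixel_ray_direction(u, v, focal, center):
    """Unit direction of the ray through pixel (u, v), camera axes = world axes."""
    d = ((u - center[0]) / focal, (v - center[1]) / focal, 1.0)
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_ray(origin, direction):
    """6D Plücker coordinates (direction, moment), moment = origin x direction."""
    return tuple(direction) + cross(origin, direction)

# Ray through the principal point of a camera placed at (1, 0, 0).
origin = (1.0, 0.0, 0.0)
direction = pixel_ray_direction(256.0, 256.0, focal=500.0, center=(256.0, 256.0))
ray = plucker_ray(origin, direction)
print(ray)  # (0.0, 0.0, 1.0, 0.0, -1.0, 0.0)
```

Evaluating this for every pixel of a target view yields a 6-channel map aligned with the image grid, which is the kind of input a pixel-aligned control module can consume.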
Attention Biasing.
Entropy (Y-axis) vs. scaling factor growth (X-axis) across varying numbers of tokens (different plots). Our proposed fix leads to better entropy attenuation.
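To make the plotted quantity concrete, the sketch below computes the entropy of a single softmax attention row and shows two general effects: entropy rises as the number of tokens grows, and multiplying the logits by a larger scaling factor lowers it again. The logits are a deterministic toy stand-in for query-key scores, and `scale` is a generic multiplier. The specific rule Pippo uses to grow the scaling factor is not given on this page, so none is assumed here.

```python
import math

def softmax(logits, scale=1.0):
    """Numerically stable softmax of scale * logits."""
    scaled = [scale * x for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(logits, scale=1.0):
    """Shannon entropy (in nats) of one softmax attention row."""
    return -sum(p * math.log(p) for p in softmax(logits, scale) if p > 0.0)

def toy_logits(num_tokens):
    """Deterministic stand-in for query-key scores (hypothetical data)."""
    return [math.sin(0.37 * i) for i in range(num_tokens)]

few = attention_entropy(toy_logits(64))            # fewer tokens
many = attention_entropy(toy_logits(320))          # 5x more tokens, same scale
many_sharpened = attention_entropy(toy_logits(320), scale=4.0)

print(few < many)             # True: more tokens spread attention out
print(many_sharpened < many)  # True: a larger scale concentrates it again
```

Higher entropy means attention is spread more thinly across tokens, which is why adding views at inference can blur the result unless the scaling factor is adjusted.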
Face-only: Turnarounds from a Single Image
Left: We extract a tight face crop from a photo taken with an iPhone; Right: Generated dense turnaround video (36 frames) at 512x512.
Left: We extract a tight face crop from a photo taken with an iPhone; Right: Generated short turnaround video (16 frames) at 1024x1024 resolution.
Full-body: Turnarounds from a Single Image
Left: Full-body photo taken with an iPhone; Right: Generated short turnaround video (16 frames) at 1024x1024 resolution.
Left: Full-body studio image; Right: Generated close-up short turnaround video (14 frames) at 512x512.
Head-only: Turnarounds from a Single Image
Left: Head-only studio photo as input; Right: Generated dense turnaround video (36 frames) at 512x512.
Full-body: Multi-view Video from Monocular Video (with Ground Truth)
Top Row: Ground Truth.
Bottom Row Left [Column 1]: Monocular input video of the subject moving at 512x512.
Bottom Row Right [Columns 2-7]: Generated multi-view video by running Pippo independently on each frame.
Note: Pippo auto-completes missing details (e.g. shoes or face) for each frame with a wide variety of possibilities!
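Running the model independently on each frame can be sketched as a simple loop followed by a regrouping step. This is a minimal sketch, not Pippo's actual interface: `generate_views` is a hypothetical callable that maps one input frame to a fixed number of output views, and the stand-in generator below only produces labels instead of images.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
View = TypeVar("View")

def multiview_video(
    frames: Sequence[Frame],
    generate_views: Callable[[Frame], Sequence[View]],
) -> List[List[View]]:
    """Run a single-image multi-view generator on every frame independently.

    Returns one list per view, each holding that view's images in frame order.
    """
    per_frame = [list(generate_views(frame)) for frame in frames]
    if not per_frame:
        return []
    num_views = len(per_frame[0])
    if any(len(views) != num_views for views in per_frame):
        raise ValueError("every frame must yield the same number of views")
    # Regroup from frame-major to view-major order.
    return [[views[i] for views in per_frame] for i in range(num_views)]

def fake_generator(frame):
    """Stand-in for the model (assumption): six labeled outputs per frame."""
    return [f"{frame}-view{i}" for i in range(6)]

videos = multiview_video(["f0", "f1", "f2"], fake_generator)
print(len(videos))  # 6
print(videos[0])    # ['f0-view0', 'f1-view0', 'f2-view0']
```

Because no information is shared between frames in this setup, details that are missing from the input can be completed differently from one frame to the next.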
Head-only: Multi-view Video from Monocular Video (with Ground Truth)
Top Row: Ground Truth.
Bottom Row Left [Column 1]: Monocular input video of the subject talking at 512x512.
Bottom Row Right [Columns 2-7]: Generated multi-view video by running Pippo independently on each frame.
Note: Pippo auto-completes missing regions (e.g. neck or clothing) for each frame with a wide variety of possibilities!
Full-body & Head-only: Spatial Anchor Visualization
Top Row: Full-body generations with corresponding fixed 3D Spatial Anchor.
Bottom Row: Head-only generations with corresponding fixed 3D Spatial Anchor.
Citation
We thank Michaël Gharbi and the Nerfies authors (Nerfies template) for their website templates.