Pippo : High-Resolution Multi-View Humans from a Single Image
Under Submission


Pippo generates 1K-resolution, multi-view, studio-quality images from a single photo in a single forward pass.
Pippo takes a full-body or face-only photo as input and seamlessly blends it with novel generated content.

TL;DR

Pippo is a 1K multi-view diffusion transformer:
  • pre-trained on 3B human images
  • post-trained on 2.5K studio captures
  • with pixel-aligned control via ControlMLP
  • generating over 5x more views at inference via Attention Biasing
  • evaluated with a new 3D consistency metric: Re-projection Error
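The re-projection-error metric presumably scores 3D consistency by triangulating matched keypoints across generated views and measuring how far the triangulated 3D point re-projects from each 2D observation. A minimal two-view DLT sketch of that idea (our construction; the function names and two-view setup are illustrative, not the paper's exact protocol):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation of one point from two 3x4 projection matrices.

    P1, P2: projection matrices; x1, x2: matched 2D pixel observations.
    Returns the least-squares 3D point (via SVD of the DLT system).
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                       # null-space solution, homogeneous
    return X[:3] / X[3]

def reprojection_error(P, X, x):
    """Pixel distance between the projection of 3D point X and observation x."""
    proj = P @ np.append(X, 1.0)
    return float(np.linalg.norm(proj[:2] / proj[2] - x))
```

In the noiseless case the triangulated point re-projects exactly; averaging this error over many keypoints and view pairs yields a scalar 3D-consistency score.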

Method and Training: Pippo

pipeline

Model pipeline. (Left) We use data from a studio capture to train our multi-view diffusion model (Right). We condition on a full reference photo and a cropped face, as well as the target-view cameras and a 2D-projected spatial anchor indicating head position and orientation. The spatial anchor is only used during training and is fixed to an arbitrary position at inference.
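The target-view camera conditioning is pixel-aligned: each pixel is tagged with the Plücker coordinates of its viewing ray, computed from the camera intrinsics and pose. A minimal NumPy sketch, assuming a standard pinhole model with world-to-camera extrinsics (the function name and conventions are ours, not from the paper):

```python
import numpy as np

def plucker_rays(K, R, t, h, w):
    """Per-pixel Plucker ray map of shape (h, w, 6) for a pinhole camera.

    K: (3,3) intrinsics; R, t: world-to-camera rotation and translation.
    Each ray is encoded as (d, m): unit direction d and moment m = o x d,
    where o is the camera center in world coordinates.
    """
    o = -R.T @ t                                       # camera center in world
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (h, w, 3) homogeneous
    # Back-project pixels to world-space directions and normalize
    d = pix @ np.linalg.inv(K).T @ R                   # rows are R^T K^-1 pix
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    m = np.cross(np.broadcast_to(o, d.shape), d)       # moment o x d
    return np.concatenate([d, m], axis=-1)             # (h, w, 6)
```

The (d, m) encoding is translation-aware yet image-aligned, which makes it a natural per-pixel condition for the ControlMLP described below.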



Contributions: ControlMLP and Attention Biasing

sub-modules


Diffusion Transformer with ControlMLP. We modulate a DiT with ControlMLP, a lightweight ControlNet-style module. The ControlMLP block injects pixel-aligned conditions, such as Plücker rays and the spatial anchor, into the DiT.
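A ControlNet-style injector of this kind can be sketched as a patchifier plus a zero-initialized MLP whose output is added as a residual to the DiT's token stream. A minimal PyTorch sketch (all dimensions, channel counts, and names are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ControlMLP(nn.Module):
    """Lightweight ControlNet-style injector (illustrative sketch).

    Maps pixel-aligned conditions (e.g. 6-channel Plucker rays plus a
    1-channel spatial-anchor map) to per-token residuals added to the DiT
    token stream. The final linear layer is zero-initialized, so training
    starts from the unmodified DiT, as in ControlNet.
    """
    def __init__(self, cond_ch=7, dim=1024, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(cond_ch, dim, kernel_size=patch, stride=patch)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, dim))
        nn.init.zeros_(self.mlp[1].weight)   # zero-init: no effect at step 0
        nn.init.zeros_(self.mlp[1].bias)

    def forward(self, cond):                 # cond: (B, cond_ch, H, W)
        tokens = self.patchify(cond).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.mlp(tokens)              # residual added to DiT tokens
```

The zero-initialization is the key design choice: the conditioning pathway contributes nothing at the start of training, so the pre-trained DiT is not perturbed until the injector learns a useful signal.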

Attention Biasing. Entropy (Y-axis) vs. scaling-factor growth (X-axis) for varying numbers of tokens (different plots). Our proposed fix yields better entropy attenuation.

compare

Attention Biasing Visuals. We adopt the attention biasing formulation from prior work and introduce a growth-factor hyperparameter (γ), set in the range 1.4-1.6 for optimal entropy attenuation (shown above).
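The intuition is that softmax attention entropy grows with the number of key tokens, so generating far more views than seen in training flattens the attention maps. Rescaling the logits by a factor that grows with token count counteracts this. A toy NumPy sketch, assuming a log-ratio scale s = γ·log(k)/log(k_train) as a stand-in for the growth-factor formulation (a guess at the style of the rule, not the paper's exact equation):

```python
import numpy as np

def attn_entropy(logits):
    """Mean entropy (nats) of the softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def biased_entropy(q, k_mat, k_train, gamma):
    """Attention entropy with log-ratio logit scaling (illustrative sketch)."""
    d = q.shape[-1]
    k = k_mat.shape[0]
    s = gamma * np.log(k) / np.log(k_train)   # grows as k exceeds k_train
    return attn_entropy(s * q @ k_mat.T / np.sqrt(d))

rng = np.random.default_rng(0)
d, k_train = 64, 256
q = rng.standard_normal((8, d))
for k in (256, 2048):                         # train-time vs 8x more tokens
    k_mat = rng.standard_normal((k, d))
    plain = attn_entropy(q @ k_mat.T / np.sqrt(d))
    biased = biased_entropy(q, k_mat, k_train, gamma=1.5)
    print(f"k={k}: plain entropy {plain:.2f}, biased entropy {biased:.2f}")
```

With 8x more key tokens, the unbiased entropy rises while the biased variant sharpens the logits enough to attenuate it, mirroring the plots above.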


All the visuals below show unseen subjects and were generated in a single forward pass.
Use the left and right arrow keys to scroll through the visuals, and click a video to play it.

Face-only: Turnarounds from a Single Image

Left: A tight face crop extracted from a photo taken with an iPhone; Right: Generated dense turnaround video (36 frames) at 512x512.


Left: A tight face crop extracted from a photo taken with an iPhone; Right: Generated short turnaround video (16 frames) at 1024x1024.

Full-body: Turnarounds from a Single Image

Left: Full-body photo taken with an iPhone; Right: Generated short turnaround video (16 frames) at 1024x1024.


Left: Full-body studio image; Right: Generated close-up short turnaround video (14 frames) at 512x512.

Head-only: Turnarounds from a Single Image

Left: Head-only studio photo as input; Right: Generated dense turnaround video (36 frames) at 512x512.

Full-body: Multi-view Video from Monocular Video (with Ground Truth)

Top Row: Ground Truth.
Bottom Row Left [Column 1]: Monocular input video of the subject moving at 512x512.
Bottom Row Right [Columns 2-7]: Generated multi-view video by running Pippo independently on each frame.

Note: Pippo auto-completes missing details (e.g. shoes or face) for each frame with a wide variety of possibilities!

Head-only: Multi-view Video from Monocular Video (with Ground Truth)

Top Row: Ground Truth.
Bottom Row Left [Column 1]: Monocular input video of the subject talking at 512x512.
Bottom Row Right [Columns 2-7]: Generated multi-view video by running Pippo independently on each frame.

Note: Pippo auto-completes missing regions (e.g. neck or clothing) for each frame with a wide variety of possibilities!

Full-body & Head-only: Spatial Anchor Visualization

Top Row: Full-body generations with corresponding fixed 3D Spatial Anchor.
Bottom Row: Head-only generations with corresponding fixed 3D Spatial Anchor.


Citation



We thank Michaël Gharbi and the Nerfies authors (Nerfies template) for their website templates.