What I know... (Yash Kant)

This page highlights a few bits (engineering and research related) that don't get mentioned in my CV, but are equally (if not more) important.

I will update this page asynchronously after I wrap up a project and distill my learnings.

Last Update: 8 Feb 2024 // Started: 11 Jan 2024

Infrastructure

Data and Training.
  • I have written high-throughput dataloaders for vision-based tasks using FFCV, WebDataset, and DALI to train on massive datasets (> 10 TB).
  • Training iNVS required incorporating softmax-splatting (to create partial views) inside a data worker, which led to frequent worker timeouts and frozen multi-node training runs. To make things run smoothly, I cached the entire dataset into chunked .beton files (FFCV's native format) and wrote a custom dataloader that reads these chunks in parallel without overlap across nodes (see the sharding sketch after this list).
  • For SPAD, to integrate epipolar attention masks within the self-attention layers of Stable Diffusion, I experimented with several fast attention libraries (xformers, FlashAttention-2, memory-efficient attention, etc.); xformers was the best! See the attention-bias sketch after this list.
  • I use omegaconf for config management, lightning for training in bfloat16 precision, and wandb for visualizations.
  • I love einops!
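
A minimal sketch of the chunk-sharding idea from the iNVS dataloader above: assign the cached .beton chunks to ranks round-robin, so no two processes (and hence no two nodes) ever read the same chunk. The directory layout and the FFCV Loader settings in the comment are assumptions for illustration, not the exact code I used.

```python
import os
from glob import glob

import torch.distributed as dist


def chunks_for_this_rank(chunk_dir: str) -> list:
    """Round-robin assignment of cached .beton chunks, so reads never overlap across ranks."""
    chunks = sorted(glob(os.path.join(chunk_dir, "*.beton")))
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    # rank r reads chunks r, r + world_size, r + 2 * world_size, ...
    return chunks[rank::world_size]


# Each rank then cycles through its own chunk list with a regular FFCV Loader, e.g.
#   ffcv.loader.Loader(chunk_path, batch_size=..., num_workers=..., order=OrderOption.RANDOM)
```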
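
For the epipolar masks, here is a minimal sketch of feeding a precomputed boolean mask into xformers as an additive attention bias. This is the general xformers pattern, not the exact SPAD code: the shapes, the random placeholder mask, and the dtypes are assumptions, and some xformers versions impose alignment constraints on the bias tensor.

```python
import torch
import xformers.ops as xops

B, H, N, D = 1, 8, 2048, 64   # batch, heads, tokens from two concatenated views, head dim
q = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)

# Boolean mask: True where token i may attend to token j (placeholder; the real mask
# comes from epipolar geometry between the two views).
epipolar_mask = torch.rand(B, 1, N, N, device="cuda") > 0.5

# Convert to an additive bias: 0 where allowed, -inf where masked.
attn_bias = torch.zeros(B, H, N, N, device="cuda", dtype=q.dtype)
attn_bias.masked_fill_(~epipolar_mask, float("-inf"))

out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)   # (B, N, H, D)
```

One gotcha: make sure every query keeps at least one unmasked key, otherwise that softmax row turns into NaNs.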
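
And since I mentioned einops, a tiny (made-up) example of why I like it: splitting and merging attention heads reads like documentation.

```python
import torch
from einops import rearrange

x = torch.randn(4, 1024, 512)                     # (batch, tokens, channels)
q = rearrange(x, "b n (h d) -> b h n d", h=8)     # split channels into 8 attention heads
y = rearrange(q, "b h n d -> b n (h d)")          # and merge them back
assert torch.equal(x, y)
```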

3DV.

Finetuning 2D Diffusers for 3DV.
  • How to inject 3D understanding? Here are the options I know of:
    • Inject camera conditioning via 1D/2D features: Zero123
      • Global features are captured well, but this cannot guarantee 3D-consistent outputs or preserve details such as text/texture.
    • Use an implicit 3D latent space: GeNVS / SyncDreamer
      • Strong results, and preserves 3D structure to a good extent, but training and inference are expensive. I look forward to reading works that scale such methods!
    • Apply explicit depth-based warping and inpainting: iNVS / Text2Room
      • 3D-consistent across views, and preserves details such as text/texture. But precise monocular depth is hard to obtain and becomes a single point of failure (a naive warping sketch is at the end of this page).
    • Add constraints using epipolar geometry and positional encodings: SPAD / MVDiffusion
      • Preserves 3D consistency and details to a good extent. But restricting attention masks to epipolar lines can be suboptimal and is difficult to accelerate with fused-attention libraries (e.g., FlashAttention).
  • Quality of Data >> Quantity of Data (for 3D finetuning):
    • Many works in Vision [EMU, Zero123-XL] and Language [Instruction-Tuning] showed that finetuning base models on tiny amounts of high-quality data can bring enormous gains.
    • This is true for 3DV too: SPAD was trained on only 25% of Objaverse, yet it outperforms models that were trained longer and on the full Objaverse (such as iNVS)!
    • For SPAD, I filtered Objaverse using metadata such as likes, comments, and user view counts (a filtering sketch is at the end of this page).
    • A similar trend occurred in Zero123-XL, where naively scaling (~10x-ing) the data and compute didn't help, but better curation did.
  • Too much synthetic data (e.g., Objaverse) can hurt!
    • Objaverse is not as high-quality as LAION Aesthetics, and the longer we train on it, the cruder the model becomes (it forgets its photorealism). Because of this, Instant3D trained their model on only 10K assets for 10K steps!
    • Solid color backgrounds cause trouble.
      • This work showed that SD cannot generate images with a solid-color background, and I found that Objaverse-finetuned SD struggles with the same issue.
      • I fixed this (somewhat hackily) with inference tricks: a) using the partial view for the initial denoising steps in iNVS, and b) Gaussian blob initialization in SPAD (borrowed from Instant3D); a blob-initialization sketch is at the end of this page.
      • After spending hours trying to fix these issues, I feel strongly that inference in DMs remains too complicated (e.g., we should get rid of classifier-free guidance)!
      • Consistency / Distilled Models seem promising! I look forward to playing more with them. :)
      • Fooocus documents hidden inference tricks for vanilla 2D DMs!
  • Autoencoders (and Visual Tokenizers) for DMs still need lots of work.
    • VAEs in SD (1/2.x) do not preserve details (e.g., human faces or text) during reconstruction.
    • SDXL was a significant improvement in preserving high-quality details.
    • EMU (from Meta) and Consistency Decoder (from OpenAI) improve VAEs for 2D DMs.
    • I think we can borrow more lessons from video codecs.
  • Diffusion losses are not the best indicators of progress.
    • Training loss curves in DMs decline in the initial phase (first ~50K steps) and then flatten out.
    • This gives the impression that the model has stagnated or that the LR should be decreased further.
    • However, the performance metrics (FID / CLIP) keep improving. This may have something to do with the loss formulation. I plan to read more about this in the future!
  • Adding controls to Diffusion Models.
    • LoRA / ControlNet are the best (and easiest) options.
    • Avoid adding parameters that:
      • Disturb the original computation flow of a pretrained SD model.
      • Are initialized from scratch.
    • However, if you really need to, consider the following ideas:
      • If the new parameters lie midway within the network, initialize them to the identity (zero weights and bias on the last layer); see the sketch at the end of this page.
      • Start simple (MLPs without activations), and add complexity (activations / attention) iteratively.
      • If the new parameters come after the pretrained model, bias their initialization towards the output distribution (see SigLIP, Section 3.2).
      • Consider using a higher LR for the new parameters than for the pretrained network.
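
Code Sketches.

A naive sketch of the depth-based warping idea behind iNVS / Text2Room: unproject source pixels with a depth map, transform them into the target camera, and splat them forward. This is a nearest-neighbor scatter with no occlusion handling (unlike softmax-splatting), and the intrinsics/pose arguments are placeholders.

```python
import torch


def forward_warp(image, depth, K, R, t):
    """Warp `image` (3, H, W) into a target view using source `depth` (H, W).

    K: (3, 3) intrinsics shared by both views; R, t: source-to-target rotation/translation.
    Nearest-neighbor splatting, no occlusion handling; holes are left black for inpainting.
    """
    C, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)  # (3, H*W)

    pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3D (source frame)
    pts = R @ pts + t.reshape(3, 1)                          # move into the target camera frame
    proj = K @ pts
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                 # perspective divide -> target pixels

    u, v = uv.round().long()
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts[2] > 0)

    target = torch.zeros_like(image)
    target[:, v[valid], u[valid]] = image.reshape(C, -1)[:, valid]
    return target
```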
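
A sketch of the kind of metadata filtering used to curate Objaverse for SPAD. The field names (likeCount, commentCount, viewCount), the JSON dump path, and the thresholds are hypothetical placeholders, not the actual filter.

```python
import json


def keep_asset(meta: dict) -> bool:
    """Rough popularity-based filter on (hypothetical) Sketchfab-style metadata."""
    return (
        meta.get("likeCount", 0) >= 5
        or meta.get("commentCount", 0) >= 2
        or meta.get("viewCount", 0) >= 500
    )


with open("objaverse_metadata.json") as f:   # {uid: metadata} dump; path is a placeholder
    metadata = json.load(f)

curated_uids = [uid for uid, meta in metadata.items() if keep_asset(meta)]
print(f"kept {len(curated_uids)} / {len(metadata)} assets")
```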
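
A sketch of the Gaussian-blob initialization trick (borrowed from Instant3D): instead of starting from pure noise, start from a noised latent of a soft blob on a white background, which biases the early denoising steps towards a centered object on a clean background. The blob parameters and the commented vae/scheduler usage are assumptions in diffusers-style pseudocode.

```python
import torch


def gaussian_blob_image(size=512, sigma_frac=0.25):
    """Soft dark blob centered on a white background, scaled to [-1, 1] for the VAE."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, size), torch.linspace(-1, 1, size), indexing="ij"
    )
    blob = torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_frac ** 2))   # 1 at center, ~0 at edges
    img = 1.0 - blob                                                 # white background, dark center
    return (img * 2 - 1).expand(3, size, size).unsqueeze(0)          # (1, 3, H, W)


# Hypothetical usage with a diffusers-style pipeline (vae and scheduler assumed to exist):
# blob = gaussian_blob_image().to(device=device, dtype=dtype)
# latents = vae.encode(blob).latent_dist.sample() * vae.config.scaling_factor
# noise = torch.randn_like(latents)
# latents = scheduler.add_noise(latents, noise, scheduler.timesteps[0])
# ... then run the usual denoising loop starting from `latents` instead of pure noise.
```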
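
A minimal sketch of the initialization advice above: a small residual control branch whose last layer is zero-initialized, so at step 0 the pretrained computation flow is untouched, and which is trained with a higher LR than the pretrained weights. The module, dimensions, and LR values are illustrative, not SD-specific.

```python
import torch
import torch.nn as nn


class ZeroInitControl(nn.Module):
    """Residual branch that is an exact identity at initialization."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        nn.init.zeros_(self.net[-1].weight)   # zero weights and bias on the last layer ...
        nn.init.zeros_(self.net[-1].bias)     # ... so the branch contributes nothing at step 0

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return x + self.net(cond)             # pretrained activations pass through unchanged initially


# Higher LR for the new parameters than for the pretrained network (values are placeholders;
# pretrained_unet and control are assumed to exist):
# optimizer = torch.optim.AdamW([
#     {"params": pretrained_unet.parameters(), "lr": 1e-5},
#     {"params": control.parameters(), "lr": 1e-4},
# ])
```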