Yash Kant

Hi! I am a fourth-year CS Ph.D. student at the University of Toronto, currently working on diffusion models. I am advised by Igor Gilitschenski.

Since June 2024, I have been interning at Meta Reality Labs with Shunsuke Saito, where I am part of the GenCA (Generative Codec Avatars) team, which trains large diffusion models on studio-captured humans.

I enjoy talking to people and building (hopefully useful) things together! :)

From May 2022 to Dec 2023, I worked on Sergey Tulyakov's team at Snap Research. My work SPAD was used in Snapchat's text-to-3D pipeline and inspired better data curation and modeling strategies.

At Snap Research, I trained 3D-aware diffusion models for Novel View Synthesis (iNVS) and Text-to-Views (SPAD) tasks, and created a fast character animation technique (INS).

Previously, I was a Research Visitor at Georgia Tech, advised by Devi Parikh and Dhruv Batra for two years (2019-21). There, I developed Housekeep, a benchmark for measuring commonsense reasoning in LLM-based embodied AI agents, and built Visual Question Answering models that can read text in images (SAM) and answer robustly (ConCAT).

Email  /  X  /  Github  /  Scholar  /  LinkedIn  /  CV

Highlights
Reviewing: CVPR, ECCV, ICCV, AAAI, NeurIPS, ACCV, SIGGRAPH, SIGGRAPH Asia.

Selected Research (All)
SPAD: Spatially Aware Multiview Diffusers
Yash Kant, Ziyi Wu, Michael Vasilkovsky, Gordon Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov*, Igor Gilitschenski*, Aliaksandr Siarohin*
CVPR, 2024
arXiv / code / project page / news [hackernews]
We trained a spatially aware multi-view diffusion model that generates multiple consistent novel views in a single forward pass, given a text prompt or image!

SPAD outperforms MVDream and SyncDreamer, and enables generating 3D assets from text within 10 seconds!

Realistic Evaluation of Model Merging for Compositional Generalization
Derek Tam*, Yash Kant*, Brian Lester*, Igor Gilitschenski, Colin Raffel
Under Submission
arXiv
We systematically evaluate several model merging methods within a unified experimental framework, focusing on compositional generalization.

We explore the impact of scaling the number of merged models and sensitivity to hyper-parameters, offering a clear assessment of the current state of model merging techniques.

iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis
Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, Igor Gilitschenski
SIGGRAPH Asia, 2023
arXiv / project page
We perform novel view synthesis from a single image by repurposing the Stable Diffusion inpainting model together with depth-based 3D unprojection. We outperform baselines (such as Zero-1-to-3) on PSNR and LPIPS metrics.

Our 3D-aware inpainting model was trained on Objaverse on 96 A100 GPUs for two weeks!

Invertible Neural Skinning
Yash Kant, Aliaksandr Siarohin, Riza Alp Guler, Menglei Chai, Jian Ren, Sergey Tulyakov, Igor Gilitschenski
CVPR, 2023
arXiv / project page
We propose an end-to-end invertible and learnable reposing pipeline that allows animating implicit surfaces with intricate pose-varying effects. We outperform state-of-the-art reposing techniques on clothed humans while preserving surface correspondences and being an order of magnitude faster!

Housekeep: Tidying Virtual Households using Commonsense Reasoning
Yash Kant, Arun Ramachandran, Sriram Yenamandra, Igor Gilitschenski, Dhruv Batra, Andrew Szot*, and Harsh Agrawal*
ECCV, 2022
arXiv / project page / code / colab / news [coc-gt, tech-org]
Housekeep is a benchmark to evaluate commonsense reasoning in the home for embodied AI. Here, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions.

To capture the rich diversity of real world scenarios, we support cluttering environments with ~1800 everyday 3D object models spread across ~270 categories!

LaTeRF: Label and Text Driven Object Radiance Fields
Ashkan Mirzaei, Yash Kant, Jonathan Kelly, and Igor Gilitschenski
ECCV, 2022
arXiv / code
We present a simple method to extract an object from a scene given 2D images, camera poses, a natural language description of the object, and a few annotated pixels of the object and background.

Building Scalable Video Understanding Benchmarks through Sports
Aniket Agarwal^, Alex Zhang^, Karthik Narasimhan, Igor Gilitschenski, Vishvak Murahari*, Yash Kant*
arXiv / project page / code
We introduce ASAP, an automated Annotation and Video Stream Alignment Pipeline for aligning unlabeled videos of four sports (Cricket, Football, Basketball, and American Football) with their corresponding dense annotations (commentary), freely available on the web. Our human studies indicate that ASAP aligns videos and annotations with high fidelity, precision, and speed!

Contrast and Classify: Training Robust VQA Models
Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal
ICCV, 2021
arXiv / project page / code / slides

We propose a training scheme that steers VQA models toward answering paraphrased questions consistently, beating previous baselines by an absolute 5.8% on consistency metrics without any performance drop!

Spatially Aware Multimodal Transformers for TextVQA
Yash Kant, Dhruv Batra, Peter Anderson, Jiasen Lu, Alexander Schwing, Devi Parikh, Harsh Agrawal
ECCV, 2020
arXiv / project page / code / short talk / long talk / slides

We built a self-attention module to reason over spatial graphs in images, achieving an absolute performance improvement of more than 4% on two TextVQA benchmarks!


I borrowed and modified this website template from Jon Barron.