Yash Kant

Hi! I am a Ph.D. student in the Department of Computer Science and the Robotics Group at the University of Toronto, advised by Igor Gilitschenski.

I also work part-time at Snap Research in Sergey Tulyakov's team, where I am building neural representations for deformable 3D objects.

Previously, I spent two years as a Research Visitor at Georgia Tech, advised by Devi Parikh and Dhruv Batra. There, I built robust Visual Question Answering models, including models that can read, as well as a benchmark to measure commonsense in embodied AI agents.

I enjoy talking to people and building (hopefully useful) things together. :)

These days, I also work closely with MS/BS students, and I have openings for students with sufficient time and motivation. To get in touch, please send me an email!

Email  /  CV  /  Github  /  Google Scholar  /  Twitter  /  LinkedIn

Invertible Neural Skinning
Yash Kant, Aliaksandr Siarohin, Riza Alp Guler, Menglei Chai, Jian Ren, Sergey Tulyakov, Igor Gilitschenski
Coming Soon! (awaiting patent approval)
We propose an end-to-end invertible and learnable reposing pipeline that allows animating implicit surfaces with intricate pose-varying effects. We outperform state-of-the-art reposing techniques on clothed humans while preserving surface correspondences and being an order of magnitude faster!
Building Scalable Video Understanding Benchmarks through Sports
Aniket Agarwal^, Alex Zhang^, Karthik Narasimhan, Igor Gilitschenski, Vishvak Murahari*, Yash Kant*
Coming Soon!
We introduce an automated Annotation and Video Stream Alignment Pipeline (abbreviated ASAP) for aligning unlabeled videos of four different sports (Cricket, Football, Basketball, and American Football) with their corresponding dense annotations (commentary) freely available on the web. Our human studies indicate that ASAP can align videos and annotations with high fidelity, precision, and speed!
LaTeRF: Label and Text Driven Object Radiance Fields
Ashkan Mirzaei, Yash Kant, Jonathan Kelly, and Igor Gilitschenski
ECCV, 2022
arXiv / code
We build a simple method to extract an object from a scene given 2D images, camera poses, a natural language description of the object, and a few annotated pixels of object and background.
Housekeep: Tidying Virtual Households using Commonsense Reasoning
Yash Kant, Arun Ramachandran, Sriram Yenamandra, Igor Gilitschenski, Dhruv Batra, Andrew Szot*, and Harsh Agrawal*
ECCV, 2022
arXiv / project page / code / colab
Housekeep is a benchmark to evaluate commonsense reasoning in the home for embodied AI. Here, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions.

To capture the rich diversity of real-world scenarios, we support cluttering environments with ~1800 everyday 3D object models spread across ~270 categories!
Contrast and Classify: Training Robust VQA Models
Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal
ICCV, 2021
arXiv / project page / code / slides

We propose a training scheme that steers VQA models toward answering paraphrased questions consistently. It beats previous baselines by an absolute 5.8% on consistency metrics without any performance drop!

Spatially Aware Multimodal Transformers for TextVQA
Yash Kant, Dhruv Batra, Peter Anderson, Jiasen Lu, Alexander Schwing, Devi Parikh, Harsh Agrawal
ECCV, 2020
arXiv / project page / code / short talk / long talk / slides

We built a self-attention module that reasons over spatial graphs in images, yielding an absolute performance improvement of more than 4% on two TextVQA benchmarks!

Automated Video Description for Blind and Low Vision Users
Aditya Bodi, Pooyan Fazli, Shasta Ihorn, Yue-Ting Siu, Andrew T Scott, Lothar Narins, Yash Kant, Abhishek Das, Ilmi Yoon
CHI Extended Abstracts, 2021

We built a system that automatically generates descriptions for videos and answers blind and low vision users' queries about them!



I borrowed this template from Jon Barron's website.