Yash Kant

Hi! I am a Ph.D. student in the Department of Computer Science and the Robotics Group at the University of Toronto, where I am advised by Igor Gilitschenski.

Currently, I am interning at Snap Research on Sergey Tulyakov's team, where I am building neural representations for deformable 3D objects.

Previously, I spent two years as a Research Visitor at Georgia Tech, advised by Devi Parikh and Dhruv Batra. There, I built Visual Question Answering models that can read text in images and answer questions robustly. I also built a benchmark to measure commonsense reasoning in embodied AI agents.

I enjoy talking to people and building (hopefully useful) things together. :)

If you want to discuss research/collaborate, feel free to send me an email!

Email  /  CV  /  Github  /  Google Scholar  /  Twitter  /  LinkedIn

Housekeep: Tidying Virtual Households using Commonsense Reasoning
Yash Kant, Arun Ramachandran, Sriram Yenamandra, Igor Gilitschenski, Dhruv Batra, Andrew Szot*, and Harsh Agrawal*
arXiv / project page
Housekeep is a benchmark for evaluating household commonsense reasoning in embodied AI. An embodied agent must tidy a house by rearranging misplaced objects, without explicit instructions specifying which objects need to be rearranged.

To capture the rich diversity of real-world scenarios, we support cluttering 14 household environments with nearly 1,800 everyday 3D object models spread across nearly 270 categories!
Contrast and Classify: Training Robust VQA Models
Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal
International Conference on Computer Vision (ICCV), 2021
Self-Supervised Learning Workshop at NeurIPS, 2020
arXiv / project page / code / slides

We propose a training scheme that steers VQA models toward answering paraphrased questions consistently, beating previous baselines by an absolute 5.8% on consistency metrics without any performance drop!

Spatially Aware Multimodal Transformers for TextVQA
Yash Kant, Dhruv Batra, Peter Anderson, Jiasen Lu, Alexander Schwing, Devi Parikh, Harsh Agrawal
European Conference on Computer Vision (ECCV), 2020
VQA Workshop at CVPR, 2020
arXiv / project page / code / short talk / long talk / slides

We built a self-attention module to reason over spatial graphs in images, achieving an absolute performance improvement of more than 4% on two TextVQA benchmarks!

Automated Video Description for Blind and Low Vision Users
Aditya Bodi, Pooyan Fazli, Shasta Ihorn, Yue-Ting Siu, Andrew T Scott, Lothar Narins,
Yash Kant, Abhishek Das, Ilmi Yoon
CHI Extended Abstracts, 2021

We built a system to automatically generate descriptions for videos and answer blind and low-vision users' queries about those videos!



I borrowed this template from Jon Barron's website.