Previously, I was a Research Visitor at Georgia Tech advised by Devi Parikh and Dhruv Batra for two years (2019-21). There, I built Visual Question Answering models that can read text in images and are robust, and a benchmark to measure commonsense in embodied AI agents.
I enjoy talking to people and building (hopefully useful) things together. :)
Repurposing Diffusion Inpainters for Novel View Synthesis
Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, Igor Gilitschenski
In Submission. Coming Soon!
We perform novel view synthesis from a single image by repurposing the Stable Diffusion inpainting model together with depth-based 3D unprojection. We conduct large-scale training on ~19M rendered Objaverse images and validate results on Google Scanned Objects, CO3D, and RTMV.
Invertible Neural Skinning
Yash Kant, Aliaksandr Siarohin, Riza Alp Guler, Menglei Chai, Jian Ren, Sergey Tulyakov, Igor Gilitschenski
CVPR, 2023
arXiv / project page
We propose an end-to-end invertible and learnable reposing pipeline that allows animating implicit surfaces with intricate pose-varying effects. We outperform state-of-the-art reposing techniques on clothed humans while preserving surface correspondences and being an order of magnitude faster!
Building Scalable Video Understanding Benchmarks through Sports
Aniket Agarwal^, Alex Zhang^, Karthik Narasimhan, Igor Gilitschenski, Vishvak Murahari*, Yash Kant*
arXiv / project page / code
We introduce an automated Annotation and Video Stream Alignment Pipeline (abbreviated ASAP) for aligning unlabeled videos of four different sports (Cricket, Football, Basketball, and American Football) with their corresponding dense annotations (commentary) freely available on the web. Our human studies indicate that ASAP can align videos and annotations with high fidelity, precision, and speed!
LaTeRF: Label and Text Driven Object Radiance Fields
Ashkan Mirzaei, Yash Kant, Jonathan Kelly, and Igor Gilitschenski
ECCV, 2022
arXiv / code
We build a simple method to extract an object from a scene given 2D images, camera poses, a natural language description of the object, and a few annotated pixels of the object and background.
We propose a training scheme that steers VQA models towards answering paraphrased questions consistently, beating previous baselines by an absolute 5.8% on consistency metrics without any performance drop!
We built a system to automatically generate descriptions for videos and answer blind and low-vision users' queries about them!
Adding Complement Objective Training to Pythia: I experimented with adding Complement Objective Training to FAIR's vision-and-language framework Pythia; I wrote a report on my findings here, and the code is here.
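For context, below is a minimal sketch of the complement-entropy term that Complement Objective Training (Chen et al., ICLR 2019) alternates with the usual cross-entropy step; it flattens the probability mass a classifier assigns to the incorrect classes. The function name and epsilon constant are my own illustrative choices, this is the standard single-label formulation rather than my Pythia integration, and Pythia's multi-label answer head would need this adapted.

import torch
import torch.nn.functional as F

def complement_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Complement Objective Training: alongside the usual cross-entropy step, take a
    # second step that flattens the probability mass assigned to the *incorrect*
    # classes. Returns the negated, class-count-normalized complement entropy so it
    # can be minimized directly.
    probs = F.softmax(logits, dim=1)                   # (batch, num_classes)
    num_classes = probs.size(1)
    p_true = probs.gather(1, targets.unsqueeze(1))     # probability of the true class
    comp = probs / (1.0 - p_true + 1e-7)               # renormalize over wrong classes
    comp = comp.scatter(1, targets.unsqueeze(1), 0.0)  # zero out the true class
    entropy = -(comp * torch.log(comp + 1e-7)).sum(dim=1)
    return -entropy.mean() / num_classes

# Illustrative alternating update (names are hypothetical):
#   loss_xe = F.cross_entropy(model(batch), answers); loss_xe.backward(); opt.step()
#   loss_comp = complement_entropy(model(batch), answers); loss_comp.backward(); opt.step()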