Jeet Vora

I’m a researcher working at the intersection of Computer Vision, Video Understanding, Multi-Camera Visual Perception, and Spatio-Temporal Learning, with growing research interests in 3D Vision, Dynamic 4D Scene Understanding, and World Models. My work combines fundamental research with large-scale real-world AI systems, spanning applications in creative media and sports analytics.

I’m currently working as a Research Consultant with the IVUL Lab at King Abdullah University of Science and Technology (KAUST), collaborating with Prof. Bernard Ghanem, Dr. Silvio Giancola, Dr. Merey Ramazanova, and Jintao Ma on large-scale video understanding, multimodal perception, and spatio-temporal learning for sports analytics.

Previously, I worked as a Research Engineer at Animaker, where I led AI research and development for large-scale video creation platforms. My work involved designing and deploying production-grade AI systems across products like Steve.ai, Animaker, Picmaker, and Vmaker, spanning problems such as Generative Video AI, Script-to-Video Generation, Talking-Head Animation, Video Matting, and Multimodal Retrieval.

Before this, I completed my Master’s by Research in Computer Science at IIIT Hyderabad. I was advised by Dr. Vineet Gandhi and was associated with the CVIT Lab. My research focused on Multi-Camera Detection and Tracking, specifically on robust Generalization in Deep Multi-View Pedestrian Detection across unseen camera configurations and environments. As part of this work, I explored simulation-to-real (Sim2Real) transfer by generating synthetic datasets with GTA-V and the Unity game engine.

Alongside my research and engineering work, I’m passionate about mentorship and teaching. I’ve mentored students and professionals in end-to-end AI/ML projects through collaborations with TalentSprint and InLustro Learning. I’ve also delivered hands-on sessions for institutional clients including L&T Madh Training Academy and MLR Institute of Technology.

More broadly, I’m interested in advancing machine perception through dynamic scene understanding, visual reasoning, and structured representations of dynamic environments. My long-term goal is to develop intelligent systems capable of understanding, reconstructing, and reasoning about the visual world while bridging fundamental research and impactful real-world applications.