Join the virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

**Date, Time and Location**

Oct 30, 2025 | 9 AM Pacific | Online

**[Register for the Zoom!](https://voxel51.com/events/ai-ml-and-computer-vision-meetup-october-30-2025)**

**The Agent Factory: Building a Platform for Enterprise-Wide AI Automation**

In this talk we will explore what it takes to build an enterprise-ready AI automation platform at scale. The topics covered will include:

* The Scale Challenge: E-commerce environments expose the limitations of single-point AI solutions, which create fragmented ecosystems lacking cohesion and efficient resource sharing across complex, knowledge-based work.
* Root Cause Analysis Success: Flipkart’s initial AI agent cut business analyses from days-long investigations to near-instantaneous insights, proving the concept while revealing broader platform opportunities.
* Platform Strategy Evolution: Success across Engineering (SDLC, SRE), Operations, and Commerce teams necessitated a unified, multi-tenant platform serving diverse use cases with consistency and operational efficiency.
* Architectural Foundation: A framework-agnostic design emphasizes modularity, letting teams use different AI models behind consistent interfaces on shared, scalable infrastructure.
* The “Agent Garden” Vision: Flipkart’s roadmap envisions an internal ecosystem where teams discover, deploy, and contribute AI agents, providing a practical blueprint for scalable AI agent infrastructure.

*About the Speaker*

[Virender Bhargav](https://www.linkedin.com/in/virender-bhargav/) at Flipkart is a seasoned engineering leader whose expertise spans business technology integration, enterprise applications, system design/architecture, and building highly scalable systems. With a deep understanding of technology, he has spearheaded teams, modernized technology landscapes, and managed core platform layers and strategic products. With extensive experience driving innovation at companies like Paytm and Flipkart, his contributions have left a lasting impact on the industry.

**Scaling Generative Models at Scale with Ray and PyTorch**

Generative image models like Stable Diffusion have opened up exciting possibilities for personalization, creativity, and scalable deployment. Fine-tuning them in production-grade settings, however, poses real challenges: managing compute, hyperparameters, model size, data, and distributed coordination is nontrivial. In this talk, we’ll dive into fine-tuning Stable Diffusion models with Ray Train (and Hugging Face Diffusers), including approaches like DreamBooth and LoRA. We’ll cover what works (and what doesn’t) in scaling out training jobs, handling large datasets, optimizing for GPU memory and speed, and validating outputs. Attendees will come away with practical insights and patterns they can apply to fine-tuning generative models in their own work.
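For a flavor of what this pattern looks like, here is a minimal sketch of LoRA fine-tuning a Stable Diffusion UNet under Ray Train, assuming recent versions of `ray`, `diffusers`, and `peft`. It is an illustration of the general recipe, not the speaker’s code, and `load_my_batches()` is a hypothetical placeholder for the data pipeline.

```python
# A minimal sketch (not the speaker's code) of LoRA fine-tuning for a Stable
# Diffusion UNet with Ray Train, Diffusers, and peft. Data loading is elided:
# load_my_batches() is a hypothetical iterator yielding GPU tensors of
# VAE-encoded image latents and matching text-encoder embeddings.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionPipeline
from peft import LoraConfig
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

MODEL_ID = "runwayml/stable-diffusion-v1-5"

def train_loop_per_worker(config):
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID)
    unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder
    scheduler = DDPMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

    # Freeze everything, then attach small LoRA adapters to the UNet's
    # attention projections; only the adapter weights are trained.
    vae.requires_grad_(False)
    text_encoder.requires_grad_(False)
    unet.requires_grad_(False)
    unet.add_adapter(LoraConfig(
        r=config["rank"], lora_alpha=config["rank"],
        target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

    unet = prepare_model(unet)  # DDP-wraps and moves to this worker's GPU
    trainable = [p for p in unet.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=config["lr"])

    for latents, text_emb in load_my_batches():  # hypothetical data pipeline
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        loss = F.mse_loss(pred, noise)  # standard noise-prediction objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train.report({"loss": loss.item()})

# Four GPU workers, each running the same training loop.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"rank": 8, "lr": 1e-4},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```

One appeal of this setup is that the LoRA rank, worker count, and adapter targets all live in config, so scaling experiments become parameter changes rather than rewrites.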
*About the Speaker*

[Suman Debnath](https://www.linkedin.com/in/suman-d/) is a Technical Lead (ML) at Anyscale, where he focuses on distributed training, fine-tuning, and inference optimization at scale on the cloud. His work centers on building and optimizing end-to-end machine learning workflows powered by distributed computing frameworks like Ray, enabling scalable and efficient ML systems. Suman’s expertise spans Natural Language Processing (NLP), Large Language Models (LLMs), and Retrieval-Augmented Generation (RAG). Earlier in his career, he developed performance benchmarking and monitoring tools for distributed storage systems. Beyond engineering, Suman is an active community contributor, having spoken at over 100 global conferences and events, including PyCon, PyData, ODSC, AIE, and numerous meetups worldwide.

**Privacy-preserving in Computer Vision through Optics Learning**

Cameras are now ubiquitous, powering computer vision systems that assist us in everyday tasks and in critical settings such as operating rooms. Yet their widespread use raises serious privacy concerns: traditional cameras are designed to capture high-resolution images, making it easy to identify sensitive attributes such as faces, nudity, or personal objects. Once acquired, such data can be misused if accessed by adversaries. Existing software-based privacy mechanisms, such as blurring or pixelation, often degrade task performance and leave vulnerabilities in the processing pipeline.

In this talk, we explore an alternative question: how can we preserve privacy before or during image acquisition? By revisiting the image formation model, we show how camera optics themselves can be learned and optimized to acquire images that are unintelligible to humans yet remain useful for downstream vision tasks like action recognition. We will discuss recent approaches to learning camera lenses that intentionally produce privacy-preserving images, blurry and unrecognizable to the human eye but still effective for machine perception. This paradigm shift opens the door to a new generation of cameras that embed privacy directly into their hardware design.

*About the Speaker*

[Carlos Hinojosa](https://www.linkedin.com/in/phdcarloshinojosa/) is a postdoctoral researcher at King Abdullah University of Science and Technology (KAUST), working with Prof. Bernard Ghanem. His research interests span Computer Vision, Machine Learning, AI Safety, and AI for Science. He focuses on developing safe, accurate, and efficient vision systems and machine-learning models that can reliably perceive, understand, and act on information, while ensuring robustness, protecting privacy, and aligning with societal values.

**It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data**

Can we match vision and language embeddings without any supervision? According to the platonic representation hypothesis, as model and dataset scales grow, the pairwise distances between corresponding representations become increasingly similar across embedding spaces. Our study demonstrates that these pairwise distances are often sufficient for unsupervised matching, allowing vision-language correspondences to be discovered without any parallel data.

*About the Speaker*

[Dominik Schnaus](https://www.linkedin.com/in/dominik-schnaus/) is a third-year Ph.D. student in the Computer Vision Group at the Technical University of Munich (TUM), supervised by Daniel Cremers. His research centers on multimodal and self-supervised learning, with a special emphasis on understanding similarities across the embedding spaces of different modalities.
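The blind-matching idea lends itself to a small self-contained experiment. Below is a toy sketch, entirely my own construction rather than the paper’s code: synthetic vision and text embeddings share a common latent structure, and SciPy’s FAQ quadratic-assignment heuristic (a stand-in for whatever solver the talk presents) recovers the shuffled correspondence using nothing but each modality’s internal pairwise distances.

```python
# A toy sketch, my own construction (not the authors' code): recover a
# shuffled vision-text correspondence from intra-modal pairwise distances
# alone, via SciPy's FAQ quadratic-assignment (graph matching) heuristic.
import numpy as np
from scipy.optimize import quadratic_assignment

rng = np.random.default_rng(0)

# Hypothetical stand-ins: n concepts share a latent "platonic" structure and
# are embedded by a vision model (d_v dims) and, in shuffled order, by a
# language model (d_t dims).
n, d_v, d_t = 30, 64, 48
shared = rng.normal(size=(n, 16))
vision = shared @ rng.normal(size=(16, d_v))
perm_true = rng.permutation(n)
text = (shared @ rng.normal(size=(16, d_t)))[perm_true]

def pairwise_dist(X):
    # Euclidean distance matrix; only intra-modal geometry is ever used.
    sq = np.sum(X ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))

D_v, D_t = pairwise_dist(vision), pairwise_dist(text)

# Find the permutation that best aligns the two distance matrices.
# FAQ is a local heuristic, so recovery is approximate.
res = quadratic_assignment(D_v, D_t, options={"maximize": True})
inv_perm_true = np.argsort(perm_true)  # ground-truth alignment of D_t to D_v
accuracy = np.mean(res.col_ind == inv_perm_true)
print(f"matched without parallel data: {accuracy:.0%}")
```

With real vision and language models the two geometries only approximately agree, which is exactly the regime the talk examines.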