Keynote Talks

Venkatesh Babu Radhakrishnan

Indian Institute of Science (IISc), Bangalore

Title: Towards Fair and Controllable Diffusion Models

Time: 11:00 am to 12:00 noon. Wednesday, 24.09.2025

Abstract: Diffusion models have transformed text-to-image generation, but challenges remain in fairness, representativeness, and user control. In this talk, we present some of our efforts to address these critical gaps. We begin by examining the demographic and geographic biases in popular generative models, showing over-representation of certain regions and attributes. To mitigate these biases, we propose distribution-guided debiasing methods that align outputs with desired attribute distributions without retraining, enabling fairer and more inclusive generations. Beyond fairness, we introduce fine-grained control mechanisms that enable precise attribute editing and identity preservation, bridging realism with user-driven customization. We extend controllability to spatial reasoning with affordance-aware, text-guided human placement, ensuring semantically plausible compositions, while the proposed zero-shot, depth-aware editing enables realistic scene modifications without additional supervision. We hope these contributions help make generative models equitable, transparent, and highly controllable for real-world applications.

Alex Kolesnikov

OpenAI

Title: A Journey Toward Unified Vision Models

Time: 2:30 pm to 3:30 pm. Wednesday, 24.09.2025

Abstract: Text‑only transformer models thrive on a single, simple interface: next‑token prediction. They then let scale and data do the heavy lifting. By contrast, models in the vision domain remain fragmented and are hindered by non‑trivial components such as box proposals, non‑maximum suppression, and matching losses. In this talk, I’ll share insights from my research journey toward simpler, more unified vision models.

I’ll trace a path through three projects. First, UViM shows how to express structured outputs, such as panoptic segmentation masks and depth maps, as discrete codes that a vanilla autoregressive transformer can generate. Next, I’ll dive into policy‑gradient RL fine‑tuning, addressing fundamental limitations of pure log‑likelihood training by optimizing the metrics we actually care about. Finally, I’ll introduce JetFormer, a decoder‑only autoregressive transformer capable of full end‑to‑end modeling of high‑resolution images.

Dima Damen

University of Bristol and Google DeepMind

Title: Opportunities in Egocentric Vision

Time: 10:30 am to 11:30 am. Thursday, 25.09.2025

Abstract: Anticipating the rise of wearable devices equipped with audio-visual feeds, this talk will present opportunities for research in egocentric video understanding. The talk argues for new ways to view egocentric videos as partial observations of a dynamic 3D world, where objects are out of sight but not out of mind. I'll review our new data collection and annotation effort, HD-EPIC (https://hd-epic.github.io/), which merges video understanding with 3D modelling, showcasing current failures of VLMs in understanding the perspective outside the camera's field of view, a task that is trivial for humans.

All project details are at: dimadamen.github.io/index.html

Stefanie Jegelka

MIT EECS and TU Munich

Title: Does computational structure tell us about deep learning? Some thoughts and examples

Time: 3:30 pm to 4:30 pm. Thursday, 25.09.2025

Abstract: Understanding and steering deep learning training and inference is a nontrivial endeavor. In this talk, I will look at training, learning, and inference from the perspective of computational structure, via a few diverse examples. First, computational structure may help us understand expressiveness and biases in deep learning models. For instance, it can connect graph neural networks to semidefinite programs (SDPs), indicating their capability of learning optimal approximation algorithms. It can also help explain position biases in LLMs. Second, computational structure exists not only in the architecture but also in inference procedures such as chain-of-thought. Finally, if time permits, we will connect architectural structure, via neural parameter symmetries, to the training and loss landscape of deep models and explore the effect of removing symmetries.

Efstratios Gavves

University of Amsterdam and Ellogon.AI

Title: Cyberphysical World Models and Agents

Time: 9:00 am to 10:00 am. Friday, 26.09.2025

Abstract: Artificial intelligence has moved from passive perception to active interaction, yet current systems remain limited in their ability to reason about the physical and causal structure of the world. We propose cyberphysical world models as a new paradigm that unites perception with the governing mechanisms of dynamics and causality. These models go beyond appearance-based representations by encoding the properties, interactions, and consequences that underpin real-world processes. Building on digital twins, mechanistic neural networks, and scalable causal learning, I will describe a vision, and recent works, towards embodied agents that can predict outcomes, plan interventions, and adapt through curiosity-driven exploration. The resulting cyberphysical agents will offer a pathway toward reliable, trustworthy, and autonomous systems, bridging the gap between data-driven learning and the physical world.