Matthias Bethge
University of Tübingen, Germany

Learning to see like humans

Abstract: How can we teach machines to see the world like humans? Taking inspiration from the ventral pathway in the visual brain, convolutional neural networks (CNNs) have become a key tool for solving computer vision problems, often reaching human-level performance on benchmark tasks like object recognition or detection. Despite these successes, perceptual decision making and generalization in machines are still very different from those in humans. In this talk, I will present ongoing work from my lab that aims to better understand these differences between human vision and CNNs by studying constrained architectures, adversarial testing, and out-of-domain generalization.

Bio: Matthias Bethge is Professor for Computational Neuroscience and Machine Learning at the University of Tübingen and director of the Tübingen AI Center, a joint center between Tübingen University and MPI for Intelligent Systems that is part of the German AI strategy. He is also an Amazon scholar, co-founder of Deepart UG and Layer7 AI GmbH, and co-initiator of the European ELLIS initiative. His main research focus is on robust vision and neural decision making with the goal to advance internal model learning with neural networks. He received the first Bernstein Prize for Computational Neuroscience in 2006 and later became director of the Bernstein Center Tübingen and vice chair of the German Bernstein network. His work on neural style transfer was among the top-ten most popular publications in 2015 among all disciplines (altmetric). He has been serving as area chair for various conferences such as NeurIPS, ICLR, and Cosyne and as general chair for the Bernstein conference, and initiated the “BWKI”, a Germany-wide school competition for AI.

Sabine Süsstrunk
EPFL, Switzerland

Editing in Style: Uncovering the Local Semantics of GANs

Abstract: While the quality of GAN image synthesis has tremendously improved in recent years, our ability to control and condition the output is still limited. Focusing on StyleGAN, we introduce a simple and effective method for making local, semantically-aware edits to a target output image. This is accomplished by borrowing elements from a source image, also a GAN output, via a novel manipulation of style vectors. Our method neither requires supervision from an external model nor involves complex spatial morphing operations. Instead, it relies on the emergent disentanglement of semantic objects that is learned by StyleGAN and StyleGAN 2 during their training. Semantic editing is demonstrated on GANs producing human faces, indoor scenes, cats, and cars. We measure the locality and photorealism of the edits produced by our method, and find that it accomplishes both. This is joint work with Edo Collins, now at Google, and Raja Bala and Bob Price from PARC.

Bio: Prof. Dr. Sabine Süsstrunk has led the Image and Visual Representation Lab in the School of Computer and Communication Sciences (IC) at EPFL since 1999. Her main research areas are in computational photography, computational imaging, color image processing and computer vision, machine learning, and computational image quality and aesthetics. Sabine has authored and co-authored over 150 publications, of which 7 have received best paper/demo awards, and holds over 10 patents. Sabine is Founding Member and Member of the Board (President 2014-2018) of the EPFL WISH (Women in Science and Humanities) Foundation, Member of the Foundation Council of the SNSF (Swiss National Science Foundation), Member of the Board of the SRG SSR (Swiss Radio and Television), and Member of the Board of Largo Films SA. She received the IS&T/SPIE 2013 Electronic Imaging Scientist of the Year Award for her contributions to color imaging, computational photography, and image quality, and the 2018 IS&T Raymond C. Bowman Award for dedication in preparing the next generation of imaging scientists. Sabine is a Fellow of IEEE and IS&T.

Vittorio Ferrari
Google Zürich, Switzerland

Our recent research on 3D Deep Learning

Abstract: I will present three recent projects within the 3D Deep Learning research line from my team at Google Research:
(1) A neural network model for reconstructing the 3D shape of multiple objects appearing in a single RGB image (ECCV’20).
(2) A new conditioning scheme for normalizing flow models. It enables several applications such as reconstructing an object’s 3D point cloud from an image, or the converse problem of rendering an image given a 3D point cloud (CVPR’20).
(3) A neural rendering framework that maps a voxelized object into a high quality image. It renders highly textured objects and illumination effects such as reflections and shadows realistically. It allows controllable rendering: geometric and appearance modifications in the input are accurately represented in the final rendering (CVPR’20).

Bio: Vittorio Ferrari is a Senior Staff Research Scientist at Google, where he leads a research group on visual learning. He received his PhD from ETH Zurich in 2004, then was a post-doc at INRIA Grenoble (2006-2007) and at the University of Oxford (2007-2008). Between 2008 and 2012 he was an Assistant Professor at ETH Zurich, funded by a Swiss National Science Foundation Professorship grant. From 2012 to 2018 he was faculty at the University of Edinburgh, where he became a Full Professor in 2016 (he is now an Honorary Professor). In 2012 he received the prestigious ERC Starting Grant, and the best paper award from the European Conference on Computer Vision. He is the author of over 120 technical publications. He regularly serves as an Area Chair for the major computer vision conferences; he was a Program Chair for ECCV 2018 and is a General Chair for ECCV 2020. He is an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. His current research interests are in learning visual models with minimal human supervision, human-machine collaboration, and 3D Deep Learning.

Anna Khoreva
Bosch Center for Artificial Intelligence, Germany

Improving Image Synthesis of GANs with Segmentation-Based Discriminators

Abstract: The quality of synthetic images produced by generative adversarial networks (GANs) has seen tremendous improvement recently. However, despite the recent advances, learning to synthesize globally and locally coherent images with object shapes and textures indistinguishable from real images remains challenging. One potential source of the problem lies in the discriminator network.
The discriminator aims to model the data distribution, acting as a loss function that provides the generator with a learning signal to synthesize realistic image samples. The stronger the discriminator is, the better the generator has to become. In the current state-of-the-art GAN models, the discriminator, being a classification network, learns only a representation that allows it to efficiently penalize the generator based on the most discriminative difference between real and synthetic images. The problem is amplified because the discriminator has to learn in a non-stationary environment (the distribution of synthetic samples shifts as the generator constantly changes through training) and is prone to forgetting previous tasks (in the context of discriminator training, learning semantics, structures, and textures can be considered different tasks). Such a discriminator is not incentivized to maintain a more powerful data representation that captures both global and local image differences as well as object semantics. This often results in generated images with discontinuous local structures and geometrically or semantically incoherent patterns (e.g. asymmetric faces or animals with extra legs).
To address this issue, we propose an alternative discriminator architecture, re-designing the discriminator as an encoder-decoder segmentation network tuned to the specific unconditional or conditional image synthesis task at hand. The proposed architectural change makes it possible to provide detailed, spatially-aware feedback to the generator while maintaining the global coherence of synthesized images. This leads to a stronger discriminator, which is encouraged to maintain a more powerful data representation, making the generator's task of fooling the discriminator more challenging and thus improving the quality of generated samples. The novel discriminator improves over the state-of-the-art models (BigGAN, SPADE) across different datasets and tasks in terms of the standard distribution and image quality metrics, enabling the generator to synthesize images with varying structure, appearance and levels of detail, maintaining semantics as well as global and local realism.

Bio: Anna Khoreva is leading the Data Efficient Deep Learning research group at the Bosch Center for Artificial Intelligence (BCAI). Her research interests lie in the field of data efficient deep learning, with a particular focus on generative modeling, image and video synthesis, unsupervised and weakly supervised learning. Before joining BCAI, Anna was a postdoctoral researcher at the Max Planck Institute for Informatics with Prof. Bernt Schiele. She received her master’s degree in Visual Computing (2014) and PhD in Computer Science (2017) from Saarland University, where she worked on weakly supervised image and video segmentation.

Bernt Schiele
MPI Saarbrücken, Germany

The Bright and Dark Sides of Computer Vision and Machine Learning —Challenges and Opportunities for Robustness and Security

Abstract: Computer Vision has been revolutionized by Machine Learning and in particular Deep Learning. For many problems which have been studied for decades, state-of-the-art performance has dramatically improved by using artificial neural networks. However, these methods come with their own challenges concerning robustness and security. In this talk I will summarize some of our recent efforts in this space. For example, while context information is essential for best performance, it might lead to overconfident or even wrong predictions of our methods. I will also discuss new insights about reverse engineering deep neural networks as well as cheaply stealing their entire functionality. While we are clearly in the infancy of understanding the robustness and security implications of deep neural networks, the talk aims to raise awareness as well as to motivate more researchers to address these important challenges.

Bio: Bernt Schiele has been Max Planck Director at MPI for Informatics and Professor at Saarland University since 2010. He studied computer science at the University of Karlsruhe, Germany. He worked on his master's thesis in the field of robotics in Grenoble, France, where he also obtained the “diplôme d’études approfondies d’informatique”. In 1994 he worked in the field of multi-modal human-computer interfaces at Carnegie Mellon University, Pittsburgh, PA, USA in the group of Alex Waibel. In 1997 he obtained his PhD from INP Grenoble, France under the supervision of Prof. James L. Crowley in the field of computer vision. The title of his thesis was “Object Recognition using Multidimensional Receptive Field Histograms”. Between 1997 and 2000 he was a postdoctoral associate and Visiting Assistant Professor with the group of Prof. Alex Pentland at the Media Laboratory of the Massachusetts Institute of Technology, Cambridge, MA, USA. From 1999 until 2004 he was Assistant Professor at the Swiss Federal Institute of Technology in Zurich (ETH Zurich). Between 2004 and 2010 he was Full Professor at the computer science department of TU Darmstadt.

Siyu Tang
ETH Zürich, Switzerland

Generating People Interacting with 3D Scenes

Abstract: High-fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically populate such environments with realistic human bodies. Existing work utilizes images, depths or semantic maps to represent the scene, and parametric human models to represent 3D bodies in the scene. While straightforward, these approaches often generate human-scene interactions that lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. Explicitly and effectively representing the physical contact between the body and the world is essential for modeling human-scene interaction. To that end, we propose a novel interaction representation, which explicitly encodes the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the distance from every basis point to its closest point on a human body. The synthesized proximal relationship between the human body and the scene can indicate which region a person tends to contact. Furthermore, based on such synthesized proximity, we are able to effectively obtain expressive 3D human bodies that interact with the 3D scene naturally. Our perceptual study shows that our model significantly improves on the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes.
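
As a rough illustration of the proximity representation described in the abstract, the sketch below computes, for each basis point on a scene, the distance to its closest point on a body. It is not the authors' implementation: in the talk's method these distances are synthesized by a conditional variational autoencoder, whereas here they are simply measured by brute force from given points. The function name, the toy coordinates, and the contact threshold are all assumptions made for this example.

```python
import math

def basis_point_distances(basis_points, body_points):
    """For each basis point, return the distance to its nearest body point."""
    return [min(math.dist(b, p) for p in body_points) for b in basis_points]

# Hypothetical toy data: three basis points sampled on a flat floor (z = 0).
scene_basis = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Hypothetical samples from a body surface; the first one touches the floor.
body_surface = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.5)]

features = basis_point_distances(scene_basis, body_surface)

# Assumed threshold (in the same units as the coordinates) below which
# a basis point is treated as being in contact with the body.
contact_threshold = 0.05
in_contact = [d < contact_threshold for d in features]
```

The resulting list has one distance per basis point, so small values mark scene regions near the body and the thresholded flags give a crude contact indicator.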

Bio: Siyu Tang has been an assistant professor at ETH Zürich in the Department of Computer Science since January 2020. She received an early career research grant to start her own research group at the Max Planck Institute for Intelligent Systems in November 2017. She finished her PhD (summa cum laude) at the Max Planck Institute for Informatics and Saarland University in September 2017, under the supervision of Professor Bernt Schiele. Before that, she received her Master’s degree in Computer Science at RWTH Aachen University, advised by Prof. Bastian Leibe, and her Bachelor’s degree in Computer Science at Zhejiang University, China. She was a research intern at the Japanese National Institute of Informatics. Dr. Tang received the DAGM-MVTec Dissertation Award in 2018 and the ELLIS PhD Award in 2019. She was the winner of the multi-object tracking challenge at ECCV’16 and CVPR’17. She also received a Best Paper Award for her work “Detection and tracking of occluded people” at BMVC 2012.

Christoph Lampert
IST Austria

Learning Robustly from Multiple Sources

Abstract: We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms. Specifically, we analyze the situation in which a learning system obtains datasets from multiple sources, some of which might be biased or even adversarially perturbed. It is known that in the single-source case, an adversary with the power to corrupt a fixed fraction of the training data can prevent “learnability”, that is, even in the limit of infinitely much training data, no learning system can approach the optimal test error. I present recent work with Nikola Konstantinov in which we show that, surprisingly, the same is not true in the multi-source setting, where the adversary can arbitrarily corrupt a fixed fraction of the data sources.

Bio: Christoph Lampert received the PhD degree in mathematics from the University of Bonn in 2003. After postdoctoral positions at the German Research Center for Artificial Intelligence (DFKI) and the Max Planck Institute for Biological Cybernetics, he joined the Institute of Science and Technology Austria (IST Austria) in 2010, first as Assistant Professor and since 2015 as Professor. Since 2019 he has been the head of the ELLIS unit at IST Austria. His research interests include machine learning and computer vision. For his research at the interface of these fields he was awarded an ERC Starting Grant (consolidator phase) by the European Research Council. He is an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Editor of the International Journal of Computer Vision (IJCV) and Action Editor of the Journal of Machine Learning Research (JMLR).

Davide Scaramuzza
University of Zürich, Switzerland

Vision-based Autonomous Drones: State of the Art and the Road Ahead

Abstract: In the past three years we have witnessed the rise of micro drones, weighing from 30g up to 500g, that perform autonomous agile maneuvers with onboard sensing and computation. In this talk, I will summarize the key scientific and technological achievements and the next challenges.

Bio: Davide Scaramuzza (Italian) is Professor of Robotics and Perception in the Departments of Informatics (University of Zurich) and Neuroinformatics (University of Zurich and ETH Zurich), where he directs the Robotics and Perception Group. He did his PhD in robotics and computer vision at ETH Zurich (with Roland Siegwart) and a postdoc at the University of Pennsylvania (with Vijay Kumar and Kostas Daniilidis). His research lies at the intersection of robotics, computer vision, and machine learning, using standard or neuromorphic cameras, and is aimed at enabling autonomous, agile navigation of micro drones in search and rescue applications. For his research contributions, he received several awards: the IEEE Robotics and Automation Society Early Career Award, an ERC Grant, a Google Research Award, KUKA, Qualcomm, and Intel awards, the European Young Research Award, the Misha Mahowald Neuromorphic Engineering Award, and several paper awards. He coauthored the book “Introduction to Autonomous Mobile Robots” (published by MIT Press; 10,000 copies sold) and more than 100 papers on robotics and perception published in top-ranked journals (TRO, PAMI, IJCV, IJRR) and conferences (RSS, ICRA, CVPR, ICCV). In 2015, he cofounded Zurich-Eye, dedicated to the commercialization of visual-inertial navigation solutions for mobile robots, which later became Facebook Oculus Zurich. He was also the strategic advisor of Dacuda, an ETH spinoff dedicated to inside-out VR solutions, which later became Magic Leap Zurich. Many aspects of his research have been prominently featured in wider media, such as The New York Times, BBC News, Discovery Channel, La Repubblica, and Neue Zürcher Zeitung, and in technology-focused media, such as IEEE Spectrum, MIT Technology Review, TechCrunch, Wired, and The Verge.