Monday, October 13: Karen Livescu
Keynote sponsored by Cisco
On the (co-)evolution of universal written, spoken, and signed language processing
Abstract: Natural language processing research has evolved over the past few years from mainly task-specific models, to task-independent representation models fine-tuned for specific tasks, and finally to fully task-independent language models. This progression addresses a desire for universality in the sense of handling arbitrary tasks in the same model. Another dimension of universality is the ability to serve arbitrary types of language users, regardless of their choice of language, dialect, or other individual properties. Universality has historically been pursued largely independently by separate research communities focusing on written, spoken, and signed language, although the three modalities share many similarities. This talk will trace the recent progress toward universality in these three language modalities, while highlighting a few pieces of recent work.
Bio: Karen Livescu is a Professor at TTI-Chicago. She completed her PhD in electrical engineering and computer science at MIT in 2005 and her bachelor’s degree in physics at Princeton University. She is a Fellow of the IEEE and ISCA. She has served as a program chair/co-chair for ICLR, Interspeech, and ASRU, and as an Associate Editor for TACL, IEEE T-PAMI, IEEE T-ASLP, and others. Her group’s work spans a variety of topics in spoken, written, and signed language processing, with a particular interest in representation learning, cross-modality learning, and low-resource settings.
Tuesday, October 14: Alexandre Défossez
Keynote sponsored by Treble
Text-Speech Tasks as Delayed Stream Modeling
Abstract: Speech and audio encompass a variety of tasks: source separation, diarization, transcription, TTS, translation, speech-to-speech, etc. Each comes with its own training objective, specific architecture, or training dataset. In this talk, I will present how our team has been systematically using the same approach, delayed stream modeling, for training across a wide range of speech and speech-text tasks. The framework provides many benefits: efficient long-form streaming and batched inference using decoder-only Transformers, shared pre-training and hyper-parameters across applications, controllability, and more.
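As a rough illustration of the framework (a minimal sketch under assumed details, not Kyutai's actual code), delayed stream modeling can be pictured as several token streams sharing one frame clock, with each stream shifted right by a per-stream delay so that a single decoder-only Transformer predicts one token for every stream at every frame. The padding token, function names, and delay values below are hypothetical.

    # Minimal sketch of the stream-delay idea (assumed details, not Kyutai code).
    PAD = 0  # hypothetical padding token id

    def apply_delays(streams, delays, pad=PAD):
        """streams: {name: [token ids]}; delays: {name: right-shift in frames}."""
        total = max(len(toks) + delays[name] for name, toks in streams.items())
        aligned = {
            name: [pad] * delays[name] + toks
                  + [pad] * (total - delays[name] - len(toks))
            for name, toks in streams.items()
        }
        # The model autoregresses over these per-frame tuples, emitting one
        # token per stream per step; because no stream waits for the full
        # input, long-form streaming inference falls out naturally.
        return [tuple(aligned[name][t] for name in streams) for t in range(total)]

    # Example: the audio stream lags the text stream by two frames (illustrative).
    frames = apply_delays({"text": [11, 12, 13], "audio": [7, 8, 9]},
                          {"text": 0, "audio": 2})

In this picture, which streams are given and which are predicted determines the task (e.g., TTS versus transcription), which is one way to read the shared pre-training and hyper-parameters mentioned above.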
Bio: Alexandre is a co-founder of Kyutai, a non-profit lab for research in artificial intelligence based in Paris. Kyutai's mission is to lead bleeding-edge research and to make it accessible through open science and open source. The lab released the speech-to-speech conversational AI Moshi and, more recently, Hibiki, the first simultaneous speech translation model that can run on a phone. Before that, Alexandre was a scientist for three years at Facebook AI Research in Paris, where he led the development of models for audio compression and modeling (AudioCraft, MusicGen, EnCodec). He graduated in mathematics from École Normale Supérieure and did his PhD between INRIA and FAIR Paris on music source separation.
Wednesday, October 15: Juan Pablo Bello
Keynote sponsored by Adobe
Reframing SELD: Learned Localization, Multichannel Processing, and the Beamforming Gap
Abstract: This talk examines the evolution of Sound Event Localization and Detection (SELD) through the lens of multichannel audio processing. The discussion begins with the field's early reliance on channel-independent models, whose inability to capture inter-channel structure limited localization performance. This is followed by the emergence of approaches that incorporate classical spatial features, such as generalized cross-correlation (GCC) and intensity vectors, to encode spatial relationships explicitly. Particular attention is given to two areas where advances in speech processing offer new directions for SELD. First, learned spatial modeling remains underexplored in SELD compared to speaker localization, where architectures based on cross-channel attention, graph neural networks, and geometry-aware representations underpin recent progress. Second, beamforming-based methods, including powerful neural beamformers widely used in speech enhancement and recognition, have been largely overlooked in SELD. These techniques present significant potential for extending SELD research into open vocabularies, improving multichannel detection, and enabling spatial analysis of individual sources in complex environments. The talk concludes by reflecting on SELD's current trajectory and outlining key opportunities and challenges for future research.
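As a concrete reference point for the classical spatial features mentioned in the abstract, the sketch below computes GCC with the phase transform (GCC-PHAT) between two microphone channels; the peak of the phase-weighted cross-correlation estimates the time difference of arrival, the basic localization cue such features encode. This is a generic textbook implementation in NumPy, not material from the talk.

    # Generic GCC-PHAT sketch (standard textbook method, not from the talk).
    import numpy as np

    def gcc_phat(x1, x2, fs, max_tau=None):
        """Estimate the time difference of arrival (seconds) between channels."""
        n = len(x1) + len(x2)                   # zero-pad to avoid circular wrap
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        R = X1 * np.conj(X2)                    # cross-power spectrum
        R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
        cc = np.fft.irfft(R, n)                 # back to the lag domain
        max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

The learned spatial front-ends and neural beamformers discussed in the talk can be seen as replacing or augmenting exactly this kind of hand-crafted cue.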
Bio: Juan Pablo Bello is a Professor of Music Technology, Computer Science & Engineering, Electrical & Computer Engineering, and Urban Science at New York University. In 1998 he received a BEng in Electronics from the Universidad Simón Bolívar in Caracas, Venezuela, and in 2003 he earned a doctorate in Electronic Engineering at Queen Mary, University of London. Juan's expertise is in machine listening, audio signal processing, and music information retrieval. He has published close to 200 papers and articles in books, journals, and conference proceedings. Since 2016, he has been the director of the Music and Audio Research Lab (MARL), a multidisciplinary research center at the intersection of science, technology, music, and sound. Between 2019 and 2022 he was also the director of the NYU Center for Urban Science and Progress (CUSP). He is a Fellow of the IEEE and a Fulbright scholar, and his work has been supported by public and private institutions in Venezuela, the UK, and the US, including Frontier and CAREER awards from the National Science Foundation.


