This year, the best 10% of submitted papers were given a “Spotlight” oral slot. Of these, 10 papers were further selected as award candidates. The award selection process was led by the Awards Chair, Prof. Reinhold Haeb-Umbach, and a separate award committee.
The awards were made possible thanks to some of our Gold Sponsors: Meta Reality Labs for the Best Paper Award, and Amazon and Mitsubishi Electric Research Labs (MERL) for the Best Student Paper Awards.
The final selection was based on both the quality of the paper and the quality of its presentation at the workshop.
Best Paper and Student Paper Awards
The awards were announced at the Award Ceremony during the Tuesday dinner.
Best Paper Award (sponsored by Meta Reality Labs):
Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani
for the paper titled Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation (O2-3)
Best Student Paper Award (sponsored by Amazon):
You Zhang, for the paper titled Towards Perception-Informed Latent HRTF Representations (O1-5)
Co-authors: Andrew Francl, Ruohan Gao, Paul Calamia, Zhiyao Duan, Ishwarya Ananthabhotla
Best Student Paper Award (sponsored by Mitsubishi Electric Research Labs):
Krishna Subramani, for the paper titled Rethinking Non-Negative Matrix Factorization with Implicit Neural Representations (O3-1)
Co-authors: Paris Smaragdis, Takuya Higuchi, Mehrez Souden
Best Paper and Student Paper Award Candidates
(in order of presentation at the workshop)
- O1-2 Optimal Region-of-Interest Beamforming for Audio Conferencing with Dual Perpendicular Sparse Circular Sectors
Gal Itzhak, Simon Doclo, Israel Cohen
Abstract
We introduce a region-of-interest beamforming approach for audio conferencing that addresses dynamic acoustics and multiple-speaker scenarios. The approach employs a two-stage sparse optimization to select a subset of microphones from dual circular sector arrays: first on the xz plane and then on the xy plane, balancing spatial resolution and efficiency. Using the dual circular layout, we are able to reduce response variability across azimuth and elevation angles. The proposed approach maximizes broadband directivity while ensuring a controlled level of distortion and minimal white noise gain. Compared to existing methods, the mainlobe attained by the resulting beamformer is more accurately aligned with the region of interest. It also achieves preferable sidelobe and backlobe suppression. Finally, the proposed approach is shown to be superior considering the directivity factor and white noise gain, in particular at medium and high frequencies.
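For readers unfamiliar with directivity-maximizing designs, the sketch below shows the classical diagonally loaded superdirective beamformer that region-of-interest designs of this kind build on; it is not the paper's two-stage sparse method, and the array geometry, design frequency, and loading value are illustrative assumptions.

```python
# Minimal NumPy sketch (not the paper's method): diagonally loaded superdirective
# beamformer, i.e. maximize the directivity factor under a distortionless constraint
# while the loading term eps keeps the white noise gain under control.
import numpy as np

c = 343.0                                   # speed of sound (m/s)
f = 2000.0                                  # design frequency (Hz)
mics = np.array([[0.00, 0.0, 0.0],
                 [0.05, 0.0, 0.0],
                 [0.10, 0.0, 0.0],
                 [0.15, 0.0, 0.0]])         # example linear layout (m)
look_dir = np.array([1.0, 0.0, 0.0])        # unit vector toward the region of interest

# Free-field steering vector and diffuse-noise coherence matrix.
tau = mics @ look_dir / c
d = np.exp(-2j * np.pi * f * tau)
dist = np.linalg.norm(mics[:, None, :] - mics[None, :, :], axis=-1)
Gamma = np.sinc(2 * f * dist / c)           # np.sinc(x) = sin(pi x) / (pi x)

eps = 1e-2                                  # diagonal loading: trades DF against WNG
A = Gamma + eps * np.eye(len(mics))
w = np.linalg.solve(A, d) / (d.conj() @ np.linalg.solve(A, d))

DF = 1.0 / np.real(w.conj() @ Gamma @ w)    # directivity factor (distortionless response)
WNG = 1.0 / np.real(w.conj() @ w)           # white noise gain
print(f"DF = {DF:.2f}, WNG = {WNG:.2f}")
```

Increasing eps lowers the attainable directivity factor but improves white noise gain, which is the trade-off the abstract refers to when it mentions controlled distortion and minimal white noise gain.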
- O1-3 Physics-Informed Direction-Aware Neural Acoustic Fields
Yoshiki Masuyama, François G. Germain, Gordon Wichern, Christopher Ick, Jonathan Le Roux
Abstract
This paper presents a physics-informed neural network (PINN) for modeling first-order Ambisonic (FOA) room impulse responses (RIRs). PINNs have demonstrated promising performance in sound field interpolation by combining the powerful modeling capability of neural networks and the physical principles of sound propagation. In room acoustics, PINNs have typically been trained to represent the sound pressure measured by omnidirectional microphones where the wave equation or its frequency-domain counterpart, i.e., the Helmholtz equation, is leveraged. Meanwhile, FOA RIRs additionally provide spatial characteristics and are useful for immersive audio generation with a wide range of applications. In this paper, we extend the PINN framework to model FOA RIRs. We derive two physics-informed priors for FOA RIRs based on the correspondence between the particle velocity and the (X, Y, Z)-channels of FOA. These priors associate the predicted W-channel and other channels through their partial derivatives and impose the physically feasible relationship on the four channels. Our experiments confirm the effectiveness of the proposed method compared with a neural network without the physics-informed prior.
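As a rough illustration of how such a physics-informed prior can be imposed (the exact formulation is in the paper), the following PyTorch sketch couples a coordinate network's predicted W-channel and (X, Y, Z)-channels through Euler's equation of motion via automatic differentiation; the network architecture, the scaling constants, and the loss weight are assumptions.

```python
# Minimal sketch: relate the predicted W-channel (pressure-like) and X/Y/Z-channels
# (particle-velocity-like) through Euler's equation  rho * dv/dt = -grad p  as an
# extra physics-informed loss term. Scaling constants are placeholders.
import torch
import torch.nn as nn

class FOAField(nn.Module):
    """Coordinate MLP: (x, y, z, t) -> (W, X, Y, Z)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 4),
        )

    def forward(self, coords):          # coords: (N, 4)
        return self.net(coords)         # (N, 4)

def physics_residual(model, coords, rho=1.2):
    coords = coords.requires_grad_(True)
    out = model(coords)                 # columns: W, X, Y, Z
    W, V = out[:, 0], out[:, 1:]
    grad_W = torch.autograd.grad(W.sum(), coords, create_graph=True)[0][:, :3]
    dV_dt = torch.stack([
        torch.autograd.grad(V[:, i].sum(), coords, create_graph=True)[0][:, 3]
        for i in range(3)
    ], dim=1)
    # Residual of  rho * dV/dt + grad(W) = 0, averaged over sampled points.
    return ((rho * dV_dt + grad_W) ** 2).mean()

model = FOAField()
coords = torch.rand(1024, 4)            # random (x, y, z, t) collocation points
loss_data = torch.tensor(0.0)           # fitting the measured FOA RIRs would go here
loss = loss_data + 1e-2 * physics_residual(model, coords)
loss.backward()
```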
- O1-5 Towards Perception-Informed Latent HRTF Representations
You Zhang, Andrew Francl, Ruohan Gao, Paul Calamia, Zhiyao Duan, Ishwarya Ananthabhotla
Abstract
Personalized head-related transfer functions (HRTFs) are essential for ensuring a realistic auditory experience over headphones, because they take into account individual anatomical differences that affect listening. Most machine learning approaches to HRTF personalization rely on a learned low-dimensional latent space to generate or select custom HRTFs for a listener. However, these latent representations are typically learned in a manner that optimizes for spectral reconstruction but not for perceptual compatibility, meaning they may not necessarily align with perceptual distance. In this work, we first study whether traditionally learned HRTF representations are well correlated with perceptual relations using auditory-based objective perceptual metrics; we then propose a method for explicitly embedding HRTFs into a perception-informed latent space, leveraging a metric-based loss function and supervision via Metric Multidimensional Scaling (MMDS). Finally, we demonstrate the applicability of these learned representations to the task of HRTF personalization. We suggest that our method has the potential to render personalized spatial audio, leading to an improved listening experience.
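A minimal sketch of the underlying idea, assuming a perceptual distance matrix is available from an auditory model: train an HRTF autoencoder whose latent pairwise distances are regressed toward the perceptual distances (an MDS-style stress term) in addition to spectral reconstruction. The feature dimensions, loss weights, and random placeholder data below are illustrative only.

```python
# Illustrative sketch: embed HRTF features so that latent Euclidean distances track a
# given perceptual distance matrix, alongside a reconstruction objective.
import torch
import torch.nn as nn

class HRTFAutoencoder(nn.Module):
    def __init__(self, in_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def metric_loss(z, D_perc):
    """Match pairwise latent distances to perceptual distances (stress-like loss)."""
    D_latent = torch.cdist(z, z)
    return ((D_latent - D_perc) ** 2).mean()

model = HRTFAutoencoder()
hrtfs = torch.rand(32, 256)          # e.g. log-magnitude HRTF features per subject
D_perc = torch.rand(32, 32)          # placeholder perceptual distance matrix
D_perc = 0.5 * (D_perc + D_perc.T)   # symmetrize
z, recon = model(hrtfs)
loss = nn.functional.mse_loss(recon, hrtfs) + 0.1 * metric_loss(z, D_perc)
loss.backward()
```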
- O2-2 TACOS: Temporally-Aligned Audio Captions for Language-Audio Pretraining
Paul Primus, Florian Schmid, Gerhard Widmer
Abstract
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models, particularly if they are expected to produce frame-level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.
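The core training idea can be sketched as follows, assuming precomputed frame-level audio embeddings, one caption embedding, and a binary mask marking the caption's annotated segment; the encoders, temperature, and exact loss form below are illustrative rather than the paper's recipe.

```python
# Toy sketch of frame-wise text-audio contrastive training: the caption embedding is
# pulled toward audio frames inside its annotated segment and pushed away from the rest.
import torch
import torch.nn.functional as F

def framewise_contrastive_loss(frame_emb, text_emb, seg_mask, tau=0.07):
    """
    frame_emb: (T, D) audio frame embeddings
    text_emb:  (D,)   embedding of one caption
    seg_mask:  (T,)   1 for frames inside the caption's temporal segment, else 0
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = frame_emb @ text_emb / tau                    # (T,) similarity per frame
    log_p = F.log_softmax(sim, dim=0)                   # distribution over frames
    # Maximize probability mass on frames covered by the caption's segment.
    return -(log_p * seg_mask).sum() / seg_mask.sum().clamp(min=1)

T, D = 100, 128
frame_emb = torch.randn(T, D, requires_grad=True)
text_emb = torch.randn(D, requires_grad=True)
seg_mask = torch.zeros(T)
seg_mask[30:55] = 1.0                                   # caption covers frames 30-54
loss = framewise_contrastive_loss(frame_emb, text_emb, seg_mask)
loss.backward()
```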
- O2-3 Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani
Abstract
Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
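A schematic sketch of the distillation step, assuming teacher and student sound-class logits and a per-class difficulty weight from a heuristic model are available; the temperature, weighting scheme, and dimensions are placeholders rather than the paper's exact setup.

```python
# Schematic cross-modal distillation on sound-class logits: the teacher (e.g. the visual
# model's class scores) guides the student (e.g. the audio model), with per-class weights
# emphasizing classes estimated to be hard for the student.
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_logits, teacher_logits, class_weights, T=2.0):
    """
    student_logits, teacher_logits: (B, C)
    class_weights: (C,) larger for classes predicted to be harder for the student
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    per_class = -(p_teacher * log_p_student)            # (B, C) cross-entropy terms
    return (per_class * class_weights).sum(dim=-1).mean() * (T * T)

B, C = 8, 50                                             # batch of clips, sound classes
student_logits = torch.randn(B, C, requires_grad=True)
teacher_logits = torch.randn(B, C)
class_weights = torch.rand(C)                            # from a heuristic difficulty model
loss = weighted_distillation_loss(student_logits, teacher_logits, class_weights)
loss.backward()
```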
- O2-5 Modulation Discovery with Differentiable Digital Signal Processing
Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
Abstract
Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise parameter values without considering the shape, structure, and routing of the underlying modulation curves. We propose a neural sound-matching approach that leverages modulation extraction, constrained control signal parameterizations, and differentiable digital signal processing (DDSP) to discover the modulations present in a sound. We demonstrate the effectiveness of our approach on highly modulated synthetic and real audio samples, its applicability to different DDSP synth architectures, and investigate the trade-off it incurs between interpretability and sound-matching accuracy. We make our code and audio samples available and provide the trained DDSP synths in a VST plugin.
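As a toy example of a constrained, differentiable control-signal parameterization (one ingredient of the approach, not the full sound-matching system), the sketch below fits a sine LFO with learnable rate, depth, and offset to a target modulation curve by gradient descent; all values are illustrative.

```python
# Minimal sketch: an LFO defined by a few learnable parameters, rendered at frame rate,
# that can be optimized by gradient descent inside a DDSP-style synth.
import torch
import torch.nn as nn

class LFO(nn.Module):
    """Sine LFO with learnable rate (Hz), depth, and offset."""
    def __init__(self):
        super().__init__()
        self.log_rate = nn.Parameter(torch.tensor(0.0))   # rate = exp(log_rate)
        self.depth = nn.Parameter(torch.tensor(0.5))
        self.offset = nn.Parameter(torch.tensor(0.0))

    def forward(self, n_frames, frame_rate=100.0):
        t = torch.arange(n_frames) / frame_rate
        rate = torch.exp(self.log_rate)
        return self.offset + self.depth * torch.sin(2 * torch.pi * rate * t)

lfo = LFO()
target_curve = torch.sin(2 * torch.pi * 2.0 * torch.arange(400) / 100.0)  # 2 Hz target
opt = torch.optim.Adam(lfo.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((lfo(400) - target_curve) ** 2).mean()
    loss.backward()
    opt.step()
```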
- O3-1 Rethinking Non-Negative Matrix Factorization with Implicit Neural Representations
Krishna Subramani, Paris Smaragdis, Takuya Higuchi, Mehrez Souden
Abstract
Non-negative Matrix Factorization (NMF) is a powerful technique for analyzing regularly-sampled data, i.e., data that can be stored in a matrix. For audio, this has led to numerous applications using time-frequency (TF) representations like the Short-Time Fourier Transform. However, extending these applications to irregularly-spaced TF representations, like the Constant-Q transform, wavelets, or sinusoidal analysis models, has not been possible since these representations cannot be directly stored in matrix form. In this paper, we formulate NMF in terms of learnable functions (instead of vectors) and show that NMF can be extended to a wider variety of signal classes that need not be regularly sampled.
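A conceptual sketch of the reformulation, assuming scattered (time, frequency, magnitude) observations: replace the non-negative matrices W and H with small coordinate networks W_k(f) and H_k(t) whose outputs are kept non-negative, so the factorization can be evaluated at irregularly spaced points. The network sizes and random data below are placeholders, not the paper's implementation.

```python
# Sketch of NMF with learnable functions: sum_k H_k(t) * W_k(f) evaluated at arbitrary
# (t, f) points, with non-negativity enforced by a Softplus output.
import torch
import torch.nn as nn

class NonNegMLP(nn.Module):
    """Maps a scalar coordinate to K non-negative values."""
    def __init__(self, K, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, K), nn.Softplus())

    def forward(self, x):                    # x: (N, 1)
        return self.net(x)                   # (N, K)

K = 8
W_f = NonNegMLP(K)                           # spectral basis functions of frequency
H_t = NonNegMLP(K)                           # activation functions of time

# Irregularly spaced analysis points (e.g. constant-Q bins) with observed magnitudes.
t = torch.rand(5000, 1)                      # time of each TF point
f = torch.rand(5000, 1)                      # frequency of each TF point
mag = torch.rand(5000)                       # observed non-negative magnitudes

pred = (H_t(t) * W_f(f)).sum(dim=-1)         # sum_k H_k(t) * W_k(f) at each point
loss = ((pred - mag) ** 2).mean()
loss.backward()
```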
- O3-2 Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance
Jakob Kienegger, Alina Mannanova, Huajian Fang, Timo Gerkmann
Abstract
Recent works on deep non-linear spatially selective filters demonstrate exceptional enhancement performance with computationally lightweight architectures for stationary speakers of known directions. However, to maintain this performance in dynamic scenarios, resource-intensive data-driven tracking algorithms become necessary to provide precise spatial guidance conditioned on the initial direction of a target speaker. As this additional computational overhead hinders application in resource-constrained scenarios such as real-time speech enhancement, we present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. Assuming a causal, sequential processing style, we introduce temporal feedback to leverage the enhanced speech signal of the spatially selective filter to compensate for the limited modeling capabilities of the particle filter. Evaluation on a synthetic dataset illustrates how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance. A listening test with real-world recordings complements these findings by indicating a clear trend towards our proposed self-steering pipeline as the preferred choice over comparable methods.
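The self-steering loop can be caricatured as follows, with the spatially selective filter and the DOA observation extracted from its output replaced by clearly labeled stand-ins; only the particle-filter bookkeeping and the feedback structure are meant to be illustrative here.

```python
# Toy illustration of the self-steering idea: a lightweight particle filter tracks the
# target direction, and a direction observation derived from the enhanced output is fed
# back to update the filter. The enhancement and DOA-observation steps are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def enhance(frame, doa_estimate):
    """Placeholder for the spatially selective filter steered to doa_estimate."""
    return frame  # a real system would return the enhanced target signal

def observe_doa(enhanced_frame, true_doa):
    """Placeholder: DOA observation from the enhanced signal (here: noisy ground truth)."""
    return true_doa + rng.normal(0.0, 5.0)

n_frames, n_particles = 200, 100
true_doa = 30.0
particles = rng.uniform(0.0, 180.0, n_particles)       # DOA hypotheses in degrees
weights = np.full(n_particles, 1.0 / n_particles)

for k in range(n_frames):
    true_doa = np.clip(true_doa + rng.normal(0.0, 0.5), 0.0, 180.0)  # speaker slowly moves
    particles += rng.normal(0.0, 1.0, n_particles)                   # prediction step
    doa_estimate = float(np.sum(weights * particles))                # current steering direction
    enhanced = enhance(np.zeros(256), doa_estimate)                  # dummy audio frame
    z = observe_doa(enhanced, true_doa)                              # feedback observation
    weights *= np.exp(-0.5 * ((particles - z) / 5.0) ** 2)           # update step
    weights /= weights.sum()
    if 1.0 / np.sum(weights ** 2) < n_particles / 2:                 # resample if degenerate
        idx = rng.choice(n_particles, n_particles, p=weights)
        particles, weights = particles[idx], np.full(n_particles, 1.0 / n_particles)
```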
- O3-3 A Lightweight and Robust Method for Blind Wideband-to-Fullband Extension of Speech
Jan Büthe, Jean-Marc Valin
Abstract
Reducing the bandwidth of speech is common practice in resource-constrained environments like low-bandwidth speech transmission or low-complexity vocoding. We propose a lightweight and robust method for extending the bandwidth of wideband speech signals that is inspired by classical methods developed in the speech coding context. The resulting model has just ~370 K parameters and a complexity of ~140 MFLOPS (or ~70 MMACS). With a frame size of 10 ms and a lookahead of only 0.27 ms, the model is well-suited for use with common wideband speech codecs. We evaluate the model’s robustness by pairing it with the Opus SILK speech codec (1.5 release) and verify in a P.808 DCR listening test that it significantly improves quality from 6 to 12 kb/s. We also demonstrate that Opus 1.5 together with the proposed bandwidth extension at 9 kb/s meets the quality of 3GPP EVS at 9.6 kb/s and that of Opus 1.4 at 18 kb/s, showing that the blind bandwidth extension can meet the quality of classical guided bandwidth extensions, thus providing a way for backward-compatible quality improvement.
- O3-5 Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang
Abstract
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
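A rough sketch of the recipe studied in the paper, assuming a HuggingFace wav2vec 2.0 checkpoint as an example SFM: take the hidden states of a single encoder layer and feed them to a temporal prediction head (here a BiLSTM) that regresses an intelligibility score. The model name, layer index, and head architecture are example choices, not the paper's configuration.

```python
# Sketch: single-layer SFM features plus a temporal head for intelligibility prediction.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class IntelligibilityHead(nn.Module):
    """Temporal head: BiLSTM over frame features, mean-pooled, mapped to one score."""
    def __init__(self, in_dim=768, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                      # feats: (B, T, in_dim)
        h, _ = self.lstm(feats)
        return self.out(h.mean(dim=1)).squeeze(-1) # (B,) predicted intelligibility

sfm = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")   # example SFM
head = IntelligibilityHead()
layer_idx = 8                                       # single layer, chosen on a dev set

waveform = torch.randn(2, 16000)                    # 2 clips of 1 s at 16 kHz
with torch.no_grad():
    hidden = sfm(waveform, output_hidden_states=True).hidden_states[layer_idx]
score = head(hidden)                                # (2,) scores to regress against labels
```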