Schedule
|  | Sunday, October 15 | Monday, October 16 | Tuesday, October 17 | Wednesday, October 18 |
| --- | --- | --- | --- | --- |
| 7:00 – 8:00 |  | Breakfast | Breakfast | Breakfast |
| 8:00 – 8:50 |  | K1: Keynote Talk by Ville Pulkki | K2: Keynote Talk by Augusto Sarti | K3: Keynote Talk by Mark Plumbley |
| 8:50 – 10:10 |  | L1: Audio Event Detection and Classification | L3: Audio and Music Signal Processing | L5: Signal Enhancement |
| 10:10 – 10:30 |  | Coffee Break | Coffee Break | Coffee Break |
| 10:30 – 12:30 |  | P1: Signal Enhancement and Source Separation | P2: Array Processing | P3: Music, Audio and Speech Processing |
| 12:30 – 14:00 |  | Lunch/Afternoon Break | Lunch/Afternoon Break | Lunch/Closing |
| 14:00 – 16:00 |  |  |  |  |
| 16:00 – 18:00 | Registration | L2: Microphone and Loudspeaker Arrays | L4: Source Separation |  |
| 18:00 – 18:15 |  |  |  |  |
| 18:15 – 20:00 | Dinner | Dinner | Dinner |  |
| 20:00 – 22:00 | Welcome Reception | Cocktails (kindly supported by mh acoustics) | Demonstrations & Cocktails |  |
Sunday, October 15
Sunday, October 15, 16:00 – 18:00
Registration
Room: Mountain View Room
Sunday, October 15, 18:15 – 20:00
Dinner
Room: West Dining Room
Sunday, October 15, 20:00 – 22:00
Welcome Reception
Room: West Dining Room
Monday, October 16
Monday, October 16, 07:00 – 08:00
Breakfast
Room: West Dining Room
Monday, October 16, 08:00 – 08:50
K1: Keynote Talk by Ville Pulkki
Room: Conference House
Monday, October 16, 08:50 – 10:10
L1: Audio Event Detection and Classification
Lecture 1
Room: Conference House
- Metric Learning Based Data Augmentation for Environmental Sound Classification
-
Deep neural networks have been widely applied in the field of environmental sound classification. However, due to the scarcity of carefully labeled data, they tend to suffer from severe over-fitting. Besides designing models with better generalization abilities, previous works have also tried to enlarge the training set by various data augmentation methods, mostly based on specific domain knowledge. These methods tend to be simplistic and substantially increase computation and storage costs. In this paper, we begin with class-conditional data augmentation, which draws explicit inspiration from the results of brute-force augmentation. To further reduce the amount of data and exclude infeasible augmented samples, we propose a framework to filter the augmented data by means of metric learning. Experiments on a widely used environmental sound dataset show that our framework matches the performance of the other augmentation strategies while reducing the amount of training data by a large margin.
- Transfer Learning of Weakly Labelled Audio
-
Many machine learning tasks have been shown to be solvable with impressive levels of success, given large amounts of training data and computational power. For tasks that lack sufficient data to achieve high performance, transfer learning methods can be applied. These refer to performing the new task with some prior knowledge of the nature of the data, gained by first performing a different task for which training data is abundant. Previously shown to be successful for machine vision and natural language processing, transfer learning is investigated in this work for audio analysis. We propose to solve the problem of weak-label classification (tagging) with small amounts of training data by transferring abstract knowledge about the nature of audio data from another tagging task. Three neural network architectures are proposed and evaluated, showing impressive classification accuracy gains.
- Sound Event Detection in Synthetic Audio: Analysis of the DCASE 2016 Task Results
-
As part of the 2016 public evaluation challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016), Task 2 focused on evaluating sound event detection systems using synthetic mixtures of office sounds. This task, which follows the ‘Event Detection – Office Synthetic’ task of DCASE 2013, studies the behaviour of tested algorithms when facing controlled levels of audio complexity with respect to noise and polyphony/density, with the added benefit of a very accurate ground truth. This paper presents the task formulation, evaluation metrics, and submitted systems, and provides a statistical analysis of the results achieved with respect to various aspects of the evaluation dataset.
- Learning Vocal Mode Classifiers from Heterogeneous Data Sources
-
Studies on sound event recognition are commonly based on cross-validation on a single dataset, without considering cases where the training and testing data are mismatched. This paper targets a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. Multiple datasets are used as training material for generalization in acoustic content, and a dataset containing both speech and singing is used for testing. The experimental results show that the classification performance is much lower (69.6%) compared to using cross-validation on the test dataset (95.5%). Feature normalization techniques are evaluated to bridge the mismatch in channel effects from heterogeneous data sources. Subdataset-wise mean-variance normalization gives the best improvement to the classifier, raising the accuracy from 69.6% to 96.8%. However, this relies on a sufficient amount of data from the recognition data source to estimate the feature distribution. The best achieved accuracy is 81.2% when the feature distribution can only be estimated recording-wise.
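As a rough illustration of the subdataset-wise mean-variance normalization described above (a minimal sketch, not the authors' implementation; the function and parameter names are placeholders), the following Python snippet standardizes each feature dimension using statistics estimated separately per data source.

```python
import numpy as np

def per_dataset_mvn(features, dataset_ids, eps=1e-8):
    """Subdataset-wise mean-variance normalization (illustrative sketch):
    each feature dimension is standardized with statistics estimated
    separately for every data source, removing per-source channel offsets."""
    features = np.asarray(features, dtype=float)
    dataset_ids = np.asarray(dataset_ids)
    out = np.empty_like(features)
    for ds in np.unique(dataset_ids):
        idx = dataset_ids == ds
        mu = features[idx].mean(axis=0)
        sigma = features[idx].std(axis=0) + eps
        out[idx] = (features[idx] - mu) / sigma
    return out
```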
Monday, October 16, 10:30 – 12:30
P1: Signal Enhancement and Source Separation
Poster 1
Room: Parlor
- Multi-scale Multi-band DenseNets for Audio Source Separation
-
This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of audio source separation, current state-of-the-art approaches employ deep neural networks to obtain the source spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To deal with the specific problem of audio source separation, an up-sampling layer, block skip connections, and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms the state of the art on the SiSEC 2016 competition by a large margin in terms of Signal-to-Distortion Ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time compared with other methods.
- Underdetermined Methods for Multichannel Audio Enhancement with Partial Preservation of Background Sources
-
Multichannel audio enhancement and source separation traditionally attempt to isolate a single source and remove all background noise. In listening enhancement applications, however, a portion of the background sources should be retained to preserve the listener’s spatial awareness. We describe a time-varying spatial filter designed to apply a different gain to each sound source with minimal distortion of the source spectra and spatial cues. The filter, inspired by methods from underdetermined source separation, alters its distortion weights at each time-frequency point to preserve the dominant sources. The nonstationary model allows the filter to process more sources than it otherwise could, while the partial background preservation improves robustness to noise and errors.
- A Convex Optimization Approach for Time-Frequency Mask Estimation
-
In this paper, we propose a new time-frequency mask method for computational auditory scene analysis (CASA) based on convex optimization of the binary mask. In the proposed method, the pitch estimation and segment segregation of conventional CASA are completely replaced by convex optimization of the speech power. An objective function of the speech power is built by considering the cross-correlation between the power spectra of noisy speech and noise in each channel of a Gammatone filterbank, and the speech power is estimated by a gradient descent method. The time-frequency units dominated by speech and noise are then labeled by comparing the powers of the noisy speech, the estimated speech, and the noise. Erroneous local masks are also removed by using the Teager energy of the estimated speech and time-frequency unit smoothing. Results for the average segmental signal-to-noise ratio improvement, the HIT-False Alarm rate, and a subjective test show that the proposed method outperforms the reference methods.
- Music/Voice Separation Using the 2D Fourier Transform
-
Audio source separation is the act of isolating sound sources in an audio scene. One application of source separation is singing voice extraction. In this work, we present a novel approach for music/voice separation that uses the 2D Fourier Transform (2DFT). Our approach leverages how periodic patterns manifest in the 2D Fourier Transform and is connected to research in biological auditory systems as well as image processing. We find that our system is competitive with and simpler than existing unsupervised source separation approaches that leverage similar assumptions.
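The 2DFT idea can be pictured with a small, assumption-heavy sketch (a toy simplification, not the authors' system): strong peaks in the 2D Fourier transform of the magnitude spectrogram are treated as the repeating musical background, and the residual is attributed to the voice.

```python
import numpy as np

def twodft_vocal_mask(mag_spec, peak_percentile=99.0):
    """Toy 2DFT separation sketch: keep only the strongest 2DFT bins as the
    repeating (music) background, take the residual as the vocal estimate,
    and return a soft mask to apply to the mixture STFT."""
    F = np.fft.fft2(mag_spec)
    peaks = np.abs(F) >= np.percentile(np.abs(F), peak_percentile)
    background = np.real(np.fft.ifft2(F * peaks))       # repeating background
    vocal = np.maximum(mag_spec - background, 0.0)      # residual -> voice
    return np.clip(vocal / np.maximum(mag_spec, 1e-12), 0.0, 1.0)
```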
- Exploiting the Intermittency of Speech for Joint Separation and Diarization of Speech Signals
-
Natural conversations are spontaneous spoken exchanges involving two or more people talking in an intermittent manner. One therefore expects the recording of such a conversation to have intervals where some of the speakers are silent. Yet, most audio source separation methods consider the sound sources to be emitting continuously over the total duration of the processed mixture. In this paper we propose a generative model for multichannel audio source separation (MASS) where the sources may have pauses. We model the activity of all sources at time-frame resolution as a hidden state, the diarization state, enabling us to activate/de-activate the sound sources at the frame level. We plug this diarization model into the spatial covariance matrix model proposed for MASS in [1]. The proposed method shows an advantage in performance over the state of the art when separating speech mixtures of intermittent speakers.
- A Novel Target Speaker Dependent Postfiltering Approach for Multichannel Speech Enhancement
-
In this article, we present a target-speaker-dependent speech enhancement system to enhance a specific target talker in the presence of real-life background noise. The proposed system uses a multichannel processing stage to produce a noise reference signal. This noise reference signal is further used not only to compute the residual noise statistics, but also to learn the noise subspace for Non-negative Matrix Factorization (NMF) based postfiltering. In our evaluations using the GRID database, both speaker-dependent and speaker-independent state-of-the-art enhancers have been considered. Our proposed system not only outperforms the speaker-independent systems significantly, but also shows improvement over a recently proposed speaker-dependent system. Finally, an online version of the algorithm is also proposed, with a slight compromise in performance compared to the batch system.
- Explaining the Parameterized Wiener Filter with Alpha-Stable Processes
-
This paper introduces a new method for single-channel denoising that sheds new light on classical early developments on this topic from the 70s and 80s, namely Wiener filtering and spectral subtraction. Both operating in the short-time Fourier transform domain, these methods consist of estimating the power spectral density (PSD) of the noise without speech. The clean speech signal is then obtained by manipulating the corrupted time-frequency bins using these noise PSD estimates. Theoretically grounded when using power spectra, these methods were subsequently generalized to magnitude spectra, or shown to yield better performance by weighting the PSDs in the so-called parameterized Wiener filter. Both of these strategies were long considered ad hoc. To the best of our knowledge, while we recently proposed an interpretation of magnitude processing, there is still no theoretical result that would justify the better performance of parameterized Wiener filters. Here, we show how the α-stable probabilistic model for waveforms naturally leads to these weighted filters, and we provide a grounded and fast algorithm to enhance corrupted audio that compares favorably with classical denoising methods.
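For orientation only, here is a minimal sketch of the generic parameterized Wiener-type gain discussed above (the textbook form, not the paper's α-stable derivation); `alpha` weighting the noise PSD and `beta` as gain exponent are placeholder names.

```python
import numpy as np

def parameterized_wiener_gain(noisy_psd, noise_psd, alpha=1.0, beta=1.0, gain_floor=1e-2):
    """Generic parameterized Wiener-type gain per time-frequency bin:
    alpha = beta = 1 recovers the standard Wiener filter, other settings
    give the classical heuristic variants."""
    speech_psd = np.maximum(noisy_psd - alpha * noise_psd, 0.0)   # (over-)subtracted speech PSD
    gain = (speech_psd / np.maximum(noisy_psd, 1e-12)) ** beta
    return np.maximum(gain, gain_floor)

# The enhanced STFT is S_hat = gain * X, with X the complex STFT of the noisy signal.
```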
- An EM Algorithm for Audio Source Separation Based on the Convolutive Transfer Function
-
This paper addresses the problem of audio source separation from (possibly under-determined) multichannel convolutive mixtures. We propose a separation method based on the convolutive transfer function (CTF) in the short-time Fourier transform domain. For strongly reverberant signals, the CTF is a much more appropriate model than the widely-used multiplicative transfer function approximation. An Expectation-Maximization (EM) algorithm is proposed to jointly estimate the model parameters, including the CTF coefficients of the mixing filters, and infer the sources. Experiments show that the proposed method provides very satisfactory performance on highly reverberant speech mixtures.
- Guiding Audio Source Separation by Video Object Information
-
In this work we propose novel joint and sequential approaches for the task of single channel audio source separation using information about the sounding object’s motion. This is done within the popular non-negative matrix factorization framework. Specifically, we present methods that utilize non-negative least squares (NNLS) and canonical correlation analysis (CCA) to couple motion and audio information. The proposed techniques generalize recent work carried out on NMF-based motion-informed source separation and easily extend to video data. Experiments with two distinct multimodal datasets of string instrument performance recordings illustrate their advantages over the existing methods.
- Low-latency Approximation of Bidirectional Recurrent Networks for Speech Denoising
-
The ability to separate speech from non-stationary background disturbances using only a single channel of information has increased significantly with the adoption of deep learning techniques. In these approaches, a time-frequency mask that recovers clean speech from noisy mixtures is learned from data. Recurrent neural networks are particularly well-suited to this sequential prediction task, with the bidirectional variant (e.g., BLSTM) achieving strong results. The downside of bidirectional models is that they require offline operation to perform both a forward and backward pass over the data. In this paper we compare two different low-latency bidirectional approximations. The first uses block processing with a regular bidirectional network, while the second uses the recently proposed lookahead convolutional layer. Our results show that using just 1000 ms of backward context can recover approximately 75% of the performance improvement gained from using bidirectional as opposed to forward-only recurrent networks.
- Low Latency Sound Source Separation Using Convolutional Recurrent Neural Networks
-
Deep neural networks (DNNs) have been successfully employed for the problem of monaural sound source separation, achieving state-of-the-art results. In this paper, we propose using a convolutional recurrent neural network (CRNN) architecture for tackling this problem. We focus on applications where low algorithmic delay (≤ 10 ms) is paramount. The Danish hearing in noise test (HINT) database is used with multiple talkers. We show that the proposed architecture can achieve slightly better performance compared to feedforward DNNs and long short-term memory (LSTM) networks. In addition to reporting separation performance metrics (i.e., source-to-distortion ratios), we also report extended short-term objective intelligibility (eSTOI) scores due to non-stationary interferers.
- PSD Estimation of Multiple Sound Sources in a Reverberant Room Using a Spherical Microphone Array
-
We propose an efficient method to estimate source power spectral densities (PSDs) in a multi-source reverberant environment using a spherical microphone array. The proposed method utilizes the spatial correlation between the spherical harmonics (SH) coefficients of a sound field to estimate source PSDs. The use of the spatial cross-correlation of the SH coefficients allows us to employ the method in an environment with a larger number of sources compared to conventional methods. Furthermore, the orthogonality property of the SH basis functions saves the effort of designing specific beampatterns required by conventional beamformer-based methods. We evaluate the performance of the algorithm with different numbers of sources in practical reverberant and non-reverberant rooms. We also demonstrate an application of the method by separating source signals using a conventional beamformer and a Wiener post-filter designed from the estimated PSDs.
- Joint Wideband Source Localization and Acquisition Based on a Grid-Shift Approach
-
This paper addresses the problem of joint wideband localization and acquisition of acoustic sources. The source locations as well as acquisition of the original source signals are obtained in a joint fashion by solving a sparse recovery problem. Spatial sparsity is enforced by discretizing the acoustic scene into a grid of predefined dimensions. In practice, energy leakage from the source location to the neighboring grid points is expected to produce spurious location estimates, since the source location will not coincide with one of the grid points. To alleviate this problem we introduce the concept of grid-shift. A particular source is then near a point on the grid in at least one of a set of shifted grids. For the selected grid, other sources will generally not be on a grid point, but their energy is distributed over many points. A large number of experiments on real speech signals show the localization and acquisition effectiveness of the proposed approach under clean, noisy and reverberant conditions.
- Experimental Study of Robust Beamforming Techniques for Acoustic Applications
-
In this paper, we investigate robust beamforming methods for wideband signal processing in noisy and reverberant environments. In such environments, the appearance of steering vector estimation errors is inevitable, which degrades the performance of beamformers. Here, we study two types of robust beamformers against this estimation inaccuracy. The first type includes the norm constrained Capon, the robust Capon, and the doubly constrained robust Capon beamformers. The underlying principle is to add steering vector uncertainty constraint and norm constraint to the optimization problem to improve the beamformer’s robustness. The second one is the amplitude and phase estimation method, which utilizes both time and spatial smoothing to obtain robust beamforming. Experiments are presented to demonstrate the performance of the robust beamformers in acoustic environments. The results show that the robust beamformers outperform the non-robust methods in many respects: 1) robust performance in reverberation and different noise levels; 2) resilience against steering vector and covariance matrix estimation errors; and 3) better speech quality and intelligibility.
- Modulation Spectrum Based Beamforming for Speech Enhancement
-
In array signal processing, beamforming is the common technique for aligning time or phase differences between multi-microphone signals, in the time or frequency domain respectively. In this paper, we investigate the modulation domain as a means to improve beamforming for noise reduction. In the modulation domain, signals are decomposed into modulators and carriers; more specifically, we apply filtering and spectral subtraction in the modulation domain based on the properties of speech and noise. Simulation results show that the modulation-domain processing improves general beamformers in noise reduction over a range of noise levels and reverberation times.
- A Conformal, Helmet-Mounted Microphone Array for Auditory Situational Awareness and Hearing Protection
-
Hearing protectors often are considered detrimental to auditory situational awareness due in part to the distortions they impose on localization cues. The addition of a helmet or other head-worn equipment can exacerbate this problem by covering the ears and effectively changing the shape of the head. In an effort to design a hearing-protection system that maintains natural auditory localization performance, we have revisited the design of a conformal array of microphones integrated into a helmet as described by Chapin et al. (2004). The system involves beamforming for spatial selectivity, filtering with head-related transfer functions to reintroduce spatial auditory cues to the beamformed data, dynamic range compression to attenuate potentially damaging sound levels, and insert earphones to deliver binaural audio and provide hearing protection. All directions are auralized simultaneously to provide full spatial coverage around the wearer. In this paper we describe the design, implementation, and testing of a prototype 32-channel system, including 3D modeling and simulations, signal-processing approaches, real-time processing, and localization performance.
- Multi-channel Late Reverberation Power Spectral Density Estimation Based on Nuclear Norm Minimization
-
Multi-channel methods for estimating the late reverberation power spectral density (PSD) rely on an estimate of the direction of arrival (DOA) of the speech source or of the relative early transfer functions (RETFs) of the target signal from a reference microphone to all microphones. The DOA and the RETFs may be difficult to estimate accurately, particularly in highly reverberant and noisy scenarios. In this paper we propose a novel multi-channel method to estimate the late reverberant PSD which does not require estimates of the DOA or RETFs. The late reverberation is modeled as an isotropic sound field and the late reverberant PSD is estimated based on the eigenvalues of the prewhitened received signal PSD matrix. Experimental results demonstrate the advantages of using the proposed estimator in a multi-channel Wiener filter for speech dereverberation, outperforming a recently proposed maximum likelihood estimator both when the DOA is perfectly estimated as well as in the presence of DOA estimation errors.
- Design of Robust Two-Dimensional Polynomial Beamformers as a Convex Optimization Problem with Application to Robot Audition
-
We propose a robust two-dimensional polynomial beamformer design method, formulated as a convex optimization problem, which allows for flexible steering of a previously proposed data-independent robust beamformer in both azimuth and elevation direction. As an exemplary application, the proposed two-dimensional polynomial beamformer design is applied to a twelve-element microphone array, integrated into the head of a humanoid robot. To account for the effects of the robot’s head on the sound field, measured head-related transfer functions are integrated into the optimization problem as steering vectors. The two-dimensional polynomial beamformer design is evaluated using signal-independent and signal-dependent measures. The results confirm that the proposed design approximates the original non-polynomial beamformer design very accurately, which makes it an attractive approach for robust real-time data-independent beamforming.
Monday, October 16, 12:30 – 16:00
Lunch/Afternoon Break
Room: West Dining Room
Monday, October 16, 16:00 – 18:00
L2: Microphone and Loudspeaker Arrays
Lecture 2
Room: Conference House
- Multizone Sound Reproduction in Reverberant Environments Using an Iterative Least-Squares Filter Design Method with a Spatiotemporal Weighting Function
-
A previously proposed iterative filter design procedure, referred to as Iterative DFT-domain Inversion (IDI), is applied in the context of multizone sound reproduction in reverberant environments. The IDI approach aims at iteratively solving a least-squares problem by considering the true reproduction error in the time domain rather than narrowband errors in the frequency domain. In this paper, a spatio-temporal weighting function, that can be tailored to the problem at hand, is applied to the reproduction error in order to flexibly control the behavior of the designed rendering system, where a constraint is imposed on the broadband energy of the loudspeaker prefilters. The efficacy of the proposed spatio-temporal weighting is verified experimentally. In particular, it is shown that the acoustic contrast between the local listening areas can be significantly increased by allowing for a certain amount of reverberation in the bright zone. At the same time, the accuracy of the reproduced wave front in the bright zone remains unaffected, and undesired pre-echoes occurring prior to the desired wave front can be reduced.
- Amplitude Engineering for Beamformers with Self-Bending Directivity Based on Convex Optimization
-
Arrays producing self-bending beams have recently been proposed in the literature. The self-bending property of the beam is achieved by matching the phase profile applied to the array elements to a self-bending wave field, a process termed phase engineering. It has been unclear how the optimal amplitude profile should be chosen, as the amplitude distribution of a self-bending wave field is difficult to determine; previous works employed educated guesses. In this paper, we apply convex optimization to perform amplitude engineering. That is, we complement phase engineering by determining the purely real amplitude weights that minimize the norm of the amplitude weights for a given maximum beam amplitude in the dark zone around which the beam bends. We show that phase engineering by itself does not narrow down the solution space sufficiently, so that the choice of control points in the dark zone has a significant impact on how well the desired self-bending property forms.
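A hedged sketch of how such an amplitude-engineering problem could be posed with CVXPY, assuming the element phases are already fixed by phase engineering; the unit-response constraint at a reference steering vector `g_ref` is an added assumption here to exclude the trivial all-zero solution, and all names are illustrative rather than the paper's formulation.

```python
import numpy as np
import cvxpy as cp

def amplitude_engineering(phases, G_dark, g_ref, eps=1e-2):
    """Find nonnegative element amplitudes of minimum norm such that the field
    magnitude at every dark-zone control point (rows of G_dark) stays below eps,
    with the element phases fixed beforehand (sketch only)."""
    a = cp.Variable(len(phases), nonneg=True)
    w = cp.multiply(np.exp(1j * phases), a)              # complex weights, fixed phase profile
    constraints = [cp.abs(G_dark @ w) <= eps,            # dark-zone amplitude bound
                   cp.real(g_ref @ w) >= 1.0]            # keep a usable main response (assumption)
    cp.Problem(cp.Minimize(cp.norm(a, 2)), constraints).solve()
    return a.value
```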
- Directional Source Modeling in Wave-based Room Acoustics Simulation
-
Wave-based modeling in room acoustics and virtualisation applications constitutes an alternative to geometric or ray-based approaches; finite difference time domain (FDTD) methods, defined over regular grids, are an excellent match to parallel architectures. Acoustic sources are typically included as monopoles or collections of monopoles, realised, in an FDTD setting, as a forcing of a single grid point. This paper is concerned with more general representations of multipole point sources, based on spatially bandlimited approximations to the Dirac delta function in 3D. As such, it becomes possible to incorporate, in a flexible manner, multipole sources without regard to direction or alignment with an underlying grid. Numerical results are presented in the case of the acoustic monopole, dipole and quadrupole.
- Frequency Domain Singular Value Decomposition for Efficient Spatial Audio Coding
-
Advances in virtual reality have generated substantial interest in accurately reproducing and storing spatial audio in the higher order ambisonics (HOA) representation, given its rendering flexibility. Recent standardization for HOA compression adopted a framework wherein HOA data are decomposed into principal components that are then encoded by standard audio coding, i.e., frequency domain quantization and entropy coding to exploit psychoacoustic redundancy. A noted shortcoming of this approach is the occasional mismatch in principal components across blocks, and the resulting suboptimal transitions in the data fed to the audio coder. Instead, we propose a framework where singular value decomposition (SVD) is performed after transformation to the frequency domain via the modified discrete cosine transform (MDCT). This framework not only ensures smooth transition across blocks, but also enables frequency dependent SVD for better energy compaction. Moreover, we introduce a novel noise substitution technique to compensate for suppressed ambient energy in discarded higher order ambisonics channels, which significantly enhances the perceptual quality of the reconstructed HOA signal. Objective and subjective evaluation results provide evidence for the effectiveness of the proposed framework in terms of both higher compression gains and better perceptual quality, compared to existing methods.
- Blind Microphone Geometry Calibration Using One Reverberant Speech Event
-
A variant of the EM algorithm is employed in order to estimate separate speech and reverberation power spectral densities (PSDs). By matching the spatial coherence of the latter to theoretical models, the pairwise microphone distances are estimated, from which the overall geometry is computed. Simulations and lab recordings are used to show that the proposed method outperforms the related diffuse noise approach.
- Broadband DOA Estimation Using Convolutional Neural Networks Trained with Noise Signals
-
A convolutional neural network (CNN) based supervised learning method for broadband DOA estimation is proposed, where the phase component of the short-time Fourier transform coefficients of the received microphone signals is directly fed into the CNN and the features required for DOA estimation are learnt during training. Since only the phase component of the input is used, the CNN can be trained with synthesized noise signals, thereby making the preparation of the training data set easier compared to using speech signals. Through experimental evaluation, the ability of the proposed noise-trained CNN framework to generalize to speech sources is demonstrated. In addition, the robustness of the system to noise and small perturbations in microphone positions, as well as its ability to adapt to different acoustic conditions, is investigated using experiments with simulated and real data.
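To make the noise-training idea concrete, here is a small, assumption-heavy sketch of how one training example could be synthesized for a uniform linear array (array geometry, frame count, and all names are placeholders; the paper's actual data generation and network are not reproduced).

```python
import numpy as np

def synth_phase_example(doa_deg, n_mics=4, spacing=0.08, c=343.0, fs=16000, n_fft=256, n_frames=32):
    """Synthesize one far-field white-noise training example for a uniform
    linear array and return the STFT phase map (mics x frames x freq) that
    would be fed to the DOA CNN, labelled with doa_deg."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    delays = spacing * np.arange(n_mics) * np.cos(np.deg2rad(doa_deg)) / c   # DOA measured from array axis
    noise = np.fft.rfft(np.random.randn(n_frames, n_fft), axis=-1)           # white-noise source frames
    steering = np.exp(-2j * np.pi * freqs[None, None, :] * delays[:, None, None])
    stft = noise[None, :, :] * steering                                      # delayed copies per mic
    return np.angle(stft).astype(np.float32)
```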
Monday, October 16, 18:15 – 20:00
Dinner
Room: West Dining Room
Monday, October 16, 20:00 – 22:00
Cocktails (kindly supported by mh acoustics)
Room: West Dining Room
Tuesday, October 17
Tuesday, October 17, 07:00 – 08:00
Breakfast
Room: West Dining Room
Tuesday, October 17, 08:00 – 08:50
K2: Keynote Talk by Augusto Sarti
Room: Conference House
Tuesday, October 17, 08:50 – 10:10
L3: Audio and Music Signal Processing
Lecture 3
Room: Conference House
- Antiderivative Antialiasing, Lagrange Interpolation and Spectral Flatness
-
Aliasing is a major problem in any audio signal processing chain involving nonlinearity. The usual approach to antialiasing involves operation at an oversampled rate, usually 4 to 8 times the audio sample rate. Recently, a new approach to antialiasing in the case of memoryless nonlinearities has been proposed, which relies on operations over the antiderivative of the nonlinear function, and which allows for antialiasing at audio or near-audio rates, regardless of the particular form of the nonlinearity (i.e., polynomial or hard clipping). Such techniques may be deduced through an application of Lagrange interpolation over unequally spaced values and, furthermore, may be constrained to behave as spectrally transparent “throughs” for nonlinearities which reduce to linear behaviour at low signal amplitudes. Numerical results are presented.
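As a concrete instance of the antiderivative idea (the standard first-order scheme for a tanh nonlinearity, not the paper's Lagrange-based generalization), the sketch below replaces each output sample with the divided difference of the antiderivative F(x) = log cosh(x) between consecutive inputs.

```python
import numpy as np

def adaa_tanh(x, eps=1e-9):
    """First-order antiderivative antialiasing for y = tanh(x): the output is
    (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1]) with F(x) = log(cosh(x)), falling
    back to tanh at the midpoint when consecutive samples are nearly equal."""
    x = np.asarray(x, dtype=float)
    F = np.logaddexp(x, -x) - np.log(2.0)          # numerically stable log(cosh(x))
    y = np.empty_like(x)
    y[0] = np.tanh(x[0])
    dx = np.diff(x)
    safe = np.abs(dx) > eps
    dd = (F[1:] - F[:-1]) / np.where(safe, dx, 1.0)
    y[1:] = np.where(safe, dd, np.tanh(0.5 * (x[1:] + x[:-1])))
    return y
```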
- An Augmented Lagrangian Method for Piano Transcription Using Equal Loudness Thresholding and LSTM-based Decoding
-
A central goal in automatic music transcription is to detect individual note events in music recordings. An important variant is informed music transcription, where methods can use calibration data for the instruments in use. However, despite the additional information, the latest results reported for neural-network-based methods have not exceeded an f-measure of ~80%. As a potential explanation, the informed transcription problem can be shown to be badly conditioned and thus relies on appropriate regularization. A recently proposed method employs a mixture of simple, convex regularizers (to stabilize the parameter estimation process) and more complex terms (to encourage more meaningful structure in the estimates). In this paper, we present two extensions to this method. First, we integrate equal loudness curves into the parameter estimation to better differentiate real from spurious note detections. Second, we employ (Bidirectional) Long Short-Term Memory networks to re-weight the likelihood of detected note constellations. Despite their simplicity, our two extensions lead to a drop of about 35% in note error rate compared to the current state of the art.
- Towards End-to-end Polyphonic Music Transcription: Transforming Music Audio Directly to a Score
-
We present a neural network model that learns to produce music scores directly from audio signals. Instead of employing commonplace processing steps, such as frequency transform front-ends or temporal pitch smoothing, we show that a neural network can learn such steps on its own when presented with the appropriate training data. We show how such a network can perform monophonic transcription with high accuracy, and how it also generalizes well to transcribing polyphonic passages.
- A Note on the Implementation of Audio Processing by Short-Term Fourier Transform
-
The short-term Fourier transform (STFT) forms the backbone of a great deal of modern digital audio processing. A number of published implementations of this process exhibit time-aliasing distortion. This paper reiterates the requirements for alias-free processing and offers a novel method of reducing aliasing.
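For reference, a generic weighted overlap-add analysis–synthesis pair in Python (square-root Hann windows); this is the standard framework the paper builds on, not its proposed method, and avoiding time-aliasing additionally requires enough zero-padding for whatever spectral modification is applied between the two steps.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Analysis: square-root Hann windowed frames, hop = n_fft/4."""
    win = np.sqrt(np.hanning(n_fft))
    frames = np.array([win * x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)])
    return np.fft.rfft(frames, axis=-1)

def istft(X, n_fft=1024, hop=256):
    """Synthesis: weighted overlap-add with the same square-root Hann window."""
    win = np.sqrt(np.hanning(n_fft))
    y = np.zeros((X.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(y)
    for i, frame in enumerate(np.fft.irfft(X, n=n_fft, axis=-1)):
        y[i * hop:i * hop + n_fft] += win * frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    return y / np.maximum(norm, 1e-12)
```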
Tuesday, October 17, 10:30 – 12:30
P2: Array Processing
Poster 2
Room: Parlor
- Colouration in 2.5D Local Wave Field Synthesis Using Spatial Bandwidth-Limitation
-
Sound Field Synthesis techniques, such as Wave Field Synthesis, aim at a physically accurate reproduction of a desired sound field inside an extended listening area. This area is surrounded by loudspeakers individually driven by their respective driving signals. Due to practical limitations, artefacts impair the synthesis accuracy, resulting in a perceivable change in timbre compared to the desired sound field. Recently, an approach for so-called Local Wave Field Synthesis was published which enhances the reproduction accuracy in a limited region by applying a spatial bandwidth limitation in the circular/spherical harmonics domain to the desired sound field. This paper reports on a listening experiment comparing conventional Sound Field Synthesis techniques with the mentioned approach. The influence of different parametrisations for Local Wave Field Synthesis is also investigated. The results show that the enhanced reproduction accuracy of Local Wave Field Synthesis leads to an improvement with regard to perceived colouration.
- Differential Microphone Arrays for the Underwater Acoustic Channel
-
The design of underwater acoustic sensing and communication systems is a very challenging task due to several channel effects such as multipath propagation and Doppler spread. In order to cope with these effects, beamforming techniques have been applied to the design of such systems. The broadband nature of acoustic systems motivates the use of beamformers with frequency-invariant beampatterns. Moreover, in some cases, these systems are limited by their physical dimensions. Differential microphone array (DMA) beamformers, which have been used extensively in recent years for broadband audio signals, may comply with these requirements. DMAs are small-size arrays which can provide almost frequency-invariant beampatterns and high directivity. In this paper, we present a pool experiment which shows the suitability of DMAs for the underwater acoustic channel. Additionally, we show how to compensate for array mismatch errors, leading to a much better performance level and more robust beamformers.
- Localization of Acoustic Sources in the Ray Space for Distributed Microphone Sensors
-
In this paper we propose a method for the localization of acoustic sources using small microphone arrays randomly placed in the acoustic scene. The presented approach is based on a ray-based plane wave decomposition performed locally at each microphone array, followed by a fusion of the obtained results in order to build a ray space image that is particularly advantageous for localization purposes.
- A Penalized Inequality-Constrained Minimum Variance Beamformer with Applications in Hearing Aids
-
A well-known challenge in beamforming is how to suppress multiple interferences when the degrees of freedom (DoF) provided by the array are fewer than the number of sources in the environment. In this paper, we propose a beamformer design to address this challenge. Specifically, we propose a min-max beamformer design that penalizes the maximum gain of the beamformer at any interfering direction. The new design can efficiently mitigate the total interference power regardless of whether the number of interfering sources is less than the array DoF or not. We formulate this min-max beamformer design as a convex second-order cone program (SOCP) and propose a low-complexity iterative algorithm based on the alternating direction method of multipliers (ADMM) to solve it. In simulations, we compare the proposed beamformer with the linearly constrained minimum variance (LCMV) beamformer and a recently proposed inequality-constrained minimum variance (ICMV) beamformer. The ability of the proposed beamformer to handle more interferences is demonstrated.
- Angular Spectrum Decomposition-Based 2.5D Higher-Order Spherical Harmonic Sound Field Synthesis with a Linear Loudspeaker Array
-
This paper derives an analytical driving function to synthesize an exterior sound field described by the spherical harmonic expansion coefficients using a linear loudspeaker array. An exterior sound field is decomposed into both the spherical harmonic expansion and helical wave spectrum coefficients. The spherical harmonic expansion coefficients are analytically converted into the helical wave spectrum ones by plane wave decomposition. The angular spectrum coefficients at the synthesis reference line are then analytically obtained from the converted helical wave spectrum coefficients and they can be directly synthesized by the spectral division method with a linear loudspeaker array. The results of computer simulations indicate the effectiveness of the proposed analytical formulation.
- Extended Sound Field Recording Using Position Information of Directional Sound Sources
-
We propose a method that can record extended sound fields generated by sound sources with arbitrary directivity. Sound field recordings given as a spherical harmonic expansion are valid within a region limited by the highest expansion order. In current methods, multiple spherical microphone arrays are used for recording sound fields over large regions. Previously, we proposed a method that takes advantage of approximate knowledge of the sound source position to synthesize the sound field anywhere in space from a single microphone array recording using a translation operator. This method assumes only omni-directional sound sources in the definition of the translation operator. The present research extends those results to sound sources with arbitrary radiation directivity patterns. A new translation operator is calculated, taking into account that complex sound sources can be approximated as a collection of multipoles. The results of numerical simulations and actual measurements show that the proposed multipole translation operator can effectively synthesize extended sound field information from a single spherical microphone array recording.
- Asymmetric Beampatterns with Circular Differential Microphone Arrays
-
Circular differential microphone arrays (CDMAs) facilitate compact superdirective beamformers whose beampatterns are nearly frequency invariant, and allow perfect steering for all azimuthal directions. Herein, we eliminate the inherent limitation of symmetric beampatterns associated with a linear geometry, and introduce an analytical asymmetric model for Nth-order CDMAs. We derive the theoretical asymmetric beampattern and develop the asymmetric supercardioid. In addition, an Nth-order CDMA design is presented based on the mean-squared-error (MSE) criterion. Experimental results show that the proposed model yields optimal performance in terms of white noise gain, directivity factor, and front-to-back ratio, as well as more flexible null placement for the interfering signals.
- Robust Phase Replication Method for Spatial Aliasing Problem in Multiple Sound Sources Localization
-
Most multichannel sound source direction-of-arrival (DOA) estimation algorithms suffer from spatial aliasing. The interchannel phase differences (IPDs) are wrapped beyond the spatial aliasing frequency. A real-time algorithm is described to solve the general IPD wrapping problem for both single-source and multi-source scenarios. The algorithm can be summarized in two steps: IPD replication and denoising. The first step replicates the obtained IPDs to all possible sinusoidal periods; this process emphasizes the correct DOAs of the sources when the replicated matrix is transformed into a histogram. The second step applies post-processing to suppress interference due to spatial aliasing in noisier environments. Theoretical analysis and experimental results demonstrate the robustness of our method.
- Speech Enhancement Using Extreme Learning Machines
-
The enhancement of speech degraded by the nonstationary noise types that typify real-world conditions has remained a challenging problem for several decades. However, the recent use of data-driven methods for this task has brought great performance improvements. In this paper, we develop a speech enhancement framework based on the extreme learning machine. Experimental results show that the proposed framework is effective in suppressing additive noise. Furthermore, it is consistently superior to a leading minimum mean square error (MMSE) algorithm in matched noise, and exceeds that algorithm's performance in mismatched noise at all but the highest signal-to-noise ratio (SNR) tested.
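Since the extreme learning machine itself is compact, a minimal regression sketch is given below (random hidden layer, ridge-regularized least-squares readout). How features and targets are defined for enhancement, for example noisy log-spectra mapped to clean ones, is left open, and all names are placeholders rather than the paper's framework.

```python
import numpy as np

class ELMRegressor:
    """Minimal extreme learning machine: a fixed random hidden layer followed by
    output weights solved in closed form via regularized least squares."""
    def __init__(self, n_hidden=512, reg=1e-3, seed=0):
        self.n_hidden, self.reg = n_hidden, reg
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.solve(H.T @ H + self.reg * np.eye(self.n_hidden), H.T @ Y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta
```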
- Continuous Measurement of Spatial Room Impulse Responses Using a Non-uniformly Moving Microphone
-
The spatial impulse responses at multiple receiver positions can be measured efficiently by using a moving microphone. While the system is continuously excited by a periodic perfect sequence, the response is captured by the microphone. The instantaneous impulse responses are then computed from the captured signal by using a time-varying system identification method. So far, mostly uniformly moving microphones have been considered. In this paper, the continuous measurement technique is extended to non-uniformly moving microphones. The system identification is performed by reconstructing the original sound field based on the microphone signal, which constitutes a spatio-temporal sampling of the sound field. The nonuniform sampling of the sound field is interpolated by using a Lagrange polynomial. The estimate of the corresponding impulse response is given as the cross-correlation of the interpolated sound field and the excitation signal. The proposed method is evaluated by numerical simulations in which the spatial room impulse responses on a circle are measured using a microphone with a fluctuating angular speed. The accuracy of the impulse responses is compared for varying interpolation orders.
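The Lagrange-interpolation step can be pictured with a small sketch (a global Lagrange polynomial over a handful of nonuniform samples; in practice a low local order over neighbouring samples would be used, and the names here are illustrative).

```python
import numpy as np

def lagrange_interp(t_nodes, x_nodes, t_query):
    """Evaluate the Lagrange polynomial through the nonuniform samples
    (t_nodes, x_nodes) at the query times t_query."""
    t_nodes = np.asarray(t_nodes, dtype=float)
    x_nodes = np.asarray(x_nodes, dtype=float)
    t_query = np.atleast_1d(np.asarray(t_query, dtype=float))
    result = np.zeros_like(t_query)
    for i, (ti, xi) in enumerate(zip(t_nodes, x_nodes)):
        basis = np.ones_like(t_query)
        for j, tj in enumerate(t_nodes):
            if j != i:
                basis *= (t_query - tj) / (ti - tj)   # Lagrange basis polynomial l_i(t)
        result += xi * basis
    return result
```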
- Incoherent Idempotent Ambisonics Rendering
-
Ambisonics provides a sound-field representation in a region around a point. The representation is accurate within a radius that is inversely proportional to frequency. Traditional rendering methods lead to unnatural behavior outside this radius, manifested as timbre changes and poor directionality. Some methods are additionally not idempotent. We describe a new family of rendering methods.
- Comparison of Reverberation Models for Sparse Sound Field Decomposition
-
Sparse representations of sound fields have become popular in various acoustic inverse problems. The simplest models assume spatial sparsity, where a small number of sound sources are located in the near-field. However, the performance of these models deteriorates in the presence of strong reverberation. To properly treat the reverberant components, we introduce three types of reverberation models: a low-rank model, a sparse model in the plane-wave domain, and a combined low-rank+sparse model. We discuss corresponding decomposition algorithms based on ADMM convex optimization. Numerical simulations indicate that the decomposition accuracy is significantly improved by the additive model of low-rank and sparse plane wave models.
- A DNN Regression Approach to Speech Enhancement by Artificial Bandwidth Extension
-
Artificial speech bandwidth extension (ABE) is an extremely effective means of speech enhancement at the receiver side of a narrowband telephony call. First approaches incorporating deep neural networks (DNNs) into the estimation of the upper-band speech representation have appeared. In this paper we propose a regression-based DNN ABE that is trained and tested on acoustically different speech databases, exceeding coded narrowband speech by a so-far unseen 1.37 CMOS points in a subjective listening test.
- Comparing Modeled and Measurement-Based Spherical Harmonic Encoding Filters for Spherical Microphone Arrays
-
Spherical microphone array processing is commonly performed in a spatial transform domain, due to theoretical and practical advantages related to sound field capture and beamformer design and control. Multichannel encoding filters are required to implement a discrete spherical harmonic transform and extrapolate the captured sound field coefficients from the array radius to the far field. These spherical harmonic encoding filters can be designed based on a theoretical array model or on measured array responses. Various methods for both design approaches are presented and compared, and differences between modeled and measurement-based filters are investigated. Furthermore, a flexible filter design approach is presented that combines the benefits of previous methods and is suitable for deriving both modeled and measurement-based filters.
- Multi-microphone Acoustic Echo Cancellation Using Relative Echo Transfer Functions
-
Modern hands-free communication devices, such as smart speakers, are equipped with several microphones, and one or more loudspeakers. Thus, the most straightforward solution to reduce the acoustic echoes is to apply acoustic echo cancellation (AEC) to each microphone. Due to computational complexity constraints, the implementation of such a solution may not be realizable. To overcome this problem, a method is proposed that uses a primary estimated echo signal, obtained using state-of-the-art AEC, to compute the remaining, or secondary, acoustic echoes. To do this, relative transfer functions between secondary and primary acoustic echo signals, here referred to as relative echo transfer functions (RETFs), are estimated and employed. In this work, the acoustic echo transfer functions (AETFs) and RETFs are modelled using convolutive transfer functions. Provided that the distance between microphones is small, the RETFs can be modeled using fewer partitions than the AETFs, which reduces the overall computational complexity.
- Noise Power Spectral Density Estimation for Binaural Noise Reduction Exploiting Direction of Arrival Estimates
-
For head-mounted assistive listening devices (e.g., hearing aids), algorithms that use the microphone signals from both the left and the right hearing device are considered to be promising techniques for noise reduction, because the spatial information captured by all microphones can be exploited.
- Ray Space Analysis with Sparse Recovery
-
This work explores integrating sparse recovery methods into the ray space transform. Sparse recovery methods have proven useful in microphone array analysis of sound fields. In particular, they can provide extremely accurate estimates of source direction in the presence of multiple simultaneous sources and noise. The ray space transform has recently emerged as a useful tool for analysing sound fields, particularly by robustly integrating information from multiple viewpoints. In this work, we present the results of numerical simulations for a linear microphone array that demonstrate the promising improvements obtained by integrating sparse recovery into the ray space transform.
- Distributed LCMV Beamforming: Considerations of Spatial Topology and Local Preprocessing
-
We consider a coherent desired source contaminated by coherent interferences, with additional background noise that is either spatially white or diffuse. We analyze the SNR improvement for a collocated array and for a spatially distributed array, assuming that the entire data is conveyed to a fusion center. Next, considering a distributed beamformer constructed as a global MVDR applied to the outputs of local LCMVs, we analyze the SNR improvement and compare it to the SNR improvement obtained using all inputs at a fusion center.
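For readers unfamiliar with the building block, a short sketch of the standard (centralized) LCMV weight computation follows: the textbook closed form, not the distributed scheme analyzed in the paper.

```python
import numpy as np

def lcmv_weights(R, C, f):
    """Standard LCMV beamformer: w = R^{-1} C (C^H R^{-1} C)^{-1} f, where R is
    the noise(-plus-interference) covariance, C the constraint steering matrix,
    and f the desired responses (e.g. f = [1, 0, ..., 0] for one target and nulls)."""
    RinvC = np.linalg.solve(R, C)
    return RinvC @ np.linalg.solve(C.conj().T @ RinvC, f)
```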
- Performance Analysis of a Planar Microphone Array for Three Dimensional Soundfield Analysis
-
Soundfield analysis based on spherical harmonic decomposition has been widely used in various applications; however, a drawback is the three-dimensional geometry of the microphone arrays. Recently, the design of a two-dimensional planar microphone array capable of capturing three-dimensional (3D) spatial soundfields was proposed. This design utilizes omni-directional and first-order microphones to capture soundfield components that are undetectable to conventional planar omni-directional microphone arrays, thus providing the same functionality as 3D arrays designed for the same purpose. In this paper, we discuss the implementation and performance analysis of the above design with real audio recordings. We record 3D spatial soundfields with the newly developed microphone array and study its performance in soundfield analysis methods such as direction-of-arrival estimation and source separation. The performance of the array is also compared with a commercially available spherical microphone array.
Tuesday, October 17, 12:30 – 16:00
Lunch/Afternoon Break
Room: West Dining Room
Tuesday, October 17, 16:00 – 18:00
L4: Source Separation
Lecture 4
Room: Conference House
- Deep Recurrent NMF for Speech Separation by Unfolding Iterative Thresholding
-
In this paper, we propose a novel recurrent neural network architecture for speech separation. This architecture is constructed by unfolding the iterations of the sequential iterative soft-thresholding algorithm (ISTA) that solves the optimization problem for sparse nonnegative matrix factorization (NMF) of spectrograms. We name this network architecture deep recurrent NMF (DR-NMF). The proposed DR-NMF network has three distinct advantages. First, DR-NMF provides better interpretability than other deep architectures, since the weights correspond to NMF model parameters, even after training. This interpretability also provides principled initializations that enable faster training and convergence to better solutions compared to conventional random initialization. Second, like many deep networks, DR-NMF is an order of magnitude faster at test time than NMF, since computation of the network output only requires evaluating a few layers at each time step. Third, when a limited amount of training data is available, DR-NMF exhibits stronger generalization and separation performance compared to sparse NMF and state-of-the-art long short-term memory (LSTM) networks. When a large amount of training data is available, DR-NMF achieves lower yet competitive separation performance compared to LSTM networks.
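The iteration that DR-NMF unfolds can be sketched in a few lines (generic ISTA for sparse nonnegative activations with a fixed dictionary; the network's learned per-layer parameters are not reproduced here and the names are placeholders).

```python
import numpy as np

def ista_nmf_activations(V, W, n_iter=50, lam=0.1):
    """ISTA for sparse nonnegative activations H: minimize
    0.5 * ||V - W H||_F^2 + lam * ||H||_1 subject to H >= 0."""
    L = np.linalg.norm(W, 2) ** 2                       # Lipschitz constant of the gradient
    H = np.zeros((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        grad = W.T @ (W @ H - V)
        H = np.maximum(H - (grad + lam) / L, 0.0)       # gradient step + nonnegative soft threshold
    return H
```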
- Lévy NMF for Robust Nonnegative Source Separation
-
Source separation, which consists of decomposing data into meaningful structured components, is an active research topic in many fields including music signal processing. In this paper, we introduce the Positive alpha-stable (PaS) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to model random variables that are both nonnegative and impulsive. Considering the Lévy distribution, the only PaS distribution whose density is tractable, we propose a mixture model called Lévy Nonnegative Matrix Factorization (Lévy NMF). This model accounts for low-rank structures in nonnegative data that possibly has high variability or is corrupted by very adverse noise. The model parameters are estimated in a maximum-likelihood sense. We also derive an estimator of the sources, which extends the validity of Wiener filtering to the PaS case. Experiments on synthetic data and realistic music signals show that Lévy NMF compares favorably with state-of-the-art techniques in terms of robustness to impulsive noise and highlight its potential for decomposing nonnegative data.
- Separating Time-Frequency Sources from Time-Domain Convolutive Mixtures Using Non-negative Matrix Factorization
-
This paper addresses the problem of under-determined audio source separation in multichannel reverberant mixtures. We target a semi-blind scenario assuming that the mixing filters are known. Source separation is performed from the time-domain mixture signals in order to accurately model the convolutive mixing process. The source signals are however modeled as latent variables in a time-frequency domain. In a previous paper we proposed to use the modified discrete cosine transform. The present paper generalizes the method to the use of the odd-frequency short-time Fourier transform. In this domain, the source coefficients are modeled as centered complex Gaussian random variables whose variances are structured by means of a non-negative matrix factorization model. The inference procedure relies on a variational expectation-maximization algorithm. In the experiments we discuss the choice of the source representation and we show that the proposed approach outperforms two methods from the literature.
- Consistent Anisotropic Wiener Filtering for Audio Source Separation
-
For audio source separation applications, it is common to apply a Wiener-like filtering to a Time-Frequency (TF) representation of the data, such as the Short-Time Fourier Transform (STFT). This approach, which boils down to assigning the phase of the original mixture to each component, is limited when sources overlap in the TF domain. In this paper, we propose a more sophisticated version of this technique for improved phase recovery. First, we model the sources by anisotropic Gaussian variables, which accounts for a phase property that originates from a sinusoidal model. Then, we exploit the STFT consistency, which is the relationship between STFT coefficients that is due to its redundancy. We derive a conjugate gradient algorithm for estimating the corresponding filter, called Consistent Anisotropic Wiener. Experiments conducted on music pieces show that accounting for those two phase properties outperforms each approach taken separately.
- Predicting Algorithm Efficacy for Adaptive Multi-Cue Source Separation
-
Audio source separation is the process of decomposing a signal containing sounds from multiple sources into a set of signals, each from a single source. Source separation algorithms typically leverage assumptions about correlations between audio signal characteristics (“cues”) and the audio sources or mixing parameters, and exploit these to do separation. We train a neural network to predict quality of source separation, as measured by Signal to Distortion Ratio, or SDR. We do this for three source separation algorithms, each leveraging a different cue – repetition, spatialization, and harmonicity/pitch proximity. Our model estimates separation quality using only the original audio mixture and separated source output by an algorithm. These estimates are reliable enough to be used to guide switching between algorithms as cues vary. Our approach for separation quality prediction can be generalized to arbitrary source separation algorithms.
- The Selection of Spectral Magnitude Exponents for Separating Two Sources is Dominated by Phase Distribution Not Magnitude Distribution
-
Separating an acoustic signal into desired and undesired components is an important and well-established problem. It is commonly addressed by decomposing spectral magnitudes after exponentiation and the choice of exponent has been studied from numerous perspectives. We present this exponent selection problem as an approximation to the actual underlying trigonometric situation. This approach makes apparent numerous basic facts and some of these have been ignored or violated in other exponent selection efforts. We show that exponent selection is dominated by the phase distribution and that magnitude distributions have almost no influence. We also show that exponents can be much more effectively selected in the estimated-source domain, rather than in the domain of the combined sources. Finally we describe the mechanism that causes exponents slightly above 1.0 to be preferred in many cases, completely independent of source distributions.
Tuesday, October 17, 18:15 – 20:00
Dinner
Room: West Dining Room
Tuesday, October 17, 20:00 – 22:00
Demonstrations & Cocktails
Room: West Dining Room
- Dolby’s Binaural Rendering and Mastering Unit
-
As consumers listen to more and more content over headphones, and with the rise of virtual reality (VR), there comes a need for efficient and intuitive authoring tools for spatial audio content on headphones. The Dolby Rendering and Mastering Unit (RMU) was originally built for authoring object-based immersive cinematic content in Dolby Atmos. The experience gained from working with studios to create over 100 cinematic titles was combined with Dolby’s knowledge in headphone rendering technology to create the Binaural RMU (B-RMU). The B-RMU is a prototype platform enabling the creation of immersive content for consumption over headphones. This live demo showcases the B-RMU’s integration with the Avid Pro Tools mixing software for binaural monitoring and content authoring.
- Realtime Binaural Sound Enhancement with Low Complexity and Reduced Latency
-
The theme of this demonstration is binaural sound acquisition and reproduction with built-in noise reduction for hearing assistance. The example algorithm under consideration relies on two-channel minimum-eigenvector adaptive blocking to obtain a spectral noise reference, and on a cue-preserving MMSE filter for enhancing the input signal. In order to verify the computational efficiency of the sound processing, the demo uses a single-chip (ARM Cortex-A7) ‘Raspberry Pi’ target computer with Linux OS, a low-cost external USB/ALSA sound interface, and Matlab/Simulink support on a host PC. The approach turns out to be very convenient in terms of coding, deployment, and wireless control of the algorithm from the host, while the audio I/O and algorithm execution on the target are ultimately independent of the host. The sound demo is encouraging in terms of signal enhancement for the hearing impaired and spatial-sound navigation in realistic acoustic environments, including the effects of ambient noise reduction and dereverberation. The user’s own-voice perception, however, requires continued effort to reduce the audio latency of the buffered soft-realtime system. Our demonstration hence explores upgrades to hardware, sound drivers, and algorithm parameters in order to approach the actual requirements of the application in terms of low latency and complexity.
- Real-time Acoustic Event Detection System Based on Non-Negative Matrix Factorization
-
A real-time acoustic event detection system based on non-negative matrix factorization is presented. The system consists of two functions: “training”, to design classifiers from recorded data, and “detection”, to classify recorded sound into acoustic events. The system is implemented on a standard laptop PC and detects acoustic events in real time. On-site training/detection of acoustic events is performed in addition to detection using pre-trained classifiers (bell ring, small drum).
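As a rough illustration of the detection side of such a system (a generic NMF formulation, not the authors' implementation), the sketch below estimates activations of pre-trained per-class dictionaries and thresholds their per-frame energy. Function names and the threshold are placeholders.

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-12):
    """Estimate activations H for a magnitude spectrogram V given a fixed,
    pre-trained dictionary W (multiplicative updates, Euclidean cost, V ~ W @ H)."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def detect_events(V, class_dicts, threshold=0.5):
    """Score each frame against every class dictionary (e.g. 'bell ring',
    'small drum') and threshold the per-frame activation energy."""
    return {label: nmf_activations(V, W).sum(axis=0) > threshold
            for label, W in class_dicts.items()}
```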
- B360 – a B-format Microphone Array for Small Mobile Devices
- DCASE 2017: Detection and Classification of Acoustic Scenes and Events Challenge Summary
-
Computational analysis of acoustic scenes and events has recently gained significant research attention and has numerous potential applications. The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge was organized to compare different methods using publicly available datasets, common metrics, and evaluation procedures. The Challenge consisted of four tasks: Acoustic scene classification, Detection of rare sound events, Sound event detection in real life audio, and Large-scale weakly supervised sound event detection for smart cars. This presentation will give an overview of the setup of the tasks, including the problem to be solved, the datasets used, the evaluation procedure, and the metrics. It will also summarize the results of each task and give a short analysis of the submitted systems. Audio samples will be available to demonstrate the data used.
- LOCATA 2018: Acoustic Source Localization and Tracking Challenge
-
This demonstration will announce the launch and data release of the IEEE AASP challenge on acoustic source LOCalization And TrAcking (LOCATA). LOCATA aims at providing researchers in source localization and tracking with the opportunity to objectively benchmark results against state-of-the-art algorithms using a common, publicly released data corpus. The LOCATA data corpus encompasses different realistic scenarios in an enclosed acoustic environment with an emphasis on dynamic scenarios. The challenge consists of six tasks, ranging from localization of a single, static loudspeaker using static microphone arrays to multiple, moving human talkers using moving microphone arrays. All recordings contained in the corpus were made in a reverberant acoustic environment in the presence of ambient noise. The ground truth data for the positions of the sensors and sources were determined by an optical tracking system. Four different acoustic sensor configurations were considered for the recordings: hearing aids on a dummy head, a 15-channel linear array, a 32-channel spherical array, and a 12-channel pseudo-spherical array of a robot head. This demonstration will provide an overview of the LOCATA challenge, and will showcase the use of the data corpus for participation and algorithm evaluation.
Room: Mountain View Room
- Immersive Listening Session
-
The demonstration consists of an immersive listening session of a performance by a string quartet in a circular configuration. The quartet is the famous “Quartetto di Cremona”, playing four Stradivarius instruments in the Arvedi Auditorium of Cremona.
Wednesday, October 18
Wednesday, October 18, 07:00 – 08:00
Breakfast
Room: West Dining Room
Wednesday, October 18, 08:00 – 08:50
K3: Keynote Talk by Mark Plumbley
Room: Conference House
Wednesday, October 18, 08:50 – 10:10
L5: Signal Enhancement
Lecture 5
Room: Conference House
- Low Complexity Kalman Filter for Multi-Channel Linear Prediction Based Blind Speech Dereverberation
-
Multi-channel linear prediction (MCLP) has been shown to be a suitable framework for tackling the problem of blind speech dereverberation. In recent years, a number of adaptive MCLP algorithms have been proposed, whereby the majority is based on the short-time Fourier transform (STFT) domain representation of the dereverberation problem. In this paper, we focus on the STFT-based Kalman filter solution to the adaptive MCLP task. Like all other available adaptive MCLP algorithms operating in the STFT domain, the Kalman filter exhibits a quadratic computational cost in the number of filter coefficients per frequency bin. Aiming at a reduced complexity, we propose a simplification to the Kalman filter solution, leading to a linear cost instead. Further, we apply a Wiener gain spectral post-processor subsequent to MCLP, which is designed from readily available power spectral density (PSD) estimates. The convergence behavior of the state-of-the-art and the complexity-reduced algorithm is evaluated by means of two objective measures, perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI), showing only minor performance degradation for the latter, and hence justifying the simplification.
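For orientation, the sketch below shows a much-simplified batch MCLP dereverberation for a single channel and a single frequency bin (a WPE-style iteration under standard MCLP assumptions); it is not the paper's adaptive Kalman filter, nor its linear-cost simplification.

```python
import numpy as np

def mclp_dereverb_bin(x, order=10, delay=3, n_iter=3, eps=1e-8):
    """Batch MCLP dereverberation of one STFT bin: late reverberation is
    linearly predicted from delayed past frames and subtracted.
    x : complex STFT coefficients of one bin, shape (n_frames,)."""
    n = len(x)
    Y = np.zeros((n, order), dtype=complex)        # matrix of delayed past frames
    for k in range(order):
        Y[delay + k:, k] = x[:n - delay - k]
    d = x.copy()
    for _ in range(n_iter):
        lam = np.maximum(np.abs(d) ** 2, eps)      # time-varying source PSD estimate
        Yw = Y / lam[:, None]
        g = np.linalg.solve(Yw.conj().T @ Y + eps * np.eye(order),
                            Yw.conj().T @ x)       # linear prediction filter
        d = x - Y @ g                              # dereverberated coefficients
    return d
```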
- Dynamic Range Compression for Noisy Mixtures Using Source Separation and Beamforming
-
Dynamic range compression is widely used in digital hearing aids, but it performs poorly in noisy conditions with multiple sources. We propose a system that combines source separation, compression, and beamforming to compress each source independently. We derive an expression for a speech distortion weighted multichannel Wiener filter that performs both beamforming and compression. Experiments using recorded speech and behind-the-ear hearing aid impulse responses suggest that the combined system provides more accurate dynamic range compression than a conventional compressor in the presence of competing speech and background noise.
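As background on the compression half of the system, here is a minimal single-band dynamic range compressor (level detection with an attack/release envelope follower followed by a static gain curve). Conceptually, one such stage would act on each separated source before remixing; the paper itself folds compression into a speech distortion weighted multichannel Wiener filter, which this sketch does not attempt to reproduce.

```python
import numpy as np

def compress(x, fs, threshold_db=-30.0, ratio=3.0, attack_ms=5.0, release_ms=50.0):
    """Minimal single-band dynamic range compressor."""
    a_att = np.exp(-1.0 / (fs * attack_ms * 1e-3))
    a_rel = np.exp(-1.0 / (fs * release_ms * 1e-3))
    env, y = 0.0, np.empty(len(x), dtype=float)
    for n, s in enumerate(x):
        level = abs(s)
        a = a_att if level > env else a_rel
        env = a * env + (1.0 - a) * level                 # smoothed level estimate
        level_db = 20.0 * np.log10(max(env, 1e-9))
        gain_db = -max(level_db - threshold_db, 0.0) * (1.0 - 1.0 / ratio)
        y[n] = s * 10.0 ** (gain_db / 20.0)               # static compression curve
    return y
```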
- Amplitude and Phase Dereverberation of Harmonic Signals
-
While most dereverberation methods focus on how to estimate the magnitude of an anechoic signal in the time-frequency domain, we propose a method which also takes the phase into account. By applying a harmonic model to the anechoic signal, we derive a formulation to compute the amplitude and phase of each harmonic. These parameters are then estimated by our method in the presence of reverberation. As we jointly estimate the amplitude and phase of the clean signal, we achieve very strong dereverberation, resulting in a significant improvement of standard dereverberation objective measures over the state-of-the-art.
- Audio Soft Declipping Based on Weighted L1-Norm
-
This paper addresses the problem of soft clipping in audio and speech signals, in which the distortion can be modeled as a memoryless polynomial. Recent proposals have shown that the sparsity of the original signal can be exploited in order to perform blind compensation of the nonlinear distortion. In this paper, we introduce a weighted $L_1$-norm objective function that captures both the sparsity and the spectrum profile of the original signal. In our proposal, the weights are calculated from the distorted signal, which renders the method robust to different signal characteristics. Our proposal achieves a substantial gain in quality in comparison to recent sparsity-based soft declipping approaches.
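A toy illustration of the weighted-L1 idea, assuming a DCT sparsity basis and an ISTA-style solver; neither assumption is claimed to match the authors' method, and the weight vector is left as an input (the paper derives it from the distorted signal).

```python
import numpy as np
from scipy.fft import dct, idct

def declip_weighted_l1(y, reliable, weights, lam=0.05, n_iter=200):
    """Toy weighted-L1 declipping: soft-threshold DCT coefficients with
    per-coefficient weights, then re-impose the reliable (unclipped) samples."""
    x = y.copy()
    for _ in range(n_iter):
        c = dct(x, norm='ortho')
        c = np.sign(c) * np.maximum(np.abs(c) - lam * weights, 0.0)  # weighted soft threshold
        x = idct(c, norm='ortho')
        x[reliable] = y[reliable]             # keep trusted samples untouched
    return x
```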
Wednesday, October 18, 10:30 – 12:30
P3: Music, Audio and Speech Processing
Poster 3
Room: Parlor
- IMINET: Convolutional Semi-Siamese Networks for Sound Search by Vocal Imitation
-
Searching for sounds by text labels is often difficult, as text labels cannot provide sufficient information about the sound content. Previously we proposed to use vocal imitation as a sound search query. Vocal imitation is widely used in everyday communication and can be applied to novel human-computer interactions. In this paper, we further propose a novel architecture called IMINET to improve the search performance. IMINET is a Convolutional Siamese Network (CSN) that takes both the vocal imitation and the real sound as inputs to its two towers with partially untied weights. After automatic feature extraction by the convolutional layers within each tower, the two feature representations are merged and followed by fully connected networks for metric learning. The CSN is pre-trained using training vocal imitations and corresponding sound recordings not in the library to automatically learn feature and metric representations. Experiments show that IMINET outperforms a previously proposed IMISOUND system using a Stacked Auto-Encoder (SAE) for feature extraction and the combination of Kullback-Leibler (K-L) divergence with Dynamic Time Warping (DTW) for metric learning. They also show that IMINET's sound retrieval performance is improved by data augmentation of the original dataset.
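The sketch below shows the general shape of such a two-tower convolutional matcher with untied weights and a fully connected metric head. Layer sizes, pooling choices, and the class name are placeholders, not the IMINET configuration.

```python
import torch
import torch.nn as nn

class TwoTowerMatcher(nn.Module):
    """Illustrative 'semi-Siamese' matcher: separate convolutional encoders
    for the imitation and the recording, concatenated features scored by
    fully connected metric layers."""
    def __init__(self):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.imitation_tower = tower()
        self.recording_tower = tower()        # untied weights: a separate instance
        self.metric = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                                    nn.Linear(64, 1))

    def forward(self, imitation, recording):  # (batch, 1, mel_bins, frames)
        z = torch.cat([self.imitation_tower(imitation),
                       self.recording_tower(recording)], dim=1)
        return torch.sigmoid(self.metric(z))  # similarity score in (0, 1)
```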
- Leveraging Repetition to Do Audio Imputation
-
In this work we propose an imputation method that leverages repeating structures in audio, which are a common element in music. This work is inspired by the REpeating Pattern Extraction Technique (REPET), which is a blind audio source separation algorithm designed to separate repeating “background” elements from non-repeating “foreground” elements. Here, as in REPET, we construct a model of the repeating structures by overlaying frames and calculating a median value for each time-frequency bin within the repeating period. Instead of using this model to do separation, we show how this median model can be used to impute missing time-frequency values. This method requires no pre-training and can impute in scenarios where missing or corrupt frames span the entire audio spectrum. Human evaluation results show that this method produces higher quality imputation than existing methods in signals with a high amount of repetition.
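The median model itself is simple to state; a minimal numpy sketch, assuming the repeating period (in frames) is known and missing bins are marked as NaN:

```python
import numpy as np

def impute_with_median_model(S, period):
    """Fill missing (NaN) bins of a magnitude spectrogram S (freq x frames)
    using a median model of its repeating structure."""
    n_freq, n_frames = S.shape
    n_rep = int(np.ceil(n_frames / period))
    padded = np.pad(S, ((0, 0), (0, n_rep * period - n_frames)),
                    constant_values=np.nan)
    stacked = padded.reshape(n_freq, n_rep, period)
    model = np.nanmedian(stacked, axis=1)              # median over repetitions
    model_full = np.tile(model, (1, n_rep))[:, :n_frames]
    return np.where(np.isnan(S), model_full, S)        # impute only missing bins
```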
- A Kalman-Based Fundamental Frequency Estimation Algorithm
-
Fundamental frequency estimation is an important task in speech and audio analysis. Harmonic model-based methods typically have superior estimation accuracy. However, such methods usually assume that the fundamental frequency and amplitudes are stationary over a short time frame. In this paper, we propose a Kalman filter-based fundamental frequency estimation algorithm using the harmonic model, where the fundamental frequency and amplitudes can be truly nonstationary by modeling their time variations as first-order Markov chains. The Kalman observation equation is derived from the harmonic model and formulated as a compact nonlinear matrix form, which is further used to derive an extended Kalman filter. Detailed and continuous fundamental frequency and amplitude estimates for speech, the sustained vowel /a/ and solo musical tones with vibrato are demonstrated.
- Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study
-
Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.
- Speech Coding with Transform Domain Prediction
-
We show how model based prediction can be employed in the construction of a speech codec which operates entirely in the frequency domain of a Modified Discrete Cosine Transform (MDCT). The codec tools described in this paper are part of the Dolby AC-4 system standardized by ETSI and ATSC 3.0.
- Voice Conversion Based on a Mixture Density Network
-
This paper presents a new voice conversion (VC) algorithm based on a Mixture Density Network (MDN). An MDN is the combination of a Gaussian Mixture Model (GMM) and an Artificial Neural Network (ANN), where the parameters of the GMM are estimated using the ANN instead of the Expectation-Maximization (EM) algorithm. This characteristic helps the MDN estimate the GMM parameters more accurately, which results in lower distortion in the converted speech. To apply the MDN to VC, we combine the MDN with Maximum Likelihood Estimation employing a Global Variance modification (MLE-GV). Objective results show better performance for the proposed MDN method compared with the MLE and Joint Density GMM (JDGMM) methods. Subjective experiments demonstrate that the proposed method outperforms MLE-GV and JDGMM-GV in terms of speech quality and speaker individuality.
- Source Rendering on Dynamic Audio Displays
-
In this paper we describe the “audio display” – a sound reproduction device that employs the bending vibrations of a panel, such as a display screen or projection surface, to generate sound waves. We have demonstrated that it is possible to create spatially localized sound sources anywhere on the surface of a panel by using both an array of force actuators to selectively control the panel’s vibrational bending modes, and the principles of modal superposition. A filter is designed for each actuator that maintains the modal weights needed to reconstruct the source region based upon the resonant properties of the panel and the audio signal. The modal weights can be determined for a target vibration region via Fourier decomposition. The capabilities of the audio display may be greatly enhanced by enabling real-time motion of localized audio sources throughout the surface of the display. A computationally efficient method for doing this is described in this paper. Laser vibrometer measurements on a prototype panel show that this method can be used to effectively move sound-sources to new spatial locations on the surface of the display, giving a dynamic aspect to the audio display that may be applied for audio/image pairing, in which sound sources are aligned with their corresponding visual images.
- Zero-Delay Large Signal Convolution Using Multiple Processor Architectures
-
Zero-latency convolution typically uses the direct-form approach, requiring a large amount of computational resources for every additional sample in the impulse response. A number of methods have been developed to reduce the computational cost of very large signal convolution; however, these all introduce latency into the system. In some scenarios this is not acceptable and must be removed. Modern computer systems contain multiple processor architectures, each with its own strengths and weaknesses for the purpose of convolution. This paper shows how, by correctly combining these processors, a very powerful system can be deployed for real-time, zero-latency large signal convolution.
- Scaper: A Library for Soundscape Synthesis and Augmentation
-
Sound event detection (SED) in environmental recordings is a key topic of research in machine listening, with applications to noise monitoring in smart cities, self-driving cars, bioacoustic monitoring, and indexing of large multimedia collections such as YouTube, to name a few. Developing new solutions for SED often relies on the availability of strongly labeled audio recordings, where the annotation includes the onset, offset and source of every sound event in the recording. Generating such precise annotations manually is very time consuming, and as a result existing datasets for SED with strong labels are scarce and limited in size. Strongly labeled soundscapes are also required for experiments on crowdsourcing audio annotations, since accurate reference annotations are needed in order to evaluate human labeling performance. To address these issues, we present Scaper, an open-source library for soundscape synthesis and augmentation. Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined, soundscape “specification”. To increase the variability of the output, Scaper supports the application of audio transformations such as pitch shifting and time stretching individually to every sound event. To illustrate the potential of the library, we generate a dataset of 10,000 ten-second soundscapes (almost 30 hours) and compare the performance of state-of-the-art algorithms on the new dataset, URBAN-SED, which is made freely available online. We also describe how Scaper was used to generate audio stimuli for an audio labeling crowdsourcing experiment. The paper concludes with suggestions for future improvements and the potential of using Scaper as an augmentation technique for existing datasets.
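Scaper is open source; a typical probabilistic specification, written in the library's documented distribution-tuple style, looks roughly like the sketch below. Paths, labels, and parameter values are placeholders, and exact argument details may differ between library versions.

```python
import scaper

# Placeholder paths; each subfolder of fg_path / bg_path is one sound label.
sc = scaper.Scaper(duration=10.0, fg_path='foreground/', bg_path='background/')
sc.ref_db = -20

sc.add_background(label=('const', 'noise'),
                  source_file=('choose', []),
                  source_time=('const', 0))

sc.add_event(label=('choose', []),              # pick any available label
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 9),      # onset drawn per soundscape
             event_duration=('truncnorm', 3, 1, 0.5, 5),
             snr=('normal', 10, 3),
             pitch_shift=('uniform', -2, 2),    # per-event transformations
             time_stretch=('uniform', 0.8, 1.2))

sc.generate('soundscape.wav', 'soundscape.jams')   # audio plus strong labels
```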
- Transient-to-Noise Ratio Restoration of Coded Applause-Like Signals
-
Coding of signals with a high amount of dense transient events, like applause, raindrops, etc., is usually a difficult task for perceptual audio coders. Especially at low bit rates, coding (and the associated coding noise) leads to smeared transients and a perceived increase in noise-like signal character. We propose a post-processing technique for coded/decoded applause-like signals which is based on a signal decomposition into foreground (transient) clap events and a more noise-like background part. The proposed method aims at restoring the transient-to-noise ratio of the decoded signal using a static frequency-dependent correction profile. The method is shown to provide a significant improvement in subjective audio quality (MUSHRA score) when applied to signals originating from a state-of-the-art audio codec in different configurations.
- Diagonal RNNs in Symbolic Music Modeling
-
In this paper, we propose a new Recurrent Neural Network (RNN) architecture. The novelty is simple: We use diagonal recurrent matrices instead of full. This results in better test likelihood and faster convergence compared to regular full RNNs in most of our experiments. We show the benefits of using diagonal recurrent matrices with popularly used LSTM and GRU architectures as well as with the vanilla RNN architecture, on four standard symbolic music datasets.
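The architectural change is essentially a one-liner; a minimal PyTorch cell illustrating the diagonal recurrence (hyperparameters and initialization are illustrative, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class DiagonalRNNCell(nn.Module):
    """Vanilla RNN cell with a diagonal recurrent matrix:
    h_t = tanh(W x_t + d * h_{t-1} + b), where d is a learned vector."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.in_proj = nn.Linear(input_size, hidden_size)
        self.diag = nn.Parameter(torch.empty(hidden_size).uniform_(-1.0, 1.0))

    def forward(self, x, h):
        # elementwise recurrence replaces the full matrix-vector product
        return torch.tanh(self.in_proj(x) + self.diag * h)
```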
- Deep Recurrent Mixture of Experts for Speech Enhancement
-
Enhancing speech with a deep neural network (DNN) approach has recently attracted the attention of the research community. The most common approach is to feed the noisy speech features into a fully connected DNN to enhance the speech signal or to infer a mask which can be used for the speech enhancement. In this case, one network has to deal with the large variability of the speech signal. Most approaches also disregard the continuity of the speech signal. In this paper we propose a deep recurrent mixture of experts (DRMoE) architecture that addresses these two issues. In order to reduce the large speech variability, we split the network into a mixture of networks (experts), each of which specializes in a specific and simpler task. The time-continuity of the speech signal is taken into account by implementing the experts as recurrent neural networks (RNNs). Experiments demonstrate the applicability of the proposed method to speech enhancement.
- Fast Reconstruction of Sparse Relative Impulse Responses via Second-Order Cone Programming
-
The paper addresses the estimation of the relative transfer function (RTF) using incomplete information. For example, an RTF estimate might be recognized as too inaccurate in a number of frequency bins. When these values are dropped, an incomplete RTF is obtained. The goal is then to reconstruct a complete RTF estimate, based on (1) the remaining values, and (2) the sparsity of the relative impulse response, which is the time-domain counterpart of the RTF. We propose two fast algorithms for the RTF reconstruction that solve a second-order cone program (SOCP), and show their advantages over the LASSO formulation previously proposed in the literature. Simulations with speech signals show that, in terms of speed and accuracy, the proposed algorithms are comparable with the LASSO solution and considerably faster than the generic ECOS solver. The new algorithms are, moreover, easier to control through their parameters, which improves their stability when the number of reliable frequency bins is very low (less than 10%).
- QRD Based MVDR Beamforming for Fast Tracking of Speech and Noise
-
We address a fully dynamic scenario by proposing a unique adaptation scheme. We target efficient noise and speaker tracking for the MVDR beamformer, while decoupling the two tracking mechanisms so that speech distortion does not have to be traded off against noise reduction and vice versa.
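For reference, the closed-form narrowband MVDR weights that such trackers update recursively are sketched below; the paper's QR-decomposition-based recursion itself is not reproduced, and the diagonal loading shown is only a common regularization choice.

```python
import numpy as np

def mvdr_weights(R_noise, d, diag_load=1e-6):
    """Closed-form narrowband MVDR weights for one frequency bin,
        w = R^{-1} d / (d^H R^{-1} d),
    with R_noise the (M x M) noise covariance estimate and d the (M,)
    steering / relative transfer function vector."""
    M = R_noise.shape[0]
    R = R_noise + diag_load * (np.trace(R_noise).real / M) * np.eye(M)
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# applied per bin and frame: y[f, t] = mvdr_weights(R[f], d[f]).conj() @ x[f, :, t]
```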
- Automated Audio Captioning with Recurrent Neural Networks
-
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
- Enhancement of Ambisonic Binaural Reproduction Using Directional Audio Coding with Optimal Adaptive Mixing
-
Headphone reproduction of recorded spatial sound scenes is of great interest to immersive audiovisual applications. Directional Audio Coding (DirAC) is an established perceptually-motivated parametric spatial audio coding method that can achieve high-quality headphone reproduction surpassing popular non-parametric methods such as first-order Ambisonics (FOA), using the same audio input. The early incarnation of headphone DirAC was limited to FOA input and achieved binaural rendering through a virtual loudspeaker approach, resulting in a high computational overhead. We propose an improved DirAC method that directly synthesizes the binaural cues based on the estimated parameters. The method also accommodates higher-order B-format signals and has reduced computational requirements, making it suitable for lightweight processing with fast update rates and head-tracking support. According to listening tests, when using only first-order signals the method performs as well as or better than third-order Ambisonics and far surpasses FOA.
- Optimizing Differentiated Discretization for Audio Circuits Beyond Driving Point Transfer Functions
-
One goal of Virtual Analog modeling of audio circuits is to produce digital models whose behavior matches analog prototypes as closely as possible. Discretization methods provide a systematic approach to generate such models but they introduce frequency response error, such as frequency warping for the trapezoidal method. Recent work showed how using different discretization methods for each reactive element could reduce such error for driving point transfer functions. It further provided a procedure to optimize that error according to a chosen metric through joint selection of the discretization parameters. Here, we extend that approach to the general case of transfer functions with one input and an arbitrary number of outputs expressed as linear combinations of the network variables, and we consider error metrics based on the L2 and the L1 norms. To demonstrate the validity of our approach, we apply the optimization procedure for the response of a Hammond organ vibrato/chorus ladder filter, a 19-output, 36th order filter, where each output frequency response presents many features spread across its passband.
Wednesday, October 18, 12:30 – 14:00
Lunch/Closing
Room: West Dining Room