Periodic Reporting for period 3 - EVERYSOUND (Computational Analysis of Everyday Soundscapes)

Teaser

Sounds carry a large amount of information about our everyday environments and physical events that take place in them. For example, when a car is passing by, one can perceive the approximate size and speed of the car. Similar information can be obtained from many other sound...

Summary

Sounds carry a large amount of information about our everyday environments and physical events that take place in them. For example, when a car is passing by, one can perceive the approximate size and speed of the car. Similar information can be obtained from many other sound sources such as humans, animals, etc. Sound can be captured easily and non-intrusively by cheap recording devices and transmitted further: for example, tens of hours of audio are uploaded to the internet every minute, e.g. in the form of YouTube videos. Extracting information from everyday sounds is easy for humans, but today's computational audio analysis algorithms are not able to recognize the individual sounds within everyday soundscapes.
Computational analysis of everyday soundscapes will open up vast possibilities to utilize the information encoded in sound. Automatically acquired high-level descriptions of audio will enable the development of content-based multimedia query methods, which are not feasible with today's technology. Robots, smart phones, and other devices able to recognize sounds will become aware of physical events in their surroundings. Intelligent monitoring systems will be able to automatically detect events of interest (danger, crime, etc.) and classify different sources of noise. Automatic sound recognition will allow new assistive technologies for the hearing impaired, for example by visualizing sounds. Producing descriptions of audio will give new tools for geographical, social, and cultural studies to analyze human and non-human activity in urban and rural areas. Acquiring descriptions of animal sounds will give tools to analyze the biodiversity of an area cost-efficiently.
The main goal of EVERYSOUND is to develop computational methods for automatically producing high-level descriptions of general sounds encountered in everyday soundscapes. The specific objectives of EVERYSOUND, related to the components that contribute to the main goal of the project and to the whole framework, are:
O1: Production of a large-scale corpus consisting of audio material and reference descriptions from a large number of everyday contexts, for developing and benchmarking computational everyday soundscape analysis methods.
O2: Development of a taxonomy for sounds in everyday environments.
O3: Development of robust pattern recognition algorithms that allow recognition of multiple co-occurring sounds that may have been distorted by reverberation and channel effects.
O4: Development of contextual models for everyday soundscapes that will take into account the relationships between multiple sources, their acoustic characteristics, and the context.

Work performed

WP1 focuses on collecting a corpus of everyday soundscapes, including audio and annotations of sound events. The data collection was done using in-ear binaural microphones with a mobile recorder, and all perceived sound events in the recordings were annotated. The recordings were made in several different locations in Finland, covering 15 different everyday contexts: Bus, Cafe/restaurant, Car, City center, Forest path, Grocery store, Home, Lakeside beach, Library, Metro station, Office, Residential area, Train, Tram, Urban park. For each context, recordings from several different locations were made.
The first data collection effort resulted in the datasets TUT Acoustic Scenes 2016 and TUT Sound Events 2016, which were made publicly available. The datasets were later extended to TUT Acoustic Scenes 2017 and TUT Sound Events 2017, which were also made publicly available. Dissemination of the datasets was done through the DCASE challenges in the respective years, reaching a very wide audience in the research community. As the main organizers and coordinators of tasks within the challenge, we also proposed a set of metrics for the evaluation of sound event detection and provided a toolbox for their calculation. In addition, analysis of the challenge entries and preliminary studies of the data resulted in a number of publications that present the datasets in detail, describe the challenge organization and outcome, and analyze human performance in similar testing conditions. In connection with the challenge, we launched a workshop on Detection and Classification of Acoustic Scenes and Events.
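
The segment-based metrics mentioned above (precision, recall, and F-score computed over fixed-length time segments) can be illustrated with a short sketch. This is not the project's toolbox, but a simplified, hypothetical re-implementation of a micro-averaged segment-based F-score with a one-second segment length; all function and variable names are illustrative.

```python
# Illustrative sketch of a segment-based sound event detection metric
# (micro-averaged F-score over fixed-length segments). The 1-second segment
# length and all names are assumptions for this example, not the toolbox API.
from collections import defaultdict

def events_to_segments(events, duration, seg_len=1.0):
    """Convert (onset, offset, label) events into active labels per segment."""
    n_segments = int(duration / seg_len) + 1
    active = defaultdict(set)
    for onset, offset, label in events:
        first, last = int(onset / seg_len), int(offset / seg_len)
        for k in range(first, min(last, n_segments - 1) + 1):
            active[k].add(label)
    return active, n_segments

def segment_based_f1(reference, estimated, duration, seg_len=1.0):
    """Micro-averaged segment-based F1 score."""
    ref, n = events_to_segments(reference, duration, seg_len)
    est, _ = events_to_segments(estimated, duration, seg_len)
    tp = fp = fn = 0
    for k in range(n):
        tp += len(ref[k] & est[k])   # labels active in both reference and estimate
        fp += len(est[k] - ref[k])   # estimated but not in reference
        fn += len(ref[k] - est[k])   # in reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: one reference event and a slightly shifted detection
reference = [(0.5, 3.2, "car passing by")]
estimated = [(1.0, 3.5, "car passing by")]
print(segment_based_f1(reference, estimated, duration=10.0))
```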

WP2 focuses on developing a multilayer taxonomy for sound events. Annotation of sound events was done such that each label describes the sound as closely as possible in terms of sound production, using a noun to indicate the object or being that produced the sound and a verb to indicate the action that produced it. This aligns with the way humans think about sounds when identifying them, by associating the sound event with its cause and reacting accordingly.
According to the project plan, we used the hierarchical WordNet taxonomy to label both the sound sources and the sound production mechanisms. To support the evaluation and development of sound event detection methods, we manually selected a suitable level in the hierarchy so that there is a sufficient number of examples from each class.
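
To illustrate how free-form source labels can be grouped at a coarser level of the WordNet hierarchy, the following sketch uses NLTK's WordNet interface. The example labels and the chosen hypernym depth are illustrative assumptions, not the project's actual taxonomy or labeling tool.

```python
# Sketch: mapping sound source labels to a coarser level of the WordNet
# hierarchy. Requires `pip install nltk` and `nltk.download('wordnet')`.
# The labels and the hypernym depth below are illustrative assumptions.
from nltk.corpus import wordnet as wn

def hypernym_chain(word):
    """Hypernym path of the first noun sense of `word`, from root to leaf."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return [word]
    path = synsets[0].hypernym_paths()[0]
    return [s.lemmas()[0].name() for s in path]

def coarse_label(word, depth=4):
    """Pick the label `depth` steps below the WordNet root as the class name."""
    chain = hypernym_chain(word)
    return chain[min(depth, len(chain) - 1)]

for source in ["car", "dog", "dishwasher"]:
    print(source, "->", coarse_label(source))
```
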
WP3 focuses on developing pattern classification methods for mapping acoustic features to information about the sound events present in the input audio. We focused mainly on new deep learning methods, which seemed to be the most promising technique based on the first experiments. We proposed a Convolutional Recurrent Neural Network (CRNN) based method for sound event detection (SED). We obtained state-of-the-art results with this method on four different SED and audio tagging datasets, and we published our findings in a journal article. The developed general-purpose sound event detection method was shown to be successful in several more specific tasks, including two different SED challenges: the QMUL bird audio detection challenge 2017 and DCASE 2017 Task 2 on rare sound event detection. We obtained second place in both challenges. Finally, we investigated filterbank learning approaches using convolutional neural networks initialized to extract mel band energy features. While the network parameters are updated during training for better SED performance, the filterbank layer converges to a learned filterbank that resembles the mel filterbank but is optimized for the given SED task.
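
A rough sketch of a CRNN of the kind described above is given below; the layer sizes, number of mel bands, and class count are placeholders rather than the configuration reported in the publication.

```python
# PyTorch sketch of a CRNN for polyphonic sound event detection: convolutional
# layers extract local time-frequency features, a bidirectional GRU models
# temporal context, and a sigmoid layer outputs frame-wise event activities.
# All layer sizes are illustrative, not the published configuration.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=6, cnn_channels=64, rnn_hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(cnn_channels), nn.ReLU(),
            nn.MaxPool2d((1, 5)),   # pool along frequency only, keep time resolution
            nn.Conv2d(cnn_channels, cnn_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(cnn_channels), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        freq_after_pool = n_mels // 5 // 4   # 40 -> 8 -> 2 frequency bins
        self.rnn = nn.GRU(cnn_channels * freq_after_pool, rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, n_mels) log mel energies
        x = self.cnn(x.unsqueeze(1))              # (batch, C, time, freq')
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, time, C * freq')
        x, _ = self.rnn(x)
        return torch.sigmoid(self.classifier(x))  # frame-wise event activities

model = CRNN()
frames = torch.randn(2, 500, 40)   # 2 clips, 500 frames, 40 mel bands
print(model(frames).shape)         # torch.Size([2, 500, 6])
```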

We also explored sound tagging and weakly labeled sound event detection. We developed a method to learn sound events with their time boundaries from weakly labeled datasets (which provide only the sound events present in a 10-second recording, without their time boundaries). The dataset was a subset of Google's recently released AudioSet, amounting to about 140 h of data. The proposed method was
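
The weakly labeled setting described above (clip-level tags only) can be illustrated with a generic multiple-instance learning sketch: frame-wise predictions are pooled into a clip-level prediction, the loss is computed against the clip-level tags, and the frame-wise output provides time boundaries at test time. This is a simplified illustration, not the specific method developed in the project; the pooling function and all dimensions are assumptions.

```python
# Generic sketch of learning sound events from weak (clip-level) labels.
import torch
import torch.nn as nn

def linear_softmax_pool(frame_probs):
    """Pool frame-wise probabilities (batch, time, classes) to clip level;
    frames with higher probability get proportionally higher weight."""
    return (frame_probs ** 2).sum(dim=1) / frame_probs.sum(dim=1).clamp(min=1e-7)

# A stand-in frame-wise model; the CRNN sketched above could be used instead.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 6), nn.Sigmoid())
criterion = nn.BCELoss()

features = torch.randn(8, 1000, 40)              # 8 ten-second clips of frame features
clip_tags = torch.randint(0, 2, (8, 6)).float()  # weak labels: which events occur per clip

frame_probs = model(features)                    # (8, 1000, 6) frame-wise activities
clip_probs = linear_softmax_pool(frame_probs)    # (8, 6) clip-level predictions
loss = criterion(clip_probs, clip_tags)
loss.backward()

# At test time, thresholding frame_probs yields events with time boundaries.
detections = frame_probs > 0.5
```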

Final results

The project will help establish the research field of sound classification and detection. The project will make publicly available an everyday sound corpus from a wide range of different contexts, which will provide a basis for research in the field of everyday sound processing. The project will develop a hierarchical multilayer taxonomy for describing everyday sounds, whose expressive capability clearly exceeds that of the taxonomies used by previous computational sound analysis methods.

The project will develop novel pattern recognition algorithms that are able to deal with mixtures of everyday sounds and will therefore allow pattern recognition in highly noisy scenarios and recognition of multiple co-occurring sounds. The project will develop robust acoustic pattern classification methods that deal with training-test mismatch, which will enable easier practical deployment of sound event detection technologies. The project will also develop interactive learning methods to enable annotating sound event databases more efficiently. Novel audio captioning methods will enable producing richer descriptions of soundscapes than sound event labels can provide.