"""TUT Sound events 2017 Dataset Loader
.. admonition:: Dataset Info
:class: dropdown
**TUT Sound events 2017, Development and Evaluation datasets**
`Audio Research Group,
Tampere University of Technology <http://arg.cs.tut.fi/>`__
*Authors*
* `Toni Heittola <http://www.cs.tut.fi/~heittolt/>`__
* `Annamaria Mesaros <http://www.cs.tut.fi/~mesaros/>`__
* `Tuomas Virtanen <http://www.cs.tut.fi/~tuomasv/>`__
*Recording and annotation*
* Eemi Fagerlund
* Aku Hiltunen
*Links*
* `Development dataset <https://zenodo.org/record/814831>`__
* `Evaluation dataset <https://zenodo.org/record/1040179>`__
*Dataset*
The TUT Sound Events 2017 dataset consists of two subsets: a development
dataset and an evaluation dataset. The data was partitioned into these
subsets based on the number of examples available for each sound event
class, while also taking into account recording location. Because the event
instances belonging to different classes are distributed unevenly within
the recordings, the partitioning of individual classes can be controlled
only to a certain extent; the split was chosen so that the majority of
events are in the development set.
A detailed description of the data recording and annotation procedure is available in:
.. code-block:: latex
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen.
"TUT database for acoustic scene classification and sound event
detection", In 24th European Signal Processing Conference 2016,
Budapest, Hungary, 2016.
The TUT Sound Events 2017 development and evaluation datasets consist of 24
and 8 audio recordings, respectively, all from a single acoustic scene:
* Development: Street (outdoor), totaling 1:32:08
* Evaluation: Street (outdoor), totaling 29:09
The dataset was collected in Finland by Tampere University of Technology
between 06/2015 and 01/2016. The data collection has received funding from
the European Research Council under the `ERC <https://erc.europa.eu/1>`_
Grant Agreement 637422 EVERYSOUND.
*Preparation of the dataset*
Each recording was captured in a different location (a different street).
The recording equipment consisted of a binaural
`Soundman OKM II Klassik/studio A3 <http://www.soundman.de/en/products/>`_
electret in-ear microphone and a `Roland Edirol R-09
<http://www.rolandus.com/products/r-09/>`_ wave recorder, using a 44.1 kHz
sampling rate and 24-bit resolution.
For audio material recorded in private places, written consent was
obtained from all people involved. Material recorded in public places
(residential area) does not require such consent.
Individual sound events in each recording were annotated by a research
assistant using freely chosen labels for sounds. The annotator was first
trained on a few example recordings and was instructed to annotate all
audible sound events and to choose event labels freely. This resulted in a
large set of raw labels. Before target classes were selected, the raw
labels were mapped, merging sounds into classes described by their source:
for example, "car passing by", "car engine running", "car idling", etc.
into "car"; sounds produced by buses and trucks into "large vehicle"; and
"children yelling" and "children talking" into "children". Target sound
event classes for the dataset were then selected based on the frequency of
the obtained labels, resulting in the selection of the most common sounds
for the street acoustic scene, in sufficient numbers for learning acoustic
models.
Due to the high level of subjectivity inherent to the annotation process,
a verification of the reference annotation was done using these mapped
classes. Three persons (other than the annotator) listened to each audio
segment annotated as belonging to one of these classes, marking agreement
about the presence of the indicated sound within the segment.
Agreement/disagreement did not take into account the sound event onset and
offset, only the presence of the sound event within the annotated segment.
Event instances that were confirmed by at least one person were kept,
resulting in elimination of about 10% of the original event instances in
the development set.
The original metadata file is available in the directory `non_verified`.
The ground truth is provided as a list of the sound events present in the
recording, with annotated onset and offset for each sound instance.
Annotations containing only the target sound event classes are in the
directory `meta`.
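As a minimal sketch of how such an annotation list can be read, the snippet
below parses tab-separated rows into onset, offset and label. The row
layouts mirror what this module's `load_events` accepts (three fields, or
two leading fields followed by onset, offset and label). The sample rows,
the file name in them and the helper `parse_events` are invented for
illustration and are not taken from the dataset.

```python
import csv
import io


def parse_events(fhandle):
    # Hypothetical helper: returns a list of (onset, offset, label) tuples.
    events = []
    for row in csv.reader(fhandle, dialect="excel-tab"):
        # Rows with exactly three fields start at the onset; other rows are
        # assumed to carry two leading fields (e.g. file name and scene).
        start = 0 if len(row) == 3 else 2
        events.append((float(row[start]), float(row[start + 1]), row[start + 2]))
    return events


# Build made-up sample rows with the same tab-separated dialect.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel-tab")
writer.writerow(["1.5", "3.25", "car"])
writer.writerow(["audio/street/example.wav", "street", "4.0", "6.5", "people walking"])
buffer.seek(0)

events = parse_events(buffer)
print(events)
```

Both row layouts yield the same kind of tuple, so downstream code does not
need to know which subset a file came from.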
*Event statistics*
The sound event instance counts for the dataset are shown below.
*Development set*
+------------------+-----------------------------+--------------------+
|                  |     Development dataset     | Evaluation dataset |
+------------------+------------+----------------+--------------------+
| Event label      |Verified set|Non-verified set|    Verified set    |
+==================+============+================+====================+
| brakes squeaking |     52     |       59       |         23         |
+------------------+------------+----------------+--------------------+
| car              |    304     |      304       |        106         |
+------------------+------------+----------------+--------------------+
| children         |     44     |       58       |         15         |
+------------------+------------+----------------+--------------------+
| large vehicle    |     61     |       61       |         24         |
+------------------+------------+----------------+--------------------+
| people speaking  |     89     |      117       |         37         |
+------------------+------------+----------------+--------------------+
| people walking   |    109     |      130       |         42         |
+------------------+------------+----------------+--------------------+
| **Total**        |  **659**   |    **729**     |      **247**       |
+------------------+------------+----------------+--------------------+
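The counts in the table make it possible to check the "about 10%" figure
quoted earlier for the development set. The following sketch only redoes
that arithmetic; the dictionary and variable names are illustrative, and the
numbers are copied from the table above.

```python
# Development-set event counts copied from the table above.
verified = {
    "brakes squeaking": 52,
    "car": 304,
    "children": 44,
    "large vehicle": 61,
    "people speaking": 89,
    "people walking": 109,
}
non_verified = {
    "brakes squeaking": 59,
    "car": 304,
    "children": 58,
    "large vehicle": 61,
    "people speaking": 117,
    "people walking": 130,
}

total_verified = sum(verified.values())
total_non_verified = sum(non_verified.values())

# Instances dropped because no listener confirmed them.
removed = total_non_verified - total_verified
removed_share = removed / total_non_verified

print(total_verified, total_non_verified, removed)
print(round(100 * removed_share, 1))
```

With these counts, 70 of 729 instances (roughly 9.6%) were removed by the
verification step, which is consistent with the stated figure.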
*Usage*
The data was partitioned into the **development dataset** and the
**evaluation dataset** based on the number of examples available for each
event class, while also taking into account recording location. Ideally the
subsets should have the same amount of data for each class, or at least the
same relative amount, such as a 70-30% split. Because the event instances
belonging to different classes are distributed unevenly within the
recordings, the partitioning of individual classes can be controlled only
to a certain extent.
The split condition was therefore relaxed so that 65-75% of the instances
of each class were selected into the development set.
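To see how the relaxed condition plays out, the sketch below computes, for
each class, the share of verified event instances that fall in the
development set, using the verified counts from the event statistics table.
Treating the verified counts as the basis for the 65-75% condition is an
assumption made here for illustration, and the variable names are
hypothetical.

```python
# Verified event counts per class: (development, evaluation), copied from
# the event statistics table.
verified_counts = {
    "brakes squeaking": (52, 23),
    "car": (304, 106),
    "children": (44, 15),
    "large vehicle": (61, 24),
    "people speaking": (89, 37),
    "people walking": (109, 42),
}

# Fraction of each class's verified instances that is in the development set.
development_share = {
    label: dev / (dev + evaluation)
    for label, (dev, evaluation) in verified_counts.items()
}

for label, share in development_share.items():
    print(label, round(100 * share, 1))
```

Under this assumption every class lands inside the 65-75% band, with
"brakes squeaking" lowest and "children" highest.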
*Cross-validation setup*
The setup is provided with the dataset in the directory `evaluation_setup`.
*License*
See file `EULA.pdf
<https://github.com/TUT-ARG/DCASE2017-baseline-system/blob/master/EULA.pdf>`_
"""
import csv
import os
from typing import BinaryIO, Optional, TextIO, Tuple

import librosa
import numpy as np

from soundata import annotations, core, download_utils, io, jams_utils
BIBTEX = """
@inproceedings{Mesaros:DCASE:17,
Address = {Munich, Germany},
Author = {Mesaros, A. and Heittola, T. and Diment, A. and Elizalde, B. and
Shah, A. and Vincent, E. and Raj, B. and Virtanen, T.},
Booktitle = {Proceedings of the Detection and Classification of Acoustic
Scenes and Events 2017 Workshop (DCASE2017)},
Month = {November},
Pages = {85--92},
Title = {{DCASE} 2017 Challenge Setup: Tasks, Datasets and Baseline
System},
Year = {2017}}
"""
REMOTES = {
    "development.audio.1": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-development.audio.1.zip",
        url=(
            "https://zenodo.org/record/814831/files/TUT-sound-events-2017-"
            "development.audio.1.zip?download=1"
        ),
        checksum="6f1cd31592b8240a14be3ee513db6a23",
    ),
    "development.audio.2": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-development.audio.2.zip",
        url=(
            "https://zenodo.org/record/814831/files/TUT-sound-events-2017-"
            "development.audio.2.zip?download=1"
        ),
        checksum="adcff03341b84dc8d35f035b93c1efa0",
    ),
    "development.doc": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-development.doc.zip",
        url=(
            "https://zenodo.org/record/814831/files/TUT-sound-events-2017-"
            "development.doc.zip?download=1"
        ),
        checksum="aa6024e70f5bff3fe15d962b01753e23",
    ),
    "development.meta": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-development.meta.zip",
        url=(
            "https://zenodo.org/record/814831/files/TUT-sound-events-2017-"
            "development.meta.zip?download=1"
        ),
        checksum="50e870b3a89ed3452e2a35b508840929",
    ),
    "evaluation.audio": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-evaluation.audio.zip",
        url=(
            "https://zenodo.org/record/1040179/files/TUT-sound-events-2017-"
            "evaluation.audio.zip?download=1"
        ),
        checksum="1d3aa81896be0f142130ca9ca7a2b871",
    ),
    "evaluation.doc": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-evaluation.doc.zip",
        url=(
            "https://zenodo.org/record/1040179/files/TUT-sound-events-2017-"
            "evaluation.doc.zip?download=1"
        ),
        checksum="8bbf41671949edee15d6cdc3f9e726c9",
    ),
    "evaluation.meta": download_utils.RemoteFileMetadata(
        filename="TUT-sound-events-2017-evaluation.meta.zip",
        url=(
            "https://zenodo.org/record/1040179/files/TUT-sound-events-2017-"
            "evaluation.meta.zip?download=1"
        ),
        checksum="a951598abaea87296ca409e30fb0b379",
    ),
}
LICENSE_INFO = "TUT License <https://github.com/TUT-ARG/DCASE2017-baseline-system/blob/master/EULA.pdf>"
class Clip(core.Clip):
    """TUT Sound events 2017 Clip class

    Args:
        clip_id (str): id of the clip

    Attributes:
        audio (np.ndarray, float): audio signal and sample rate
        audio_path (str): path to the audio file
        annotations_path (str): path to the annotations file
        clip_id (str): clip id
        events (soundata.annotations.Events): sound events with start time,
            end time, label and confidence
        non_verified_annotations_path (str): path to the non-verified
            annotations file
        non_verified_events (soundata.annotations.Events): non-verified sound
            events with start time, end time, label and confidence
        split (str): subset the clip belongs to (for experiments):
            development (fold1, fold2, fold3, fold4) or evaluation

    """

    def __init__(self, clip_id, data_home, dataset_name, index, metadata):
        super().__init__(clip_id, data_home, dataset_name, index, metadata)

        self.audio_path = self.get_path("audio")
        self.annotations_path = self.get_path("annotations")
        self.non_verified_annotations_path = self.get_path("non_verified_annotations")

    @property
    def audio(self) -> Optional[Tuple[np.ndarray, float]]:
        """The clip's audio

        Returns:
            * np.ndarray - audio signal
            * float - sample rate

        """
        return load_audio(self.audio_path)

    @property
    def split(self):
        """The clip's split.

        Returns:
            * str - subset the clip belongs to (for experiments): development (fold1, fold2, fold3, fold4) or evaluation

        """
        return self._clip_metadata.get("split")

    @core.cached_property
    def events(self) -> Optional[annotations.Events]:
        """The clip's events.

        Returns:
            * annotations.Events - sound events with start time, end time, label and confidence

        """
        return load_events(self.annotations_path)

    @core.cached_property
    def non_verified_events(self) -> Optional[annotations.Events]:
        """The clip's non-verified events.

        Returns:
            * annotations.Events - non-verified sound events with start time, end time, label and confidence

        """
        return load_events(self.non_verified_annotations_path)

    def to_jams(self):
        """Get the clip's data in jams format

        Returns:
            jams.JAMS: the clip's data in jams format

        """
        return jams_utils.jams_converter(
            audio_path=self.audio_path, events=self.events, metadata=self._clip_metadata
        )
@io.coerce_to_bytes_io
def load_audio(fhandle: BinaryIO, sr=None) -> Tuple[np.ndarray, float]:
    """Load a TUT Sound events 2017 audio file

    Args:
        fhandle (str or file-like): File-like object or path to audio file
        sr (int or None): sample rate for loaded audio, None by default, which
            uses the file's original sample rate of 44100 without resampling.

    Returns:
        * np.ndarray - the stereo audio signal
        * float - The sample rate of the audio file

    """
    audio, sr = librosa.load(fhandle, sr=sr, mono=False)
    return audio, sr
@io.coerce_to_string_io
def load_events(fhandle: TextIO) -> annotations.Events:
    """Load a TUT Sound events 2017 annotation file

    Args:
        fhandle (str or file-like): File-like object or path to the sound
            events annotation file

    Returns:
        Events: sound events annotation data

    """
    times = []
    labels = []
    confidence = []
    reader = csv.reader(fhandle, delimiter="\t")
    for line in reader:
        # Annotation files in dev and eval use different formats: rows with
        # exactly three fields start at the onset, while all other rows carry
        # two leading fields before it.
        offset = 0 if len(line) == 3 else 2
        times.append([float(line[offset]), float(line[offset + 1])])
        labels.append(line[offset + 2])
        confidence.append(1.0)

    events_data = annotations.Events(
        np.array(times), "seconds", labels, "open", np.array(confidence)
    )
    return events_data
@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
    """The TUT Sound events 2017 dataset"""

    def __init__(self, data_home=None):
        super().__init__(
            data_home,
            name="tut2017se",
            clip_class=Clip,
            bibtex=BIBTEX,
            remotes=REMOTES,
            license_info=LICENSE_INFO,
        )

    @core.copy_docs(load_audio)
    def load_audio(self, *args, **kwargs):
        return load_audio(*args, **kwargs)

    @core.copy_docs(load_events)
    def load_events(self, *args, **kwargs):
        return load_events(*args, **kwargs)

    @core.cached_property
    def _metadata(self):
        splits = [
            "development.fold1",
            "development.fold2",
            "development.fold3",
            "development.fold4",
            "evaluation",
        ]

        metadata_index = {}
        for split in splits:
            # Pick the evaluation setup file that lists the clips of this split.
            if split.split(".")[0] == "development":
                evaluation_setup_path = (
                    "TUT-sound-events-2017-development/evaluation_setup"
                )
                fold = split.split(".")[1]
                evaluation_setup_file = os.path.join(
                    self.data_home,
                    evaluation_setup_path,
                    "street_{}_test.txt".format(fold),
                )
            else:
                evaluation_setup_path = (
                    "TUT-sound-events-2017-evaluation/evaluation_setup"
                )
                evaluation_setup_file = os.path.join(
                    self.data_home, evaluation_setup_path, "street_test.txt"
                )

            with open(evaluation_setup_file) as csv_file:
                csv_reader = csv.reader(csv_file, delimiter="\t")
                for row in csv_reader:
                    # The clip id is the audio file name without its extension.
                    file_name = os.path.basename(row[0])
                    clip_id = file_name.replace(".wav", "")
                    metadata_index[clip_id] = {"split": split}

        return metadata_index