Source code for soundata.datasets.fsd50k

"""FSD50K Dataset Loader

.. admonition:: Dataset Info
    :class: dropdown

    **FSD50K: an Open Dataset of Human-Labeled Sound Events**

    *Created By:*

        | Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra.
        | Music Technology Group, Universitat Pompeu Fabra (Barcelona). 
        
    Version 1.0

    *Description:*
        FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally
        distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group
        of Universitat Pompeu Fabra.

    *Audio Files Included:*
        * FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio.
        * The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms,
          including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary
          can be inspected in vocabulary.csv.
        * Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of
          Freesound users when recording sounds.
        * All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.

    *Annotations Included:*
        * The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized
          with a subset of the AudioSet Ontology. Please refer to the included vocabulary.csv file for a complete list of
          considered classes.
        * The acoustic material has been manually labeled by humans following a data labeling
          process using the Freesound Annotator platform.
        * Ground truth labels are provided at the clip-level (i.e., weak labels).
        * Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice,
          Respiratory sounds, and Domestic sounds, home sounds.
        * Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential
          problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original
          Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in
          vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.


    *Organization:*
        FSD50K is split in two subsets: the developement (dev) and the evaluation (eval) sets.
        Especifications of both subsets is detailed below:

    * Dev set:
        * 40,966 audio clips totalling 80.4 hours of audio
        * Avg duration/clip: 7.1s
        * 114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)
        * Labels are correct but could be occasionally incomplete
        * A train/validation split is provided. If a different split is used, it should be specified for reproducibility
          and fair comparability of results

    * Eval set:
        * 10,231 audio clips totalling 27.9 hours of audio
        * Avg duration/clip: 9.8s
        * 38,596 smeared labels
        * Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

    *Ground-truth Files Included:*
        FSD50K ground-truth is represented through the following file structure:

    * dev.csv:
        Each row (i.e. audio clip) of dev.csv contains the following information:

        * fname:
            The file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav
            in disk. This number is the Freesound id. We always use Freesound ids as filenames.

        * labels:
            The class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels
            have been propagated in the upwards direction to the root of the ontology. More details about the label
            smearing process can be found in Appendix D of our paper.

        * mids:
            The Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology
            specification.

        * split:
            Whether the clip belongs to train or val (see paper for details on the proposed split)


    * eval.csv:
        Rows in eval.csv follow the same format as dev.csv, except that there is no split column.

    *Metadata Files Included:*
        To allow a variety of analysis and approaches with FSD50K, we provide the following metadata:

    * class_info_FSD50K.json:
        Python dictionary where each entry corresponds to one sound class and  contains: FAQs
        utilized during the annotation of the class, examples (representative audio clips), and verification_examples
        (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by
        the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.


    * dev_clips_info_FSD50K.json:
        Python dictionary where each entry corresponds to one dev clip and contains: title,
        description, tags, clip license, and the uploader name. All these metadata are provided by the uploader.


    * eval_clips_info_FSD50K.json:
        Same as above, but with eval clips.


    * pp_pnp_ratings.json:
        Python dictionary where each entry corresponds to one clip in the dataset and contains
        the PP/PNP ratings for the labels associated with the clip. More specifically, these ratings are gathered for the
        labels validated in the validation task. This file includes 59,485 labels for the 51,197 clips in FSD50K.
        Out of these labels:

            * 56,095 labels have inter-annotator agreement (PP twice, or PNP twice). Each of these combinations can be
              occasionally accompanied by other (non-positive) ratings.

            * 3390 labels feature other rating configurations such as i) only one PP rating and one PNP rating (and nothing
              else). This can be considered inter-annotator agreement at the "Present" level; ii) only one PP rating (and
              nothing else); iii) only one PNP rating (and nothing else).

        Ratings' legend: PP=1; PNP=0.5; U=0; NP=-1.

        Note: The PP/PNP ratings have been provided in the validation task. Subsequently, a subset of these clips
        corresponding to the eval set was exhaustively labeled in the refinement task, hence receiving additional labels
        in many cases. For these eval clips, you might want to check their labels in eval.csv in order to have more info
        about their audio content.


    *collection folder:*
        This folder contains metadata for what we call the sound collection format. This format consists of
        the raw annotations gathered, featuring all generated class labels without any restriction.
        We provide the collection format to make available some annotations that do not appear in the FSD50K ground truth
        release. This typically happens in the case of classes for which we gathered human-provided annotations, but that
        were discarded in the FSD50K release due to data scarcity (more specifically, they were merged with their parents).
        In other words, the main purpose of the collection format is to make available annotations for tiny classes.
        The format of these files in analogous to that of the files in FSD50K.ground_truth/. A couple of examples show the
        differences between collection and ground truth formats:

        * clip:  labels_in_collection - labels_in_ground_truth

            * 51690:  Owl - Bird,Wild_Animal,Animal

            * 190579:  Toothbrush,Electric_toothbrush - Domestic_sounds_and_home_sounds

        In the first example, raters provided the label Owl. However, due to data scarcity, Owl labels were merged into
        their parent Bird. Then, labels Wild_Animal,Animal were added via label propagation (smearing). The second example
        shows one of the most extreme cases, where raters provided the labels Electric_toothbrush,Toothbrush, which both
        had few data. Hence, they were merged into Toothbrush's parent, which unfortunately is Domestic_sounds_and_home_sounds
        (a rather vague class containing a variety of children sound classes).

        Note: Labels in the collection format are not smeared.

        Note: While in FSD50K's ground truth the vocabulary encompasses 200 classes (common for dev and eval), since the
        collection format is composed of raw annotations, the vocabulary here is much larger (over 350 classes), and it is
        slightly different in dev and eval.

    *Please Acknowledge FSD50K in Academic Research:*
        If you use the FSD50K Dataset please cite the following paper:

        .. code-block:: latex
        
            Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

        The authors would like to thank everyone who contributed to FSD50K with annotations, and especially Mercedes
        Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez for their commitment and
        perseverance. The authors would also like to thank Daniel P.W. Ellis and Manoj Plakal from Google Research for
        valuable discussions. This work is partially supported by the European Union’s Horizon 2020 research and innovation
        programme under grant agreement No 688382 AudioCommons, and two Google Faculty Research Awards 2017 and 2018, and
        the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).


    *License:*
        All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as
        defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some
        forbidding further commercial reuse. For attribution purposes and to facilitate attribution of these files to third
        parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in
        the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json. These licenses are CC0, CC-BY, CC-BY-NC and
        CC Sampling+.

        In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is
        released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file.

        Usage of FSD50K for commercial purposes: If you'd like to use FSD50K for commercial purposes, please contact Eduardo
        Fonseca and Frederic Font at eduardo.fonseca@upf.edu and frederic.font@upf.edu.


    *Feedback:*
        For further questions, please contact eduardo.fonseca@upf.edu, or join the freesound-annotator Google Group.

"""

import os
from typing import BinaryIO, Optional, Tuple

import librosa
import csv
import json
import logging
import subprocess
import numpy as np

from soundata import download_utils, jams_utils, core, annotations, io


BIBTEX = """
@dataset{fonseca2020fsd50k,
    title={FSD50K: an Open Dataset of Human-Labeled Sound Events}, 
    author={Eduardo Fonseca and Xavier Favory and Jordi Pons and Frederic Font and Xavier Serra},
    year={2020},
    eprint={2010.00475},
    archivePrefix={arXiv},
    primaryClass={cs.SD}
}
"""

# a dictionary key that has a list of RemoteFileMetadata implies a multi-part zip
# and will be processed as such using the zip subprocess (see soundata.download_utils)
REMOTES = {
    "FSD50K.dev_audio": [
        download_utils.RemoteFileMetadata(
            filename="FSD50K.dev_audio.zip",
            url="https://zenodo.org/record/4060432/files/FSD50K.dev_audio.zip?download=1",
            checksum="c480d119b8f7a7e32fdb58f3ea4d6c5a",
        ),
        download_utils.RemoteFileMetadata(
            filename="FSD50K.dev_audio.z01",
            url="https://zenodo.org/record/4060432/files/FSD50K.dev_audio.z01?download=1",
            checksum="faa7cf4cc076fc34a44a479a5ed862a3",
        ),
        download_utils.RemoteFileMetadata(
            filename="FSD50K.dev_audio.z02",
            url="https://zenodo.org/record/4060432/files/FSD50K.dev_audio.z02?download=1",
            checksum="8f9b66153e68571164fb1315d00bc7bc",
        ),
        download_utils.RemoteFileMetadata(
            filename="FSD50K.dev_audio.z03",
            url="https://zenodo.org/record/4060432/files/FSD50K.dev_audio.z03?download=1",
            checksum="1196ef47d267a993d30fa98af54b7159",
        ),
        download_utils.RemoteFileMetadata(
            filename="FSD50K.dev_audio.z04",
            url="https://zenodo.org/record/4060432/files/FSD50K.dev_audio.z04?download=1",
            checksum="d088ac4e11ba53daf9f7574c11cccac9",
        ),
        download_utils.RemoteFileMetadata(
            filename="FSD50K.dev_audio.z05",
            url="https://zenodo.org/record/4060432/files/FSD50K.dev_audio.z05?download=1",
            checksum="81356521aa159accd3c35de22da28c7f",
        ),
    ],
    "FSD50K.eval_audio": [
        download_utils.RemoteFileMetadata(
            filename="FSD50K.eval_audio.zip",
            url="https://zenodo.org/record/4060432/files/FSD50K.eval_audio.zip?download=1",
            checksum="6fa47636c3a3ad5c7dfeba99f2637982",
        ),
        download_utils.RemoteFileMetadata(
            filename="FSD50K.eval_audio.z01",
            url="https://zenodo.org/record/4060432/files/FSD50K.eval_audio.z01?download=1",
            checksum="3090670eaeecc013ca1ff84fe4442aeb",
        ),
    ],
    "ground_truth": download_utils.RemoteFileMetadata(
        filename="FSD50K.ground_truth.zip",
        url="https://zenodo.org/record/4060432/files/FSD50K.ground_truth.zip?download=1",
        checksum="ca27382c195e37d2269c4c866dd73485",
    ),
    "metadata": download_utils.RemoteFileMetadata(
        filename="FSD50K.metadata.zip",
        url="https://zenodo.org/record/4060432/files/FSD50K.metadata.zip?download=1",
        checksum="b9ea0c829a411c1d42adb9da539ed237",
    ),
    "documentation": download_utils.RemoteFileMetadata(
        filename="FSD50K.doc.zip",
        url="https://zenodo.org/record/4060432/files/FSD50K.doc.zip?download=1",
        checksum="3516162b82dc2945d3e7feba0904e800",
    ),
}

LICENSE_INFO = "Creative Commons Attribution 4.0 International"


[docs]class Clip(core.Clip):
    """FSD50K Clip class

    Args:
        clip_id (str): id of the clip

    Attributes:
        audio (np.ndarray, float): path to the audio file
        audio_path (str): path to the audio file
        clip_id (str): clip id
        description (str): description of the sound provided by the Freesound uploader
        mids (soundata.annotations.Tags): tag (labels) encoded in Audioset formatting
        pp_pnp_ratings (dict): PP/PNP ratings given to the main label of the clip
        split (str): flag to identify if clip belongs to developement, evaluation or validation splits
        tags (soundata.annotations.Tags): tag (label) of the clip + confidence
        title (str): the title of the uploaded file in Freesound
    """

    def __init__(self, clip_id, data_home, dataset_name, index, metadata):
        super().__init__(clip_id, data_home, dataset_name, index, metadata)

        self.audio_path = self.get_path("audio")

    @property
    def audio(self) -> Optional[Tuple[np.ndarray, float]]:
        """The clip's audio.

        Returns:
            * np.ndarray - audio signal
            * float - sample rate

        """
        return load_audio(self.audio_path)

    @property
    def tags(self):
        """The clip's tags.

        Returns:
            * annotations.Tags - tag (label) of the clip + confidence

        """
        return annotations.Tags(
            self._clip_metadata["ground_truth"].get("tags"),
            "open",
            np.array([1.0] * len(self._clip_metadata["ground_truth"].get("tags"))),
        )

    @property
    def mids(self):
        """The clip's mids.

        Returns:
            * annotations.Tags - tag (labels) encoded in Audioset formatting

        """
        return annotations.Tags(
            self._clip_metadata["ground_truth"].get("mids"),
            "open",
            np.array([1.0] * len(self._clip_metadata["ground_truth"].get("tags"))),
        )

    @property
    def split(self):
        """The clip's split.

        Returns:
            * str - flag to identify if clip belongs to developement, evaluation or validation splits

        """
        return self._clip_metadata["ground_truth"].get("split")

    @property
    def title(self):
        """The clip's title.

        Returns:
            * str - the title of the uploaded file in Freesound

        """
        return self._clip_metadata["clip_info"].get("title")

    @property
    def description(self):
        """The clip's description.

        Returns:
            * str - description of the sound provided by the Freesound uploader

        """
        return self._clip_metadata["clip_info"].get("description")

    @property
    def pp_pnp_ratings(self):
        """The clip's PP/PNP ratings.

        Returns:
            * dict - PP/PNP ratings given to the main label of the clip
        """
        return self._clip_metadata.get("pp_pnp_ratings")

[docs]    def to_jams(self):
        """Get the clip's data in jams format

        Returns:
            jams.JAMS: the clip's data in jams format

        """
        return jams_utils.jams_converter(
            audio_path=self.audio_path,
            tags=self.tags,
            metadata={
                "split": self._clip_metadata["ground_truth"].get("split"),
                "mids": self._clip_metadata["ground_truth"].get("mids"),
                "pp_pnp_ratings": self._clip_metadata.get("pp_pnp_ratings"),
                "title": self._clip_metadata["clip_info"].get("title"),
                "description": self._clip_metadata["clip_info"].get("description"),
                "freesound_tags": self._clip_metadata["clip_info"].get("tags"),
                "license": self._clip_metadata["clip_info"].get("license"),
                "uploader": self._clip_metadata["clip_info"].get("uploader"),
            },
        )


[docs]@io.coerce_to_bytes_io
def load_audio(fhandle: BinaryIO, sr=None) -> Tuple[np.ndarray, float]:
    """Load a FSD50K audio file

    Args:
        fhandle (str or file-like): File-like object or path to audio file
        sr (int or None): sample rate for loaded audio, 44100 Hz by default.
            If different from file's sample rate it will be resampled on load.
            Use None to load the file using its original sample rate (sample rate
            varies from file to file).

    Returns:
        * np.ndarray - the mono audio signal
        * float - The sample rate of the audio file
    """
    audio, sr = librosa.load(fhandle, sr=sr, mono=True)
    return audio, sr


[docs]def load_ground_truth(data_path):
    """Load ground truth files of FSD50K

    Args:
        data_path (str): Path to the ground truth file

    Returns:
        * ground_truth_dict (dict): ground truth dict of the clips in the input split
        * clip_ids (list): list of clip ids of the input split
    """
    ground_truth_dict = {}
    clip_ids = []
    with open(data_path, "r") as fhandle:
        reader = csv.reader(fhandle, delimiter=",")
        next(reader)
        for line in reader:
            if len(line) == 3:
                if "collection" not in data_path:
                    ground_truth_dict[line[0]] = {
                        "tags": (
                            list(line[1].split(","))
                            if "," in line[1]
                            else list([line[1]])
                        ),
                        "mids": (
                            list(line[2].split(","))
                            if "," in line[2]
                            else list([line[2]])
                        ),
                        "split": "test",
                    }
                else:
                    ground_truth_dict[line[0]] = {
                        "tags": (
                            list(line[1].split(","))
                            if "," in line[1]
                            else list([line[1]])
                        ),
                        "mids": (
                            list(line[2].split(","))
                            if "," in line[2]
                            else list([line[2]])
                        ),
                    }
                clip_ids.append(line[0])
            if len(line) == 4:
                ground_truth_dict[line[0]] = {
                    "tags": (
                        list(line[1].split(",")) if "," in line[1] else list([line[1]])
                    ),
                    "mids": (
                        list(line[2].split(",")) if "," in line[2] else list([line[2]])
                    ),
                    "split": "train" if line[3] == "train" else "validation",
                }
                clip_ids.append(line[0])

    return ground_truth_dict, clip_ids


[docs]def load_fsd50k_vocabulary(data_path):
    """Load vocabulary of FSD50K to relate FSD50K labels with AudioSet onthology

    Args:
        data_path (str): Path to the vocabulary file

    Returns:
        * fsd50k_to_audioset (dict): vocabulary to convert FSD50K to AudioSet
        * audioset_to_fsd50k (dict): vocabulary to convert from AudioSet to FSD50K
    """
    fsd50k_to_audioset = {}
    audioset_to_fsd50k = {}
    with open(data_path, "r") as fhandle:
        reader = csv.reader(fhandle, delimiter=",")
        for line in reader:
            fsd50k_to_audioset[line[1]] = line[2]
            audioset_to_fsd50k[line[2]] = line[1]

    return fsd50k_to_audioset, audioset_to_fsd50k


[docs]@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
    """The FSD50K dataset"""

    def __init__(self, data_home=None):
        super().__init__(
            data_home,
            name="fsd50k",
            clip_class=Clip,
            bibtex=BIBTEX,
            remotes=REMOTES,
            license_info=LICENSE_INFO,
        )

        # Ground_truth paths
        self.ground_truth_dev_path = os.path.join(
            self.data_home, "FSD50K.ground_truth", "dev.csv"
        )
        self.ground_truth_eval_path = os.path.join(
            self.data_home, "FSD50K.ground_truth", "eval.csv"
        )

        # Sound collection format labels paths
        self.collection_dev_path = os.path.join(
            self.data_home, "FSD50K.metadata", "collection", "collection_dev.csv"
        )
        self.collection_eval_path = os.path.join(
            self.data_home, "FSD50K.metadata", "collection", "collection_eval.csv"
        )

        # Clip metadata paths
        self.clips_info_dev_path = os.path.join(
            self.data_home, "FSD50K.metadata", "dev_clips_info_FSD50K.json"
        )
        self.clips_info_eval_path = os.path.join(
            self.data_home, "FSD50K.metadata", "eval_clips_info_FSD50K.json"
        )

        # Class info path
        self.label_info_path = os.path.join(
            self.data_home, "FSD50K.metadata", "class_info_FSD50K.json"
        )

        # PP/PNP ratings path
        self.pp_pnp_ratings_path = os.path.join(
            self.data_home, "FSD50K.metadata", "pp_pnp_ratings_FSD50K.json"
        )

        # Vocabulary paths
        self.vocabulary_path = os.path.join(
            self.data_home, "FSD50K.ground_truth", "vocabulary.csv"
        )
        self.collection_vocabulary_dev_path = os.path.join(
            self.data_home,
            "FSD50K.metadata",
            "collection",
            "vocabulary_collection_dev.csv",
        )
        self.collection_vocabulary_eval_path = (
            self.collection_vocabulary_dev_path.replace("_dev", "_eval")
        )

[docs]    @core.copy_docs(load_audio)
    def load_audio(self, *args, **kwargs):
        return load_audio(*args, **kwargs)

[docs]    @core.copy_docs(load_ground_truth)
    def load_ground_truth(self, *args, **kwargs):
        return load_ground_truth(*args, **kwargs)

[docs]    @core.copy_docs(load_fsd50k_vocabulary)
    def load_fsd50k_vocabulary(self, *args, **kwargs):
        return load_fsd50k_vocabulary(*args, **kwargs)

    @property
    def fsd50k_to_audioset(self):
        return load_fsd50k_vocabulary(self.vocabulary_path)[0]

    @property
    def audioset_to_fsd50k(self):
        return load_fsd50k_vocabulary(self.vocabulary_path)[1]

    @property
    def label_info(self):
        return (
            json.load(open(self.label_info_path, "r"))
            if os.path.exists(self.label_info_path)
            else None
        )

    @property
    def collection_fsd50k_to_audioset(self):
        collection_fsd50k_to_audioset = {
            "dev": load_fsd50k_vocabulary(self.collection_vocabulary_dev_path)[0],
            "eval": load_fsd50k_vocabulary(self.collection_vocabulary_eval_path)[0],
        }
        return collection_fsd50k_to_audioset

    @property
    def collection_audioset_to_fsd50k(self):
        collection_audioset_to_fsd50k = {
            "dev": load_fsd50k_vocabulary(self.collection_vocabulary_dev_path)[1],
            "eval": load_fsd50k_vocabulary(self.collection_vocabulary_eval_path)[1],
        }
        return collection_audioset_to_fsd50k

    @core.cached_property
    def _metadata(self):
        metadata_index = {}

        ground_truth_dev, clip_ids_dev = load_ground_truth(self.ground_truth_dev_path)
        ground_truth_eval, clip_ids_eval = load_ground_truth(
            self.ground_truth_eval_path
        )

        collection_dev, _ = load_ground_truth(self.collection_dev_path)
        collection_eval, _ = load_ground_truth(self.collection_eval_path)

        clips_info_dev = (
            json.load(open(self.clips_info_dev_path, "r"))
            if os.path.exists(self.clips_info_dev_path)
            else None
        )
        clips_info_eval = (
            json.load(open(self.clips_info_eval_path, "r"))
            if os.path.exists(self.clips_info_eval_path)
            else None
        )

        pp_pnp_ratings = (
            json.load(open(self.pp_pnp_ratings_path, "r"))
            if os.path.exists(self.pp_pnp_ratings_path)
            else None
        )

        for clip_id in self.clip_ids:
            if clip_id in clip_ids_dev:
                metadata_index[clip_id] = {
                    "ground_truth": ground_truth_dev[clip_id],
                    "clip_info": clips_info_dev[clip_id],
                    "pp_pnp_ratings": pp_pnp_ratings[clip_id],
                    "collection_labels": collection_dev[clip_id],
                }
            if clip_id in clip_ids_eval:
                metadata_index[clip_id] = {
                    "ground_truth": ground_truth_eval[clip_id],
                    "clip_info": clips_info_eval[clip_id],
                    "pp_pnp_ratings": pp_pnp_ratings[clip_id],
                    "collection_labels": collection_eval[clip_id],
                }

        return metadata_index