Contributing to Soundata

We encourage contributions to soundata, especially new dataset loaders. To contribute a new loader, follow the steps indicated below and create a Pull Request (PR) to the GitHub repository. If you have any questions or comments about your contribution, you can always submit an issue or open a discussion in the repository.

Quick link to contributing templates

If you’re familiar with Soundata’s API already, you can find the template files for contributing here, and the loader checklist for submitting your PR here.

Installing soundata for development purposes

To install Soundata for development purposes:

First, run git clone https://github.com/soundata/soundata.git

Then, from the root of the cloned repository, install all the dependencies:

Install Core dependencies with pip install .

Install Testing dependencies with pip install ."[tests]"

Install Docs dependencies with pip install ."[docs]"

Install Plotting dependencies with pip install ."[plots]"

We recommend using miniconda or pyenv to manage your Python versions and install all soundata requirements. You will want to install the latest supported Python versions (see README.md). Once conda or pyenv and the Python versions are configured, install pytest. Make sure you’ve installed all the pytest plugins (e.g. pytest-cov) needed to test your code automatically.

Before running the tests, make sure to have formatted soundata/ and tests/ with black.

black soundata/ tests/

Also, make sure that they pass the flake8 and mypy checks specified in the lint-python.yml GitHub Actions workflow.

flake8 soundata --count --select=E9,F63,F7,F82 --show-source --statistics
python -m mypy soundata --ignore-missing-imports --allow-subclassing-any

Finally, run:

pytest -vv --cov-report term-missing --cov-report=xml --cov=soundata tests/ --local

All tests should pass!

Note

Soundata assumes that your system has the zip library installed for unzipping files.

Writing a new dataset loader

The steps to add a new dataset loader to soundata are:

Create an index
Create a module
Add tests
Update Soundata documentation
Upload index to Zenodo
Create a Pull Request on GitHub

Before starting, if your dataset is not fully downloadable you should:

Contact the soundata team by opening an issue or PR so we can discuss how to proceed with the closed dataset.
Show that the version used to create the checksum is the “canonical” one, either by getting the version from the dataset creator, or by verifying equivalence with several other copies of the dataset.

To reduce friction, we will make commits on top of contributors’ PRs by default unless the please-do-not-edit flag is used.

1. Create an index

Soundata’s structure relies on indexes. Indexes are dictionaries that describe the structure of a dataset, and Soundata needs them to load and validate it. In particular, indexes contain information about the files included in the dataset, their locations, and their checksums (see the example indexes below). To create an index, the necessary steps are:

Create a script in scripts/, called make_<datasetname>_index.py, which generates an index file.
Then run the script on the canonical version of the dataset and save the index in soundata/datasets/indexes/ as <datasetname>_index.json.
When the dataloader is completed and the PR is accepted, upload the index in our Zenodo community. See more details here.

The script make_<datasetname>_index.py should automate the generation of an index by computing the MD5 checksums of the files of a dataset located at data_path. You can adapt the example below to create an index for your dataset by adding your file paths and using the md5 function to generate the checksums for your files.

Here’s an example of an index script to use as a guide:

Example Make Index Script

import argparse
import glob
import json
import os
from soundata.validate import md5

DATASET_INDEX_PATH = "../soundata/datasets/indexes/dataset_index.json"


def make_dataset_index(dataset_data_path):
    annotation_dir = os.path.join(dataset_data_path, "annotation")
    annotation_files = glob.glob(os.path.join(annotation_dir, "*.lab"))
    clip_ids = sorted([os.path.basename(f).split(".")[0] for f in annotation_files])

    # top-key level metadata
    metadata_checksum = md5(os.path.join(dataset_data_path, "id_mapping.txt"))
    index_metadata = {"metadata": {"id_mapping": ("id_mapping.txt", metadata_checksum)}}

    # top-key level clips
    index_clips = {}
    for clip_id in clip_ids:
        audio_checksum = md5(
            os.path.join(dataset_data_path, "Wavfile/{}.wav".format(clip_id))
        )
        annotation_checksum = md5(
            os.path.join(dataset_data_path, "annotation/{}.lab".format(clip_id))
        )

        index_clips[clip_id] = {
            "audio": ("Wavfile/{}.wav".format(clip_id), audio_checksum),
            "annotation": ("annotation/{}.lab".format(clip_id), annotation_checksum),
        }

    # top-key level version
    dataset_index = {"version": None}

    # combine all in dataset index
    dataset_index.update(index_metadata)
    dataset_index.update({"clips": index_clips})

    with open(DATASET_INDEX_PATH, "w") as fhandle:
        json.dump(dataset_index, fhandle, indent=2)


def main(args):
    make_dataset_index(args.dataset_data_path)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description="Make dataset index file.")
    PARSER.add_argument(
        "dataset_data_path", type=str, help="Path to dataset data folder."
    )

    main(PARSER.parse_args())

More examples of scripts used to create dataset indexes can be found in the scripts folder.

Note

Users should be able to create the dataset indexes without the need for additional dependencies that are not included in soundata by default. Should you need an additional dependency for a specific reason, please open an issue to discuss with the Soundata maintainers the need for it.
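The checksums stored in an index are plain MD5 hashes of each file's bytes. As a minimal sketch of what such a helper does, using only the standard library: the name md5_of_file and the chunk size are illustrative assumptions, not Soundata's implementation. In an index script, keep using soundata.validate.md5 as in the example above.

```python
import hashlib
import os
import tempfile


def md5_of_file(file_path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; not part of soundata.
    """
    hasher = hashlib.md5()
    with open(file_path, "rb") as fhandle:
        # Read fixed-size chunks so large audio files never sit fully in memory
        for chunk in iter(lambda: fhandle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Usage: hash a small temporary file (placeholder bytes, not real audio)
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "clip1.wav")
    with open(example_path, "wb") as fhandle:
        fhandle.write(b"not real audio, just bytes")
    example_checksum = md5_of_file(example_path)

print(example_checksum)  # a 32-character hexadecimal string
```

Because the checksum depends only on file contents, this needs no dependency beyond the Python standard library.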

Example index with clips

Most sound datasets are organized as a collection of clips and annotations. In that case, the index should make use of the clips top-level key. Under this clips top-level key, you should store a dictionary where the keys are the unique clip ids of the dataset, and the values are dictionaries of the files associated with each clip id, along with their checksums. These files can be, for instance, audio files or annotations related to the clip id. File paths are relative to the top-level directory of the dataset.

Note

If your sound dataset does not fit into a structure around the clip class, please open an issue in the GitHub repository to discuss how to proceed. These are corner cases that we address especially to maintain the consistency of the library.

Currently, Soundata does not include built-in functions to automatically create train, test, and validation splits if these are not originally defined in the dataset. Users can do that using external functions such as sklearn.model_selection.train_test_split. If a dataset has predefined splits, you can include the split name as an attribute of the Clip class. You should not create separate indexes for the different splits, or indicate the split in the index. See an example of what an index should look like:

Index Examples - Clips

If the version 1.0 of a given dataset has the structure:

> Example_Dataset/
    > audio/
        clip1.wav
        clip2.wav
        clip3.wav
    > annotations/
        clip1.csv
        clip2.csv
        clip3.csv
    > metadata/
        metadata_file.csv

The top level directory is Example_Dataset and the relative path for clip1.wav would be audio/clip1.wav. Any unavailable fields are indicated with null. A possible index file for this example would be:

{
    "version": "1.0",
    "clips": {
        "clip1": {
            "audio": [
                "audio/clip1.wav",  // the relative path for clip1's audio file
                "912ec803b2ce49e4a541068d495ab570"  // clip1.wav's md5 checksum
            ],
            "annotation": [
                "annotations/clip1.csv",  // the relative path for clip1's annotation
                "2cf33591c3b28b382668952e236cccd5"  // clip1.csv's md5 checksum
            ]
        },
        "clip2": {
            "audio": [
                "audio/clip2.wav",
                "65d671ec9787b32cfb7e33188be32ff7"
            ],
            "annotation": [
                "annotations/Clip2.csv",
                "e1964798cfe86e914af895f8d0291812"
            ]
        },
        "clip3": {
            "audio": [
                "audio/clip3.wav",
                "60edeb51dc4041c47c031c4bfb456b76"
            ],
            "annotation": [
                "annotations/clip3.csv",
                "06cb006cc7b61de6be6361ff904654b3"
            ]
        }
    },
    "metadata": {
        "metadata_file": [
            "metadata/metadata_file.csv",
            "7a41b280c7b74e2ddac5184708f9525b"
        ]
    }
}

Note

In this example there is a (purposeful) mismatch between the name of the audio file clip2.wav and its corresponding annotation file, Clip2.csv, compared with the other pairs. This mismatch should be included in the index. This type of slight difference in filenames happens often in publicly available datasets, making pairing audio and annotation files more difficult. We use a fixed, version-controlled index to account for this kind of mismatch, rather than relying on string parsing on load.
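To illustrate why a fixed index makes validation straightforward, here is a minimal sketch that checks every file listed under the clips key against its stored checksum. The helpers file_md5 and check_index are hypothetical names used only for this illustration (Soundata ships its own validation code, which may work differently), and the toy files are placeholder bytes rather than real audio.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Hex MD5 digest of a (small) file."""
    with open(path, "rb") as fhandle:
        return hashlib.md5(fhandle.read()).hexdigest()


def check_index(index, data_home):
    """Return (missing, invalid) lists of relative paths for the "clips" entries."""
    missing, invalid = [], []
    for files in index["clips"].values():
        for rel_path, checksum in files.values():
            if rel_path is None:  # unavailable fields are stored as null
                continue
            full_path = os.path.join(data_home, rel_path)
            if not os.path.exists(full_path):
                missing.append(rel_path)
            elif file_md5(full_path) != checksum:
                invalid.append(rel_path)
    return missing, invalid


with tempfile.TemporaryDirectory() as data_home:
    # Toy dataset reproducing the clip2.wav / Clip2.csv naming mismatch
    toy_files = {
        "audio/clip2.wav": b"placeholder audio bytes",
        "annotations/Clip2.csv": b"0.0,1.0,dog_bark\n",
    }
    for rel_path, content in toy_files.items():
        full_path = os.path.join(data_home, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb") as fhandle:
            fhandle.write(content)

    index = {
        "version": "1.0",
        "clips": {
            "clip2": {
                "audio": [
                    "audio/clip2.wav",
                    file_md5(os.path.join(data_home, "audio/clip2.wav")),
                ],
                "annotation": [
                    "annotations/Clip2.csv",
                    file_md5(os.path.join(data_home, "annotations/Clip2.csv")),
                ],
            }
        },
    }
    result_ok = check_index(index, data_home)

    # Changing a file's contents makes its checksum stale
    with open(os.path.join(data_home, "audio/clip2.wav"), "wb") as fhandle:
        fhandle.write(b"different bytes")
    result_changed = check_index(index, data_home)

print(result_ok)       # ([], [])
print(result_changed)  # ([], ['audio/clip2.wav'])
```

The annotation path is read from the index as written, so no filename parsing is needed to pair clip2.wav with Clip2.csv.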

2. Create a module

Once the index is created you can create the loader. For that, we suggest you use the following template and adjust it for your dataset. To quickstart a new module:

Copy the example below and save it to soundata/datasets/<your_dataset_name>.py
Find & Replace Example with the <your_dataset_name>.
Remove any lines beginning with # – which are there as guidelines.

You should follow the provided template as much as possible, and use the recommended functions and classes.

Example Module

"""Example Dataset Loader

.. admonition:: Dataset Info
    :class: dropdown

    Please include the following information at the top level docstring for the dataset's module `dataset.py`:

    1. Describe annotations included in the dataset
    2. Indicate the size of the dataset (e.g. number of files and duration in hours)
    3. Mention the origin of the dataset (e.g. creator, institution)
    4. Indicate any relevant papers related to the dataset
    5. Include a description about how the data can be accessed and the license it uses (if applicable)
    6. Indicate the dataset version

"""
import os
import csv
import json


import librosa
import numpy as np
# -- import whatever you need here and remove
# -- example imports you won't use

from soundata import download_utils, jams_utils, core, annotations, io

# -- Add any relevant citations here
BIBTEX = """
@article{article-minimal,
  author = "L[eslie] B. Lamport",
  title = "The Gnats and Gnus Document Preparation System",
  journal = "G-Animal's Journal",
  year = "1986"
}
"""
# -- INDEXES specifies different versions of a dataset
# -- "default" and "test" specify which key should be used by default, and when running tests
# -- Each index is defined by {"version": core.Index instance}
# -- | filename: index name
# -- | url: Zenodo direct download link of the index (will be available after the index upload is
# -- accepted to Audio Data Loaders Zenodo community).
# -- | checksum: Checksum of the index hosted at Zenodo.
# -- Direct url for download and checksum can be found in the Zenodo entry of the dataset.
# -- Sample index is a mini-version that makes it easier to test a large dataset.
# -- There must be a local sample index for testing for each remote index.
INDEXES = {
    "default": "1.0",
    "test": "sample",
    "1.0": core.Index(
        filename="urbansound8k_index_1.0.json",
        url="https://zenodo.org/records/11176928/files/urbansound8k_index_1.0.json?download=1",
        checksum="1c4940e08c1305c49b592f3d9c103e6f",
    ),
    "sample": core.Index(filename="urbansound8k_index_1.0_sample.json"),
}
# -- REMOTES is a dictionary containing all files that need to be downloaded.
# -- The keys should be descriptive (e.g. 'annotations', 'audio').
# -- When having data that can be partially downloaded, remember to set up
# -- correctly destination_dir to download the files following the correct structure.
REMOTES = {
    'remote_data': download_utils.RemoteFileMetadata(
        filename='a_zip_file.zip',
        url='http://website/hosting/the/zipfile.zip',
        checksum='00000000000000000000000000000000',  # -- the md5 checksum
        destination_dir='path/to/unzip' # -- relative path for where to unzip the data, or None
    ),
}

# -- Include any information that should be printed when downloading
# -- remove this variable if you don't need to print anything during download
DOWNLOAD_INFO = """
Include any information you want to be printed when dataset.download() is called.
These can be instructions for how to download the dataset (e.g. request access on zenodo),
caveats about the download, etc
"""

# -- Include the dataset's license information
LICENSE_INFO = """
The dataset's license information goes here.
"""


class Clip(core.Clip):
    """Example Clip class
    
    # -- YOU CAN AUTOMATICALLY GENERATE THIS DOCSTRING BY CALLING THE SCRIPT:
    # -- `scripts/print_track_docstring.py my_dataset`
    # -- note that you'll first need to have a test clip (see "Adding tests to your dataset" below)

    Args:
        clip_id (str): clip id of the clip

    Attributes:
        clip_id (str): clip id
        # -- Add any of the dataset specific attributes here

    """
    def __init__(self, clip_id, data_home, dataset_name, index, metadata):
        
        # -- this sets the following attributes:
        # -- * clip_id
        # -- * _dataset_name
        # -- * _data_home
        # -- * _clip_paths
        # -- * _clip_metadata
        super().__init__(
            clip_id,
            data_home,
            dataset_name=dataset_name,
            index=index,
            metadata=metadata,
        )
        
        # -- add any dataset specific attributes here
        self.audio_path = self.get_path("audio")
        self.annotation_path = self.get_path("annotation")

    # -- `annotation` will behave like an attribute, but it will only be loaded
    # -- and saved when someone accesses it. Useful when loading slightly
    # -- bigger files or for bigger datasets. By default, we make any time
    # -- series data loaded from a file a cached property
    @core.cached_property
    def annotation(self):
        """output type: description of output"""
        return load_annotation(self.annotation_path)

    # -- `audio` will behave like an attribute, but it will only be loaded
    # -- when someone accesses it and it won't be stored. By default, we make
    # -- any memory heavy information (like audio), properties
    @property
    def audio(self):
        """(np.ndarray, float): DESCRIPTION audio signal, sample rate"""
        return load_audio(self.audio_path)

    # -- we use the to_jams function to convert all the annotations in the JAMS format.
    # -- The converter takes as input all the annotations in the proper format (e.g. tags)
    # -- and returns a jams object with the annotations.
    def to_jams(self):
        """Jams: the clip's data in jams format"""
        return jams_utils.jams_converter(
            audio_path=self.audio_path,
            annotation_data=[(self.annotation, None)],
            metadata=self._metadata,
        )
        # -- see the documentation for `jams_utils.jams_converter` for all fields

@io.coerce_to_bytes_io
def load_audio(fhandle):
    """Load an Example audio file

    Args:
        fhandle (str or file-like): path or file-like object pointing to an audio file

    Returns:
        * np.ndarray - the audio signal
        * float - The sample rate of the audio file
    """
    # -- for example, the code below. This should be dataset specific!
    # -- By default we load to mono
    # -- change this if it doesn't make sense for your dataset.
    return librosa.load(fhandle, sr=None, mono=True)


# -- Write any necessary loader functions for loading the dataset's data
@io.coerce_to_string_io
def load_annotation(fhandle):

    # -- if there are some file paths for this annotation type in this dataset's
    # -- index that are None/null, uncomment the lines below.
    # if fhandle is None:
    #     return None

    reader = csv.reader(fhandle, delimiter=' ')
    intervals = []
    annotation = []
    for line in reader:
        intervals.append([float(line[0]), float(line[1])])
        annotation.append(line[2])

    annotation_data = annotations.EventData(
        np.array(intervals), np.array(annotation)
    )
    return annotation_data

# -- use this decorator so the docs are complete (i.e. they are inherited from the parent class)
@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
    """The Example dataset"""

    def __init__(self, data_home=None, version="default"):
        super().__init__(
            data_home,
            version,
            name='dataset_name',
            clip_class=Clip,
            bibtex=BIBTEX,
            indexes=INDEXES,
            remotes=REMOTES,
            download_info=DOWNLOAD_INFO,
            license_info=LICENSE_INFO,
        )

    # -- Copy any loader functions you wrote that should be part of the Dataset class
    # -- use this core.copy_docs decorator to copy the docs from the parent class
    # -- load_ function
    @core.copy_docs(load_audio)
    def load_audio(self, *args, **kwargs):
        return load_audio(*args, **kwargs)

    @core.copy_docs(load_annotation)
    def load_annotation(self, *args, **kwargs):
        return load_annotation(*args, **kwargs)

    # -- if your dataset has a top-level metadata file, write a loader for it here
    # -- you do not have to include this function if there is no metadata 
    @core.cached_property
    def _metadata(self):

        # load metadata however makes sense for your dataset
        metadata_path = os.path.join(self.data_home, 'example_metadata.json')
        with open(metadata_path, 'r') as fhandle:
            metadata = json.load(fhandle)

        return metadata

    # -- if your dataset needs to overwrite the default download logic, do it here.
    # -- this function is usually not necessary unless you need very custom download logic
    def download(
        self, partial_download=None, force_overwrite=False, cleanup=False
    ):
        """Download the dataset

        Args:
            partial_download (list or None):
                A list of keys of remotes to partially download.
                If None, all data is downloaded
            force_overwrite (bool):
                If True, existing files are overwritten by the downloaded files. 
            cleanup (bool):
                Whether to delete any zip/tar files after extracting.

        Raises:
            ValueError: if invalid keys are passed to partial_download
            IOError: if a downloaded file's checksum is different from expected

        """
        # see download_utils.downloader for basic usage - if you only need to call downloader
        # once, you do not need this function at all.
        # only write a custom function if you need it!

You may find these examples useful as references:

Declare constant variables

Please include the variables BIBTEX, INDEXES, REMOTES, and LICENSE_INFO at the beginning of your module. BIBTEX (the BibTeX-formatted citation of the dataset), INDEXES (index URLs, checksums, and versions), and LICENSE_INFO (the license under which the dataset is distributed) are mandatory, while REMOTES is only defined if the dataset is openly downloadable.

INDEXES

As seen in the example, we have two ways to define an index: providing a URL to download the index file, or by providing the filename of the index file, assuming it is available locally (like sample indexes).

The full indexes for each version of the dataset should be retrieved from our Zenodo community. See more details here.
The sample indexes should be locally stored in the tests/indexes/ folder, and directly accessed through filename. See more details here.

Important: We recommend setting the highest version of the dataset as the default version in the INDEXES variable. However, if there is a reason for having a different version as the default, please do so.
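To make the role of the "default" and "test" keys concrete, here is a minimal sketch of how such aliases can be resolved to a concrete version key. It is an illustration only: resolve_version and EXAMPLE_INDEXES are hypothetical names, plain filename strings stand in for core.Index objects, and Soundata's own resolution logic may differ.

```python
# Plain filename strings stand in for core.Index objects in this sketch.
EXAMPLE_INDEXES = {
    "default": "1.0",
    "test": "sample",
    "1.0": "example_index_1.0.json",
    "sample": "example_index_1.0_sample.json",
}


def resolve_version(indexes, version="default"):
    """Map an alias ("default" or "test") to a concrete version key."""
    if version in ("default", "test"):
        version = indexes[version]
    if version not in indexes:
        raise ValueError(f"Unknown version: {version}")
    return version


default_version = resolve_version(EXAMPLE_INDEXES)
test_version = resolve_version(EXAMPLE_INDEXES, "test")
print(default_version, test_version)  # 1.0 sample
```

With this layout, changing the default version of a loader only requires editing the "default" entry.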

REMOTES

REMOTES should be a dictionary of RemoteFileMetadata objects, which are used to download the dataset files. See an example below:

REMOTES = {
    "all": download_utils.RemoteFileMetadata(
        filename="UrbanSound8K.tar.gz",
        url="https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz?download=1",
        checksum="9aa69802bbf37fb986f71ec1483a196e",
        unpack_directories=["UrbanSound8K"],
    ),
}

Add more RemoteFileMetadata objects to the REMOTES dictionary if the dataset is split into multiple files. Please use download_utils.RemoteFileMetadata to retrieve the dataset from an online repository; it takes care of the download process and the checksum validation, and addresses corner cases. Please do NOT use specific functions like download_zip_file or download_and_extract individually in your loader.

Note

The direct download URL and the checksum can be found in the Zenodo entries of the dataset and the index. Bear in mind that the URL and checksum for the index will only be available once a maintainer of the Audio Data Loaders Zenodo community has accepted the index upload. For other repositories, you may need to generate the checksum yourself. You may use the md5 function provided in soundata/validate.py.
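When REMOTES holds several entries, descriptive keys let users request only part of the data through the partial_download argument described in the template's download docstring. The sketch below shows one way such keys could be checked. select_remotes is a hypothetical helper, and placeholder URL strings stand in for download_utils.RemoteFileMetadata objects; it is not Soundata's downloader.

```python
# Placeholder URLs stand in for download_utils.RemoteFileMetadata objects.
EXAMPLE_REMOTES = {
    "audio": "http://website/hosting/the/audio.zip",
    "annotations": "http://website/hosting/the/annotations.zip",
}


def select_remotes(remotes, partial_download=None):
    """Return the subset of remotes to fetch, or all of them if None."""
    if partial_download is None:
        return dict(remotes)
    unknown = [key for key in partial_download if key not in remotes]
    if unknown:
        raise ValueError(
            f"partial_download contains invalid keys {unknown}; "
            f"valid keys are {sorted(remotes)}"
        )
    return {key: remotes[key] for key in partial_download}


only_annotations = select_remotes(EXAMPLE_REMOTES, ["annotations"])
print(list(only_annotations))  # ['annotations']
```

Rejecting unknown keys early mirrors the ValueError documented for partial_download in the template.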

Document your loader

Make sure that the docstring of the dataloader covers the following aspects of the dataset you are integrating:

The dataset name.
A general purpose description, the task it is used for.
Details about the coverage: how many clips, how many hours of audio, how many classes, the annotations available, etc.
The license of the dataset (even if you have included the LICENSE_INFO variable already).
The authors of the dataset, the organization in which it was created, and the year of creation (even if you have included the BIBTEX variable already).
Please reference also any relevant link or website that users can check for more information.

Note

In addition to the module docstring, you should write docstrings for every new class and function you write. See the documentation tutorial for practical information on best documentation practices.

This docstring is important for users to understand the dataset and its purpose. Proper documentation also enhances transparency. Please do not include complicated tables, big blocks of text, or unformatted copy-pasted text. It is important that the docstring is clean and the information is clear to users. This will also encourage users to use the dataloader!

For many more examples, see the datasets folder.

Note

If the dataset you are trying to integrate stores every clip in a separate compressed file, it cannot currently be supported by soundata. Feel free to open an issue to discuss a solution (hopefully for the near future!)

3. Add tests

To finish your contribution, please include tests that check the integrity of your loader. For this, follow these steps:

Make a toy version of the dataset in the tests folder tests/resources/sound_datasets/my_dataset/, so you can test against little data. For example:
- Include all audio and annotation files for one clip of the dataset.
- For each audio/annotation file, reduce the audio length to 1-2 seconds and remove all but a few of the annotations.
- If the dataset has a metadata file, reduce the length to a few lines.
Create a toy index corresponding to the one-clip toy dataset in the tests folder tests/indexes/. Some further detail:
- The index should include only the clips you need for the toy dataset for testing.
- The index should be named <dataset-id>_index_<dataset-version>_sample.json. The version in the JSON file should also be sample.
- Include this index in the INDEXES variable in your dataloader module.
- Then, when testing your dataset, pass version='test' when you initialize it.
Test all of the dataset specific code, e.g. the public attributes of the Clip class, the load functions and any other custom functions you wrote. See the tests folder for reference.
Locally run pytest -s tests/test_full_dataset.py --local --dataset my_dataset before submitting your loader to make sure everything is working.
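The toy index can be derived from the full index, as long as the checksums are recomputed for the trimmed toy files (their contents differ from the originals). The following is a minimal sketch under those assumptions: make_sample_index, sample_index_filename, and file_md5 are hypothetical helpers, not part of soundata, and the toy file is placeholder bytes.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Hex MD5 digest of a (small) file."""
    with open(path, "rb") as fhandle:
        return hashlib.md5(fhandle.read()).hexdigest()


def make_sample_index(full_index, clip_ids, toy_data_home):
    """Build a sample index for the toy dataset stored at toy_data_home."""

    def rehash(entry):
        rel_path = entry[0]
        if rel_path is None:  # unavailable fields stay null
            return [None, None]
        return [rel_path, file_md5(os.path.join(toy_data_home, rel_path))]

    sample = {"version": "sample", "clips": {}}
    for clip_id in clip_ids:
        files = full_index["clips"][clip_id]
        sample["clips"][clip_id] = {
            key: rehash(entry) for key, entry in files.items()
        }
    if "metadata" in full_index:
        sample["metadata"] = {
            key: rehash(entry) for key, entry in full_index["metadata"].items()
        }
    return sample


def sample_index_filename(dataset_id, dataset_version):
    """Follow the <dataset-id>_index_<dataset-version>_sample.json convention."""
    return f"{dataset_id}_index_{dataset_version}_sample.json"


full_index = {
    "version": "1.0",
    "clips": {
        "clip1": {"audio": ["audio/clip1.wav", "912ec803b2ce49e4a541068d495ab570"]},
        "clip2": {"audio": ["audio/clip2.wav", "65d671ec9787b32cfb7e33188be32ff7"]},
    },
}

with tempfile.TemporaryDirectory() as toy_data_home:
    os.makedirs(os.path.join(toy_data_home, "audio"))
    with open(os.path.join(toy_data_home, "audio", "clip1.wav"), "wb") as fhandle:
        fhandle.write(b"trimmed audio bytes")
    sample_index = make_sample_index(full_index, ["clip1"], toy_data_home)

print(sample_index_filename("example", "1.0"))  # example_index_1.0_sample.json
```

The resulting dictionary can then be written to tests/indexes/ with json.dump.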

Warning

The test_full_dataset won’t pass unless you add the checksum of the main index in the INDEXES variable. The checksum is automatically computed when uploading the index to Zenodo, but at this point, you can compute the checksum using the function soundata.validate.md5(), passing the path to the index file as an argument. The checksum should be added to the INDEXES variable, specifically as argument checksum in the core.Index object of the main index.

Note

We have written automated tests for all loaders’ cite, download, validate, load, clip_ids functions, as well as some basic edge cases of the Clip class, so you don’t need to write tests for these!

Example Test File

import numpy as np

from soundata import annotations
from soundata.datasets import example  # the name of your loader here
from tests.test_utils import run_clip_tests

TEST_DATA_HOME = "tests/resources/sound_datasets/example"

def test_clip():
    default_clipid = "some_id"
    dataset = example.Dataset(TEST_DATA_HOME, version="test")
    clip = dataset.clip(default_clipid)

    expected_attributes = {
        "clip_id": "some_id",
        "audio_path": "tests/resources/sound_datasets/example/audio/some_id.wav",
        "annotation_path": "tests/resources/sound_datasets/example/annotation/some_id.pv",
    }

    # List here all the properties of your loader
    expected_property_types = {"tags": annotations.Tags,
                               "some_other_annotation": "some_annotation_type"}

    run_clip_tests(clip, expected_attributes, expected_property_types)

# Test all the load functions, for instance, the load audio one
def test_load_audio():
    dataset = example.Dataset(TEST_DATA_HOME, version="test")
    clip = dataset.clip("some_id")
    audio_path = clip.audio_path
    audio, sr = example.load_audio(audio_path)
    assert sr == 44100
    assert type(audio) is np.ndarray
    assert len(audio.shape) == 1  # check audio is loaded e.g. as mono
    assert audio.shape[0] == 44100  # Check audio duration in samples is as expected


def test_to_jams():

    default_clipid = "some_id"
    data_home = "tests/resources/sound_datasets/example"
    dataset = example.Dataset(data_home, version="test")
    clip = dataset.clip(default_clipid)
    jam = clip.to_jams()

    annotations = jam.search(namespace="annotation")[0]["data"]
    assert [annotation.time for annotation in annotations] == [0.027, 0.232]
    assert [annotation.duration for annotation in annotations] == [
        0.20500000000000002,
        0.736,
    ]
    # ... etc

# Test each of the load functions (e.g. Tags, etc)
def test_load_annotation():
    # load a file which exists
    annotation_path = "tests/resources/sound_datasets/example/annotation/some_id.pv"
    annotation_data = example.load_annotation(annotation_path)

    # check types
    assert type(annotation_data) == "some_annotation_type"
    assert type(annotation_data.times) is np.ndarray
    # ... etc

    # check values
    assert np.array_equal(annotation_data.times, np.array([0.016, 0.048]))
    # ... etc


def test_metadata():
    data_home = "tests/resources/sound_datasets/example"
    dataset = example.Dataset(data_home, version="test")
    metadata = dataset._metadata
    assert metadata["some_id"] == "something"

Running your tests locally

Before creating a PR you should run the tests. But before that, make sure to have formatted soundata/ and tests/ with black.

black soundata/ tests/

Also, make sure that they pass the flake8 and mypy checks specified in the lint-python.yml GitHub Actions workflow.

flake8 soundata --count --select=E9,F63,F7,F82 --show-source --statistics
python -m mypy soundata --ignore-missing-imports --allow-subclassing-any

Finally, run all the tests locally like this:

pytest -vv --cov-report term-missing --cov-report=xml --cov=soundata tests/ --local

The --local flag skips tests that are built to run only on the remote testing environment.

To run one specific test file:

pytest tests/test_urbansed.py

Finally, there is one local test you should run, which we can’t easily run in our testing environment.

pytest -s tests/test_full_dataset.py --local --dataset dataset

Where dataset is the name of the module of the dataset you added. The -s tells pytest not to skip print statements, which is useful here for seeing the download progress bar when testing the download function.

This tests that your dataset downloads, validates, and loads properly for every clip. This test takes a long time for some datasets, but it’s important to ensure the integrity of the library.

The --skip-download flag can be added to the pytest command to run the tests without the downloading step. Note that this is just for convenience during debugging; the tests should eventually all pass without this flag.

Working with big datasets

When developing a loader for a large dataset, it is advisable to work with an index that is as small as possible, which speeds up both implementing the loader and running the tests.

Reducing the testing space usage

We are trying to keep the test resources folder size as small as possible, because it can get really heavy as new loaders are added. We kindly ask the contributors to reduce the size of the testing data if possible (e.g. trimming the audio clips, keeping just two rows for csv files).
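As one way to trim test audio, the sketch below shortens a WAV file using only the standard library wave module. It assumes uncompressed PCM WAV files; other formats would need a different tool. trim_wav is a hypothetical helper, and the three-second silent file is generated only so the example runs on its own.

```python
import os
import tempfile
import wave


def trim_wav(source_path, target_path, max_seconds=1.0):
    """Copy at most max_seconds of audio from source_path to target_path."""
    with wave.open(source_path, "rb") as source:
        params = source.getparams()
        n_frames = min(
            source.getnframes(), int(max_seconds * source.getframerate())
        )
        frames = source.readframes(n_frames)
    with wave.open(target_path, "wb") as target:
        # The frame count in the header is corrected when the file is closed
        target.setparams(params)
        target.writeframes(frames)


with tempfile.TemporaryDirectory() as tmp_dir:
    long_path = os.path.join(tmp_dir, "long.wav")
    short_path = os.path.join(tmp_dir, "short.wav")

    # Generate 3 seconds of 16-bit mono silence at 8 kHz as stand-in audio
    with wave.open(long_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(8000)
        wav_file.writeframes(b"\x00\x00" * 8000 * 3)

    trim_wav(long_path, short_path, max_seconds=1.0)

    with wave.open(short_path, "rb") as wav_file:
        trimmed_frames = wav_file.getnframes()
        trimmed_rate = wav_file.getframerate()

print(trimmed_frames / trimmed_rate)  # 1.0
```

Remember that trimming changes a file's contents, so the checksums in the sample index must be computed from the trimmed files.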

4. Update Soundata documentation

Make sure to include your module info in the following files:

Add your module to docs/source/soundata.rst following an alphabetical order.
Add your module to docs/source/table.rst following an alphabetical order as follows:

* - Dataset
  - Downloadable?
  - Annotations
  - Clips
  - Hours
  - Usecase
  - License

An example of this for the UrbanSound8k dataset:

* - UrbanSound8K
  - - audio: ✅
    - annotations: ✅
  - :ref:`tags`
  - 8732
  - 8.75
  - Urban sound classification
  - .. image:: https://licensebuttons.net/l/by-nc/4.0/80x15.png
       :target: https://creativecommons.org/licenses/by-nc/4.0

You can find license badges images and links here.

5. Uploading the index to Zenodo

We store all dataset indexes in an online repository on Zenodo. To use a dataloader, users may retrieve the index running the dataset.download() function that is also used to download the dataset. To download only the index, you may run .download(["index"]). The index will be automatically downloaded and stored in the expected folder in Soundata.

From a contributor’s point of view, you may create the index, store it locally, and develop the dataloader. All JSON files in soundata/datasets/indexes/ are listed in the .gitignore file, so there is no need to remove the index before pushing to the remote branch during development: git will ignore it.

Important! When creating the PR, please submit your index to our Zenodo community:

First, click on New upload.
Add your index in the Upload files section.
Let Zenodo create a DOI for your index: when asked whether you already have a DOI, click No.
Resource type is Other.
Title should be soundata-<dataset-id>_index_<version>, e.g. soundata-tau2021sse_nigens_index_1.2.0.
Add yourself as the Creator of this entry.
The license of the index should be the same as Soundata's.
Visibility should be set as Public.

Note

<dataset-id> is the identifier we use to initialize the dataset using soundata.initialize(). It’s also the filename of your dataset module.
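To avoid typos in the entry title, the naming convention above can be expressed as a one-line helper. This is only an illustration; zenodo_index_title is a hypothetical name and not a soundata function.

```python
def zenodo_index_title(dataset_id, version):
    """Build the Zenodo entry title for a dataset index.

    dataset_id is the name passed to soundata.initialize();
    version is the index version string.
    """
    return f"soundata-{dataset_id}_index_{version}"


print(zenodo_index_title("tau2021sse_nigens", "1.2.0"))
# soundata-tau2021sse_nigens_index_1.2.0
```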

6. Create a Pull Request

Please create a Pull Request with all your development. When starting your PR, please use the new_loader.md template; it simplifies the review process and helps you make a complete PR. You can do that by adding &template=new_loader.md at the end of the URL when you are creating the PR:

...soundata/soundata/compare?expand=1 will become ...soundata/soundata/compare?expand=1&template=new_loader.md.
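If you prefer not to edit the URL by hand, the same change can be made with the standard library. The sketch below is illustrative: with_pr_template is a hypothetical helper, and the example.org address is a placeholder standing in for your own compare URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_pr_template(compare_url, template="new_loader.md"):
    """Return compare_url with the ``template`` query parameter set."""
    parts = urlsplit(compare_url)
    query = dict(parse_qsl(parts.query))
    query["template"] = template  # add or overwrite the template parameter
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL; substitute the compare URL of your own branch.
print(with_pr_template("https://example.org/owner/repo/compare?expand=1"))
# https://example.org/owner/repo/compare?expand=1&template=new_loader.md
```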

Troubleshooting

If GitHub shows a red X next to your latest commit, it means one of our checks is not passing. This could mean:

running black has failed – this means that your code is not formatted according to black’s code-style. To fix this, simply run the following from inside the top level folder of the repository:

black soundata/ tests/

your code does not pass the flake8 test. To reproduce the check locally, run:

flake8 soundata --count --select=E9,F63,F7,F82 --show-source --statistics

your code does not pass the mypy test. To reproduce the check locally, run:

python -m mypy soundata --ignore-missing-imports --allow-subclassing-any

the test coverage is too low – this means that there are too many new lines of code introduced that are not tested.
the docs build has failed – this means that one of the changes you made to the documentation has caused the build to fail. Check the formatting in your changes and make sure they are consistent.
the tests have failed – this means at least one of the tests is failing. Run the tests locally to make sure they are passing. If they are passing locally but failing in the check, open an issue and we can help debug.

Documentation

This documentation is in rst format. It is built using Sphinx and hosted on readthedocs. The API documentation is built using autodoc, which autogenerates documentation from the code’s docstrings. We use the napoleon plugin for building docs in Google docstring style. See the next section for docstring conventions.

Docstring conventions

soundata uses Google’s Docstring formatting style. Here are some common examples.

Note

The small formatting details in these examples are important. Differences in new lines, indentation, and spacing make a difference in how the documentation is rendered. For example writing Returns: will render correctly, but Returns or Returns : will not.

Functions:

def add_to_list(list_of_numbers, scalar):
    """Add a scalar to every element of a list.
    You can write a continuation of the function description here on the next line.

    You can optionally write more about the function here. If you want to add an example
    of how this function can be used, you can do it like below.

    Example:
        .. code-block:: python

            foo = add_to_list([1, 2, 3], 2)

    Args:
        list_of_numbers (list): A short description that fits on one line.
        scalar (float):
            Description of the second parameter. If there is a lot to say you can
            overflow to a second line.

    Returns:
        list: Description of the return. The type here is not in parentheses

    """
    return [x + scalar for x in list_of_numbers]

Functions with more than one return value:

def multiple_returns():
    """This function has no arguments, but more than one return value. Autodoc with napoleon doesn't handle this well,
    and we use this formatting as a workaround.

    Returns:
        * int - the first return value
        * bool - the second return value

    """
    return 42, True

One-line docstrings:

def some_function():
    """
    One line docstrings must be on their own separate line, or autodoc does not build them properly
    """
    ...

Objects:

"""Description of the class
overflowing to a second line if it's long

Some more details here

Args:
    foo (str): First argument to the __init__ method
    bar (int): Second argument to the __init__ method

Attributes:
    foobar (str): First clip attribute
    barfoo (bool): Second clip attribute

Cached Properties:
    foofoo (list): Cached properties are special soundata attributes
    barbar (None): They are lazy loaded properties.
    barf (bool): Document them with this special header.

"""

Documenting your contribution

Staged docs for every new PR are built and accessible at soundata--<#PR_ID>.org.readthedocs.build/en/<#PR_ID>/, where <#PR_ID> is the pull request ID. To quickly troubleshoot any issues, you can build the docs locally by navigating to the docs folder and running make clean html (note that you must have Sphinx installed). Then open the generated soundata/docs/_build/source/index.html file in your web browser to view the result.

Important: Make sure to check the WARNING and ERROR messages that may show up in the terminal when running make clean html. These indicate formatting, listing, and indentation problems in your docstrings that need to be fixed for the documentation to render properly. See the examples above and the docstrings of docs/source/contributing_examples/example.py for examples of how to write docstrings that avoid Sphinx errors and warnings.

Conventions

Loading from files

We use the following libraries for loading data from files:

Format	Library
audio (wav, mp3, …)	librosa
json	json
csv	csv
jams	jams

Clip Attributes

Custom clip attributes should be global, clip-level data. For some datasets, there is a separate, dataset-level metadata file with clip-level metadata, e.g. as a csv. When a single file is needed for more than one clip, we recommend writing a _metadata cached property (which returns a dictionary, either keyed by clip_id or freeform) in the Dataset class (see the dataset module example code above). When this is specified, it populates a clip's hidden _clip_metadata field, which can be accessed from the clip class.

For example, if _metadata returns a dictionary of the form:

{
    'clip1': {
        'microphone-type': 'Awesome',
        'recording-date': '27.10.2021'
    },
    'clip2': {
        'microphone-type': 'Less_awesome',
        'recording-date': '27.10.2021'
    }
}

the _clip_metadata for clip_id=clip2 will be:

{
    'microphone-type': 'Less_awesome',
    'recording-date': '27.10.2021'
}
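The lookup described above can be pictured with the following minimal sketch. The classes here are simplified, hypothetical stand-ins and not soundata's actual Dataset and Clip implementations; they only illustrate how a dataset-level dictionary keyed by clip_id can be narrowed to a per-clip dictionary.

```python
class Clip:
    """Hypothetical stand-in for a loader's Clip class."""

    def __init__(self, clip_id, metadata):
        self.clip_id = clip_id
        # Keep only this clip's entry; clips missing from the metadata get {}.
        self._clip_metadata = metadata.get(clip_id, {})

    @property
    def microphone_type(self):
        # Returns None when the field is not available for this clip.
        return self._clip_metadata.get("microphone-type")


class Dataset:
    """Hypothetical stand-in for a loader's Dataset class."""

    def __init__(self):
        # In a real loader this would be a cached property parsing a file.
        self._metadata = {
            "clip1": {"microphone-type": "Awesome", "recording-date": "27.10.2021"},
            "clip2": {"microphone-type": "Less_awesome", "recording-date": "27.10.2021"},
        }

    def clip(self, clip_id):
        return Clip(clip_id, self._metadata)


print(Dataset().clip("clip2").microphone_type)  # Less_awesome
```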

Load methods vs Clip properties

Clip properties and cached properties should be simple and directly call a load_* method, as in this example from urbansed:

@property
def split(self):
    """The data splits (e.g. train)

    Returns:
        * str - split

    """
    return self._clip_metadata.get("split")

@core.cached_property
def events(self) -> Optional[annotations.Events]:
    """The audio events

    Returns:
        * annotations.Events - audio event object

    """
    return load_events(self.txt_path)

A clip property or cached property should contain no additional logic; all logic belongs in the load method. We separate these because the clip properties are only usable when data is available locally; when data is remote, the load methods are used instead.

Missing Data

Clip properties that are available for some clips and not for others should be set to None when they are not available, as in this example from the tau2019aus loader:

@property
def tags(self):
    scene_label = self._clip_metadata.get("scene_label")
    if scene_label is None:
        return None
    else:
        return annotations.Tags([scene_label], "open", np.array([1.0]))

The index should only contain key-values for files that exist.
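One way to honour this rule when generating an index is to check each candidate file before adding it. The sketch below is illustrative only: clip_entry and md5 are hypothetical helpers, and the entry layout (a relative path paired with an md5 checksum) is an assumption made for this example.

```python
import hashlib
import os


def md5(path):
    """Return the md5 checksum of a file as a hex string."""
    with open(path, "rb") as fhandle:
        return hashlib.md5(fhandle.read()).hexdigest()


def clip_entry(data_home, candidate_files):
    """Build one clip's index entry, skipping files that do not exist.

    candidate_files maps a key (e.g. "audio") to a path relative to data_home.
    """
    entry = {}
    for key, rel_path in candidate_files.items():
        full_path = os.path.join(data_home, rel_path)
        if os.path.exists(full_path):
            # Assumed layout: [relative path, checksum]
            entry[key] = [rel_path, md5(full_path)]
    return entry
```

With this approach, a clip that has audio but no annotation file simply ends up without an annotation key, rather than with a dangling path.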

Custom Decorators

cached_property

This is used primarily for Clip classes.

This decorator causes an object's method to behave like an attribute (i.e., like the @property decorator), but caches the value in memory after it is first accessed. This is used for data which is relatively large and loaded from files.
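To illustrate the caching behaviour, here is a minimal sketch of such a decorator. It is not soundata's implementation; simple_cached_property and the Clip class below are hypothetical names used only for this example.

```python
class simple_cached_property:
    """Illustrative caching descriptor: compute once, then reuse the value."""

    def __init__(self, func):
        self.func = func
        self.__doc__ = func.__doc__

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.func(obj)
        # Store the result on the instance; later lookups find it there
        # and never reach this descriptor again.
        obj.__dict__[self.name] = value
        return value


class Clip:
    def __init__(self):
        self.load_calls = 0

    @simple_cached_property
    def events(self):
        """Pretend to load a large annotation file."""
        self.load_calls += 1
        return ["dog_bark", "siren"]


clip = Clip()
clip.events
clip.events
print(clip.load_calls)  # 1
```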

docstring_inherit

This decorator is used for children of the Dataset class, and copies the Attributes from the parent class to the docstring of the child. This gives us clear and complete docs without a lot of copy-paste.

copy_docs

This decorator is used mainly for a dataset’s load_ functions, which are attached to a loader’s Dataset class. The attached function is identical, and this decorator simply copies the docstring from another function.
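The idea can be sketched in a few lines. The decorator below is an illustrative version written for this guide (copy_docs_sketch is a hypothetical name), and load_audio has a placeholder body so that the example stays self-contained.

```python
def copy_docs_sketch(original):
    """Return a decorator that copies the docstring of ``original``."""

    def wrapper(target):
        target.__doc__ = original.__doc__
        return target

    return wrapper


def load_audio(path):
    """Load an audio file (placeholder body for illustration)."""
    return path


class Dataset:
    @copy_docs_sketch(load_audio)
    def load_audio(self, *args, **kwargs):
        # Delegates to the module-level function of the same name.
        return load_audio(*args, **kwargs)


print(Dataset.load_audio.__doc__)
# Load an audio file (placeholder body for illustration).
```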

coerce_to_bytes_io/coerce_to_string_io

These two decorators simplify the loading of various Clip members and let users pass file streams instead of paths when the data is in a remote location, e.g. GCS. The decorators modify the function to:

Return None if None is passed in.
Open the file if a string path is passed in, in read mode ('r' for string_io, 'rb' for bytes_io), and pass the file handle to the decorated function.
Pass the file handle to the decorated function if a file-like object is passed.

These decorators cannot be used if the decorated function takes multiple arguments. coerce_to_bytes_io should not be used when loading an mp3 with librosa, as libsndfile does not support mp3 yet and audioread expects a path.
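The three behaviours listed above can be sketched as follows for the text variant. This is a simplified illustration, not soundata's implementation: coerce_to_string_io_sketch and load_labels are hypothetical names, and opening the path in read mode with UTF-8 encoding is an assumption of this sketch, made because the decorated functions load data.

```python
import functools
import io


def coerce_to_string_io_sketch(func):
    """Let a single-argument loader accept None, a path, or a text stream."""

    @functools.wraps(func)
    def wrapper(path_or_file):
        if path_or_file is None:
            return None
        if isinstance(path_or_file, str):
            # A string is treated as a path and opened for reading.
            with open(path_or_file, "r", encoding="utf-8") as fhandle:
                return func(fhandle)
        # Anything else is assumed to be an open file-like object.
        return func(path_or_file)

    return wrapper


@coerce_to_string_io_sketch
def load_labels(fhandle):
    """Read one label per non-empty line."""
    return [line.strip() for line in fhandle if line.strip()]


print(load_labels(io.StringIO("dog\nsiren\n")))  # ['dog', 'siren']
print(load_labels(None))  # None
```

A bytes variant would follow the same pattern, opening the path in binary read mode and accepting binary streams such as io.BytesIO.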