Supported Datasets and Annotations
This table is a guide to help users select appropriate datasets. For brevity, it documents only each dataset’s primary annotations and omits some metadata. For comprehensive details and API documentation for each dataset, see the dataset loaders section of the documentation.
“Downloadable” possible values:
✅ Freely downloadable
📺 YouTube links only
❌ Not available
Task codes (more information at the bottom of the page):
Explore each dataset’s documentation by clicking its name. The dataset ID used with the Soundata API is displayed below each name.
Annotation Types
The table above provides annotation types as a guide for choosing appropriate datasets. Here we provide a rough guide to the types in this table, but we strongly recommend reading the dataset-specific documentation to ensure the data is as you expect. To see how these annotation types are implemented in Soundata, see Annotations.
Tags
One or more string labels with corresponding confidence values. Tags have no start or end times
and span the full duration of the clip. Tags are used to represent annotations for:
Acoustic Scene Classification (ASC)
Sound Event Classification (SEC)
Sound Event Detection (SED) - weak labels
When every Tags annotation in a dataset contains exactly one label, it is typically a multi-class problem.
When Tags annotations contain varying numbers of labels (including 0), it is typically a multi-label problem.
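The multi-class versus multi-label rule above can be sketched as a simple check over a collection of clip-level annotations. This is a minimal illustration, not Soundata's implementation: `ClipTags` and `infer_problem_type` are hypothetical names introduced here, and the label strings are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipTags:
    """Hypothetical stand-in for a clip-level Tags annotation (not the Soundata class)."""
    labels: List[str]
    confidence: List[float]


def infer_problem_type(annotations: List[ClipTags]) -> str:
    """Return "multi-class" if every clip has exactly one label, else "multi-label"."""
    if all(len(a.labels) == 1 for a in annotations):
        return "multi-class"
    return "multi-label"


# Every clip carries exactly one label.
single = [ClipTags(["siren"], [1.0]), ClipTags(["dog_bark"], [1.0])]
# Clips carry a varying number of labels, including zero.
varied = [ClipTags(["siren", "engine"], [1.0, 0.8]), ClipTags([], [])]

print(infer_problem_type(single))
print(infer_problem_type(varied))
```

The check is only a heuristic, as the word "typically" in the text suggests; the dataset-specific documentation remains the authority on how a dataset is meant to be used.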
Events
Sound events with a start time, end time, label, and confidence. Events are used to represent annotations for:
Sound Event Detection (SED) - strong labels
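To illustrate how strong labels differ from weak ones, the sketch below stores events with the four attributes listed above and derives two views from them: the labels active at a given time, and the clip-level (weak) label set. The `SoundEvent` class and both helper functions are hypothetical and are not part of the Soundata API; times are assumed to be in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundEvent:
    """Hypothetical event record: times are assumed to be in seconds."""
    start: float
    end: float
    label: str
    confidence: float


def active_labels_at(events: List[SoundEvent], t: float) -> List[str]:
    """Labels of all events whose interval [start, end) contains time t."""
    return sorted({e.label for e in events if e.start <= t < e.end})


def events_to_weak_labels(events: List[SoundEvent]) -> List[str]:
    """Collapse strong labels to clip-level labels by discarding timing."""
    return sorted({e.label for e in events})


events = [
    SoundEvent(0.0, 1.5, "dog_bark", 1.0),
    SoundEvent(1.0, 3.0, "siren", 0.9),
    SoundEvent(4.0, 5.0, "dog_bark", 1.0),
]

print(active_labels_at(events, 1.2))
print(events_to_weak_labels(events))
```

Collapsing events this way loses the timing information, which is why the reverse direction (weak to strong) is not possible without additional annotation.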
Spatial Events
Spatial events represent annotations for applications such as spatial event detection and tracking. Like Events, Spatial Events include a start time, end time, label, and confidence. They can be extended with application-specific attributes such as geographical coordinates (latitude, longitude), altitude, direction (azimuth and elevation), and distance from reference points.
Spatial events are used to represent annotations for:
Sound Event Detection (SED) + Sound Event Localization (SEL)
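As a rough illustration, a spatial event can be thought of as a sound event extended with a direction and a distance. The sketch below is not Soundata's data model: `SpatialEvent` and `to_cartesian` are hypothetical, and the angle convention (degrees, azimuth measured in the horizontal plane from the x-axis toward the y-axis, elevation measured upward from that plane) is an assumption that varies between datasets.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SpatialEvent:
    """Hypothetical spatial event: a sound event plus direction and distance."""
    start: float       # seconds (assumed)
    end: float         # seconds (assumed)
    label: str
    confidence: float
    azimuth: float     # degrees (assumed convention)
    elevation: float   # degrees (assumed convention)
    distance: float    # metres from the reference point (assumed)


def to_cartesian(event: SpatialEvent) -> Tuple[float, float, float]:
    """Convert direction and distance to (x, y, z) under the assumed convention."""
    az = math.radians(event.azimuth)
    el = math.radians(event.elevation)
    x = event.distance * math.cos(el) * math.cos(az)
    y = event.distance * math.cos(el) * math.sin(az)
    z = event.distance * math.sin(el)
    return x, y, z


event = SpatialEvent(0.5, 2.0, "footsteps", 1.0,
                     azimuth=90.0, elevation=0.0, distance=2.0)
x, y, z = to_cartesian(event)
print(round(x, 6), round(y, 6), round(z, 6))
```

Because coordinate conventions and units differ across datasets, check the dataset-specific documentation before applying any conversion of this kind.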



