Supported Datasets and Annotations

This table is a guide to help users select appropriate datasets. For brevity, it lists only each dataset’s primary annotations and omits some metadata. For comprehensive details and API documentation for each dataset, see the dataset loaders section of the documentation.

Possible values for “Downloadable?”:

✅ Freely downloadable
📺 YouTube links only
❌ Not available

Task Codes (more information at the bottom of the page):

SEL Sound Event Localization
SED Sound Event Detection
SEC Sound Event Classification
ASC Acoustic Scene Classification
AC Audio Captioning

Explore each dataset’s documentation by clicking its name. For Soundata API usage, the dataset ID is listed with each name.
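As a minimal sketch of how a dataset ID from the table is used, the snippet below wraps `soundata.initialize`. It assumes the `soundata` package is installed (`pip install soundata`). The helper names `open_dataset` and `fetch_random_clip` are illustrative and not part of Soundata, and the loader methods shown should be checked against the dataset loaders section for the version you use.

```python
def open_dataset(dataset_id, data_home=None):
    """Create a Soundata loader for a dataset ID taken from the table.

    data_home is the folder where the dataset is (or will be) stored;
    None lets Soundata pick its default location.
    """
    import soundata  # third-party dependency, imported lazily

    return soundata.initialize(dataset_id, data_home=data_home)


def fetch_random_clip(dataset_id, data_home=None):
    """Download a dataset, verify it, and return one random clip."""
    dataset = open_dataset(dataset_id, data_home=data_home)
    dataset.download()   # fetch audio and annotations
    dataset.validate()   # compare local files against known checksums
    return dataset.choice_clip()
```

For example, `fetch_random_clip("urbansound8k")` would download UrbanSound8K before returning a clip, so expect network traffic and disk usage.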

Dataset	Downloadable?	Annotations	Clips	Hours	Tasks	Soundscapes	License
3D-MARCo ID: `marco`	audio: ✅ annotations: ✅	Tags	26	0.3	SEL	MUSIC
Clotho ID: `clotho`	audio: ✅ annotations: ✅	Tags	5929	37.05	AC	ENVIRONMENT
DCASE23-Task2 ID: `dcase23_task2`	audio: ✅ annotations: ✅	Tags	174	21	SEC	MACHINE
DCASE23-Task4B ID: `dcase23_task4b`	audio: ✅ annotations: ✅	Events	49	3.16	SED	ENVIRONMENT BIOACOUSTIC
DCASE23-Task6A ID: `dcase23_task6a`	audio: ✅ annotations: ✅	Tags	6974	43.2	AC
DCASE23-Task6B ID: `dcase23_task6b`	audio: ✅ annotations: ✅	Tags	6974	43.2	AC
DCASE-Bioacoustic ID: `dcase_bioacoustic`	audio: ✅ annotations: ✅	Events	174	21	SED	BIOACOUSTIC
DCASE-BirdVox20k ID: `dcase_birdVox20k`	audio: ✅ annotations: ✅	Tags	20000	55.5	SEC	BIOACOUSTIC
EigenScape (HOA 25 ch) ID: `eigenscape`	audio: ✅ annotations: ✅	Tags	64	10.7	ASC
EigenScape Raw (32 ch) ID: `eigenscape_raw`	audio: ✅ annotations: ✅	Tags	64	10.7	ASC
ESC-50 ID: `esc50`	audio: ✅ annotations: ✅	Tags	2000	2.8	SEC	ENVIRONMENT
Freefield1010 ID: `freefield1010`	audio: ✅ annotations: ✅	Tags	7690	21.3	SEC	BIOACOUSTIC
FSD50K ID: `fsd50k`	audio: ✅ annotations: ✅	Tags	51197	108.3	SEC	ENVIRONMENT MUSIC BIOACOUSTIC URBAN MACHINE
FSDnoisy18K ID: `fsdnoisy18k`	audio: ✅ annotations: ✅	Tags	18532	42.5	SEC	ENVIRONMENT MUSIC MACHINE
SINGA:PURA ID: `singapura`	audio: ✅ annotations: ✅	Events	6547	18.2	SED	URBAN
STARSS 2022 ID: `starss2022`	audio: ✅ annotations: ✅	Spatial Events	121	5	SED SEL	ENVIRONMENT MUSIC
TAU NIGENS SSE 2020 ID: `tau2020sse_nigens`	audio: ✅ annotations: ✅	Spatial Events	800	15	SED SEL	ENVIRONMENT MUSIC BIOACOUSTIC MACHINE
TAU NIGENS SSE 2021 ID: `tau2021sse_nigens`	audio: ✅ annotations: ✅	Spatial Events	800	15	SED SEL	ENVIRONMENT MUSIC BIOACOUSTIC MACHINE
TAU Urban Acoustic Scenes 2019 ID: `tau2019uas`	audio: ✅ annotations: ✅	Tags	22800	63.3	ASC	URBAN	Custom
TAU Urban Acoustic Scenes 2020 Mobile ID: `tau2020uas_mobile`	audio: ✅ annotations: ✅	Tags	34915	97	ASC	URBAN	Custom
TAU Urban Acoustic Scenes 2022 Mobile ID: `tau2022uas_mobile`	audio: ✅ annotations: ✅	Tags	349150	97	ASC	URBAN	Custom
TAU SSE 2019 ID: `tau2019sse`	audio: ✅ annotations: ✅	Spatial Events	500	8.3	SED SEL	ENVIRONMENT	Custom
TUT Sound Events 2017 ID: `tut2017se`	audio: ✅ annotations: ✅	Events	32	2.02	SED	ENVIRONMENT	Custom
URBAN-SED ID: `urbansed`	audio: ✅ annotations: ✅	Events	10000	27.8	SED	URBAN
UrbanSound8K ID: `urbansound8k`	audio: ✅ annotations: ✅	Tags	8732	8.75	SEC	URBAN
Warblrb10k ID: `warblrb10k`	audio: ✅ annotations: ✅	Tags	10000	28	SEC	BIOACOUSTIC
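
The table can also be queried programmatically. The sketch below hard-codes a few rows transcribed from the table (it is not generated from Soundata itself) and filters them by task code and soundscape. The `DatasetRow` type and the `find_datasets` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRow:
    dataset_id: str
    annotations: str
    clips: int
    hours: float
    tasks: frozenset
    soundscapes: frozenset


# A handful of rows transcribed from the table above.
ROWS = [
    DatasetRow("esc50", "Tags", 2000, 2.8,
               frozenset({"SEC"}), frozenset({"ENVIRONMENT"})),
    DatasetRow("urbansound8k", "Tags", 8732, 8.75,
               frozenset({"SEC"}), frozenset({"URBAN"})),
    DatasetRow("urbansed", "Events", 10000, 27.8,
               frozenset({"SED"}), frozenset({"URBAN"})),
    DatasetRow("singapura", "Events", 6547, 18.2,
               frozenset({"SED"}), frozenset({"URBAN"})),
    DatasetRow("starss2022", "Spatial Events", 121, 5.0,
               frozenset({"SED", "SEL"}), frozenset({"ENVIRONMENT", "MUSIC"})),
]


def find_datasets(rows, task=None, soundscape=None):
    """Return IDs of rows matching the given task code and/or soundscape."""
    matches = []
    for row in rows:
        if task is not None and task not in row.tasks:
            continue
        if soundscape is not None and soundscape not in row.soundscapes:
            continue
        matches.append(row.dataset_id)
    return matches


print(find_datasets(ROWS, task="SED", soundscape="URBAN"))
# ['urbansed', 'singapura']
```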

Annotation Types

The table above lists annotation types as a guide for choosing appropriate datasets. This section gives a rough guide to those types, but we strongly recommend reading the dataset-specific documentation to ensure the data is what you expect. To see how these annotation types are implemented in Soundata, see Annotations.

Tags

One or more string labels with corresponding confidence values. Tags do not have start or end times, and span the full duration of the clip. Tags are used to represent annotations for:

Acoustic Scene Classification (ASC)
Sound Event Classification (SEC)
Sound Event Detection (SED) - weak labels

When every Tags annotation in a dataset contains exactly one label, it is typically a multi-class problem. When Tags annotations contain varying numbers of labels (including 0), it is typically a multi-label problem.
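
To make the multi-class versus multi-label distinction concrete, here is a small sketch. The `Tags` dataclass is a simplified stand-in for illustration, not Soundata's own class, and `problem_type` is a hypothetical helper that applies the rule of thumb above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tags:
    """Clip-level labels with one confidence value per label."""
    labels: List[str]
    confidence: List[float]

    def __post_init__(self):
        if len(self.labels) != len(self.confidence):
            raise ValueError("labels and confidence must have the same length")


def problem_type(annotations):
    """Guess the problem type from the number of labels per clip."""
    if all(len(tags.labels) == 1 for tags in annotations):
        return "multi-class"
    return "multi-label"


single = [Tags(["dog_bark"], [1.0]), Tags(["siren"], [1.0])]
varied = [Tags(["dog_bark", "siren"], [1.0, 0.8]), Tags([], [])]
print(problem_type(single))  # multi-class
print(problem_type(varied))  # multi-label
```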

Events

Sound events with a start time, end time, label, and confidence. Events are used to represent annotations for:

Sound Event Detection (SED) - strong labels
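
The relationship between strong and weak labels can be sketched as follows. The `Event` dataclass is a simplified illustration rather than Soundata's implementation, and `to_weak_labels` is a hypothetical helper that discards timing to obtain clip-level labels.

```python
from dataclasses import dataclass


@dataclass
class Event:
    start: float       # seconds from the beginning of the clip
    end: float         # seconds from the beginning of the clip
    label: str
    confidence: float

    def __post_init__(self):
        if self.start < 0 or self.end < self.start:
            raise ValueError("expected 0 <= start <= end")


def to_weak_labels(events):
    """Collapse strong labels into the sorted list of labels present in the clip."""
    return sorted({event.label for event in events})


events = [
    Event(0.5, 2.0, "car_horn", 1.0),
    Event(1.2, 4.7, "dog_bark", 1.0),
    Event(6.0, 6.8, "car_horn", 1.0),
]
print(to_weak_labels(events))  # ['car_horn', 'dog_bark']
```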

Spatial Events

Spatial Events represent annotations for applications such as spatial event detection and tracking. Like Events, they include a start time, end time, label, and confidence. They can be extended with application-specific attributes such as geographical coordinates (latitude, longitude), altitude, direction (azimuth and elevation), and distance from reference points. Spatial Events are used to represent annotations for:

Sound Event Detection (SED) + Sound Event Localization (SEL)
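
As an illustration of the direction and distance attributes, the sketch below converts azimuth, elevation, and distance into Cartesian coordinates. It assumes angles in degrees, azimuth measured counter-clockwise from the x-axis in the horizontal plane, and elevation measured upward from that plane. Individual datasets may use other conventions, so check their documentation. The `SpatialEvent` type is hypothetical and is not Soundata's class.

```python
import math
from dataclasses import dataclass


@dataclass
class SpatialEvent:
    start: float        # seconds
    end: float          # seconds
    label: str
    confidence: float
    azimuth: float      # degrees, assumed counter-clockwise from the x-axis
    elevation: float    # degrees, assumed upward from the horizontal plane
    distance: float     # metres from the reference point

    def to_cartesian(self):
        """Return (x, y, z) relative to the reference point."""
        az = math.radians(self.azimuth)
        el = math.radians(self.elevation)
        x = self.distance * math.cos(el) * math.cos(az)
        y = self.distance * math.cos(el) * math.sin(az)
        z = self.distance * math.sin(el)
        return x, y, z


event = SpatialEvent(0.0, 1.5, "footsteps", 1.0,
                     azimuth=90.0, elevation=0.0, distance=2.0)
x, y, z = event.to_cartesian()
print(round(x, 6), round(y, 6), round(z, 6))  # 0.0 2.0 0.0
```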

Use Cases

Tasks

Sound Event Localization (SEL)

SEL involves determining the spatial location from which a sound originates within an environment. It goes beyond detection and classification to include the position in space relative to the listener or recording device.

Sound Event Detection (SED)

SED is concerned with identifying the presence and duration of sound events within an audio stream. It uses both weak labels (Tags) for presence and strong labels (Events) for temporal localization of sound events.
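
One common way to work with strong labels, sketched here under simple assumptions, is to turn events into a frame-level activity grid. The frame length, the `(start, end, label)` tuple layout, and the `frame_activity` helper are illustrative choices, not part of Soundata.

```python
import math


def frame_activity(events, clip_duration, frame_length=1.0):
    """Map each label to a list of 0/1 flags, one per frame.

    A frame is marked active if an event with nonzero length overlaps it.
    """
    n_frames = math.ceil(clip_duration / frame_length)
    activity = {}
    for start, end, label in events:
        flags = activity.setdefault(label, [0] * n_frames)
        first = int(start // frame_length)
        last = min(n_frames, math.ceil(end / frame_length))
        for i in range(first, last):
            flags[i] = 1
    return activity


events = [(0.5, 2.0, "car_horn"), (1.2, 3.1, "dog_bark")]
print(frame_activity(events, clip_duration=4.0))
# {'car_horn': [1, 1, 0, 0], 'dog_bark': [0, 1, 1, 1]}
```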

Sound Event Classification (SEC)

SEC categorizes sounds into predefined classes: the audio is analyzed and assigned a category based on the type of sound event it contains. The category is represented with Tags that span the clip’s entire duration.

Acoustic Scene Classification (ASC)

ASC classifies an entire audio stream into a scene category, characterizing the recording’s environment. Tags are used to indicate the single acoustic scene represented in the clip.

Audio Captioning (AC)

AC involves generating a textual description of the sound events and context within an audio clip. It is similar to image captioning but for audio content.

Soundscapes

URBAN

Urban environments are characterized by a blend of sounds from traffic, human activity, construction, and sometimes nature. Recordings in urban areas are often used to study noise pollution, city planning, or to create soundscapes for multimedia productions.

ENVIRONMENT

The spectrum of environmental sounds includes all the background noises found in various habitats. These auditory elements can be as diverse as the whisper of foliage in woodlands, the gentle flow of water in brooks, or the fierce gusts of wind sweeping through arid landscapes.

MACHINE

Machine sounds refer to the audio signatures of mechanical devices, such as engines, factory machinery, household appliances, and office equipment. These sounds are crucial for monitoring equipment performance, diagnosing faults, and designing sound-aware applications.

BIOACOUSTIC

Bioacoustic sounds are produced by biological organisms, like the vocalizations of animals and birds. Studying these sounds can provide insights into animal behavior, biodiversity, and ecosystem health.

MUSIC

Music sounds encompass the vast array of musical compositions, instruments, and the human voice as used in singing. These sounds are central to the entertainment industry, cultural studies, and music therapy.