Supported Datasets and Annotations

This table is a guide to help users select appropriate datasets. For brevity, it lists only each dataset’s primary annotations and omits some metadata. For comprehensive details and API documentation for each dataset, see the dataset loaders section of the documentation.

“Downloadable” possible values:

  • ✅ Freely downloadable

  • 📺 YouTube links only

  • ❌ Not available

Task codes (more information at the bottom of the page):

SEL: Sound Event Localization
SED: Sound Event Detection
SEC: Sound Event Classification
ASC: Acoustic Scene Classification
AC: Audio Captioning

Click a tag to see more information about that specific use case.

| Dataset | Downloadable? | Annotations | Clips | Hours | Tasks | Soundscapes | License |
|---|---|---|---|---|---|---|---|
| 3D-MARCo | audio: ✅, annotations: ✅ | Tags | 26 | 0.3 | | | CC BY-NC 3.0 |
| DCASE23-Task2 | audio: ✅, annotations: ✅ | Tags | 174 | 21 | | | CC BY 4.0 |
| DCASE23-Task4B | audio: ✅, annotations: ✅ | Events | 49 | 3.16 | | | CC BY-NC 3.0 |
| DCASE23-Task6A | audio: ✅, annotations: ✅ | Tags | 6974 | 43.2 | | | CC BY 4.0 |
| DCASE23-Task6B | audio: ✅, annotations: ✅ | Tags | 6974 | 43.2 | | | CC BY 4.0 |
| DCASE-Bioacoustic | audio: ✅, annotations: ✅ | Events | 174 | 21 | | | CC BY 4.0 |
| DCASE-BirdVox20k | audio: ✅, annotations: ✅ | Tags | 20000 | 55.5 | | | CC BY 4.0 |
| (HOA 25 ch) | audio: ✅, annotations: ✅ | Tags | 64 | 10.7 | | | CC BY 4.0 |
| (32 ch) | audio: ✅, annotations: ✅ | Tags | 64 | 10.7 | | | CC BY 4.0 |
| ESC-50 | audio: ✅, annotations: ✅ | Tags | 2000 | 2.8 | | | CC BY-NC 3.0 |
| Freefield1010 | audio: ✅, annotations: ✅ | Tags | 7690 | 21.3 | | | CC BY 4.0 |
| FSD50K | audio: ✅, annotations: ✅ | Tags | 51197 | 108.3 | | | CC BY 4.0 |
| FSDnoisy18K | audio: ✅, annotations: ✅ | Tags | 18532 | 42.5 | | | CC BY 4.0 |
| SINGA:PURA | audio: ✅, annotations: ✅ | Events | 6547 | 18.2 | | | CC BY-SA 4.0 |
| | audio: ✅, annotations: ✅ | Spatial Events | 121 | 5 | | | MIT |
| | audio: ✅, annotations: ✅ | Spatial Events | 800 | 15 | | | CC BY-NC 4.0 |
| | audio: ✅, annotations: ✅ | Spatial Events | 800 | 15 | | | CC BY-NC 4.0 |
| | audio: ✅, annotations: ✅ | Tags | 22800 | 63.3 | | | Custom |
| | audio: ✅, annotations: ✅ | Tags | 34915 | 97 | | | Custom |
| | audio: ✅, annotations: ✅ | Tags | 349150 | 97 | | | Custom |
| | audio: ✅, annotations: ✅ | Spatial Events | 500 | 8.3 | | | Custom |
| | audio: ✅, annotations: ✅ | Events | 32 | 2.02 | | | Custom |
| URBAN-SED | audio: ✅, annotations: ✅ | Events | 10000 | 27.8 | | | CC BY 4.0 |
| UrbanSound8K | audio: ✅, annotations: ✅ | Tags | 8732 | 8.75 | | | CC BY-NC 4.0 |
| Warblrb10k | audio: ✅, annotations: ✅ | Tags | 10000 | 28 | | | CC BY 4.0 |
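
The Clips and Hours columns can be combined to estimate the average clip length of a dataset, which is often useful when choosing between candidates. The sketch below is plain Python and does not use the Soundata API; the helper name `mean_clip_seconds` is hypothetical, and because the Hours values in the table are rounded, the results are approximate.

```python
def mean_clip_seconds(clips: int, hours: float) -> float:
    """Approximate mean clip duration in seconds from table values."""
    if clips <= 0:
        raise ValueError("clips must be a positive integer")
    return hours * 3600.0 / clips


# (clips, hours) pairs copied from the table above.
rows = {
    "ESC-50": (2000, 2.8),
    "UrbanSound8K": (8732, 8.75),
    "URBAN-SED": (10000, 27.8),
}

for name, (clips, hours) in rows.items():
    print(f"{name}: {mean_clip_seconds(clips, hours):.1f} s")
# ESC-50: 5.0 s
# UrbanSound8K: 3.6 s
# URBAN-SED: 10.0 s
```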

Annotation Types

The table above lists annotation types as a guide for choosing appropriate datasets. Here we provide a rough guide to the types in this table, but we strongly recommend reading the dataset-specific documentation to ensure the data is as you expect. To see how these annotation types are implemented in Soundata, see Annotations.

Tags

One or more string labels with corresponding confidence values. Tags do not have start or end times, and span the full duration of the clip. Tags are used to represent annotations for:

  • Acoustic Scene Classification (ASC)

  • Sound Event Classification (SEC)

  • Sound Event Detection (SED) - weak labels

When every Tags annotation in a dataset contains exactly one label, it is typically a multi-class problem. When Tags annotations contain varying numbers of labels (including 0), it is typically a multi-label problem.
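
The multi-class versus multi-label rule of thumb above can be sketched as a small check. The `Tags` dataclass here is a hypothetical stand-in that only mirrors the description (labels plus confidence values); it is not Soundata's annotation class, and the label strings are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Tags:
    """Hypothetical clip-level annotation: labels with matching confidences."""
    labels: List[str]
    confidence: List[float]


def problem_type(annotations: Sequence[Tags]) -> str:
    """Return 'multi-class' if every annotation has exactly one label,
    otherwise 'multi-label'."""
    if all(len(tags.labels) == 1 for tags in annotations):
        return "multi-class"
    return "multi-label"


single = [Tags(["dog_bark"], [1.0]), Tags(["siren"], [0.9])]
mixed = [Tags(["dog_bark", "siren"], [1.0, 0.7]), Tags([], [])]

print(problem_type(single))  # multi-class
print(problem_type(mixed))   # multi-label
```

This is only a heuristic over the annotations you inspect; the dataset-specific documentation remains the authoritative source on how labels were assigned.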

Events

Sound events with a start time, end time, label, and confidence. Events are used to represent annotations for:

  • Sound Event Detection (SED) - strong labels
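
To illustrate how strong labels relate to weak labels, the sketch below collapses a list of events into clip-level labels and queries which labels are active at a given time. The `Event` dataclass and both helper functions are hypothetical and written only to mirror the description above; they are not part of the Soundata API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """Hypothetical strongly labeled sound event."""
    start: float       # seconds
    end: float         # seconds
    label: str
    confidence: float


def to_weak_labels(events: List[Event]) -> List[str]:
    """Drop timing and keep the sorted set of labels present in the clip."""
    return sorted({event.label for event in events})


def active_labels(events: List[Event], time: float) -> List[str]:
    """Labels of events whose [start, end) interval contains `time`."""
    return sorted({e.label for e in events if e.start <= time < e.end})


events = [
    Event(0.5, 2.0, "car_horn", 1.0),
    Event(1.5, 4.0, "dog_bark", 0.8),
    Event(6.0, 7.0, "car_horn", 1.0),
]

print(to_weak_labels(events))       # ['car_horn', 'dog_bark']
print(active_labels(events, 1.75))  # ['car_horn', 'dog_bark']
print(active_labels(events, 5.0))   # []
```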

Spatial Events

Spatial Events annotate sound events that also carry spatial information, for applications such as spatial event detection and tracking. Like Events, they include a start time, end time, label, and confidence. Depending on the application, they can carry additional attributes such as geographical coordinates (latitude, longitude), altitude, direction (azimuth and elevation), and distance from a reference point. Spatial Events are used to represent annotations for:

  • Sound Event Detection (SED) + Sound Event Localization (SEL)
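
As a rough illustration of the direction and distance attributes, the sketch below converts azimuth, elevation, and distance into Cartesian coordinates relative to the recording device. The `SpatialEvent` dataclass is hypothetical, and the angle convention (degrees, azimuth measured counter-clockwise from the x axis, elevation measured upward from the horizontal plane) is an assumption; individual datasets may use different conventions, so check the dataset-specific documentation.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SpatialEvent:
    """Hypothetical sound event with a direction and distance."""
    start: float       # seconds
    end: float         # seconds
    label: str
    confidence: float
    azimuth: float     # degrees (assumed convention, see lead-in)
    elevation: float   # degrees (assumed convention, see lead-in)
    distance: float    # metres


def to_cartesian(event: SpatialEvent) -> Tuple[float, float, float]:
    """Convert (azimuth, elevation, distance) to (x, y, z)."""
    az = math.radians(event.azimuth)
    el = math.radians(event.elevation)
    x = event.distance * math.cos(el) * math.cos(az)
    y = event.distance * math.cos(el) * math.sin(az)
    z = event.distance * math.sin(el)
    return x, y, z


event = SpatialEvent(0.0, 1.0, "speech", 1.0,
                     azimuth=90.0, elevation=0.0, distance=2.0)
print(tuple(round(v, 6) for v in to_cartesian(event)))  # (0.0, 2.0, 0.0)
```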

Use Cases

Tasks

SEL involves determining the spatial location from which a sound originates within an environment. It goes beyond detection and classification to include the position in space relative to the listener or recording device.
SED is concerned with identifying the presence and duration of sound events within an audio stream. It uses both weak labels (Tags) for presence and strong labels (Events) for temporal localization of sound events.
SEC categorizes sounds into predefined classes and involves analyzing audio to assign a category based on the type of sound event it contains, using Tags for the entire clip’s duration.
ASC classifies an entire audio stream into a scene category, characterizing the recording’s environment. Tags are used to indicate the single acoustic scene represented in the clip.
AC involves generating a textual description of the sound events and context within an audio clip. It is similar to image captioning but for audio content.
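
When scanning the table for a given task, it can help to know which annotation types to look for. The lookup below is a sketch that restates the pairings from the Annotation Types section; the names `ANNOTATIONS_FOR_TASK` and `annotation_types` are hypothetical. AC is left out because this page does not say which annotation type represents captions.

```python
from typing import Dict, List

# Pairings taken from the Annotation Types section of this page.
ANNOTATIONS_FOR_TASK: Dict[str, List[str]] = {
    "SEL": ["Spatial Events"],
    "SED": ["Tags", "Events", "Spatial Events"],  # weak, strong, with location
    "SEC": ["Tags"],
    "ASC": ["Tags"],
}


def annotation_types(task_code: str) -> List[str]:
    """Return the annotation types this page associates with a task code."""
    code = task_code.strip().upper()
    if code not in ANNOTATIONS_FOR_TASK:
        raise KeyError(f"no annotation type listed for task code {code!r}")
    return ANNOTATIONS_FOR_TASK[code]


print(annotation_types("sed"))  # ['Tags', 'Events', 'Spatial Events']
print(annotation_types("ASC"))  # ['Tags']
```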

Soundscapes

Urban environments are characterized by a blend of sounds from traffic, human activity, construction, and sometimes nature. Recordings in urban areas are often used to study noise pollution, city planning, or to create soundscapes for multimedia productions.
The spectrum of environmental sounds includes all the background noises found in various habitats. These auditory elements can be as diverse as the whisper of foliage in woodlands, the gentle flow of water in brooks, or the fierce gusts of wind sweeping through arid landscapes.
Machine sounds refer to the audio signatures of mechanical devices, such as engines, factory machinery, household appliances, and office equipment. These sounds are crucial for monitoring equipment performance, diagnosing faults, and designing sound-aware applications.
Bioacoustic sounds are produced by biological organisms, like the vocalizations of animals and birds. Studying these sounds can provide insights into animal behavior, biodiversity, and ecosystem health.
Music sounds encompass the vast array of musical compositions, instruments, and the human voice as used in singing. These sounds are central to the entertainment industry, cultural studies, and music therapy.