Supported Datasets and Annotations

Use this table as a guide for selecting an appropriate dataset. For brevity, the table lists only each dataset’s primary annotations and omits some metadata. For comprehensive details and API documentation on each dataset, see the Dataset Loaders section of the documentation.

Possible values for “Downloadable?”:

  • ✅ Freely downloadable

  • 📺 YouTube links only

  • ❌ Not available

Task codes (more information at the bottom of the page):

  • SEL: Sound Event Localization

  • SED: Sound Event Detection

  • SEC: Sound Event Classification

  • ASC: Acoustic Scene Classification

  • AC: Audio Captioning

Explore each dataset’s documentation by clicking its name. Each dataset is listed with the ID to pass to the Soundata API.
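As a minimal sketch of how a dataset ID from this page is used, the helper below checks an ID against the IDs listed in the table and then hands it to `soundata.initialize`. The helper name `initialize_dataset` and the `TABLE_IDS` set are illustrative and not part of Soundata; the sketch also assumes the `soundata` package is installed, which is why the import is deferred until the ID has been checked.

```python
# IDs copied from the dataset table on this page.
TABLE_IDS = {
    "marco", "dcase23_task2", "dcase23_task4b", "dcase23_task6a",
    "dcase23_task6b", "dcase_bioacoustic", "dcase_birdVox20k",
    "eigenscape", "eigenscape_raw", "esc50", "freefield1010", "fsd50k",
    "fsdnoisy18k", "singapura", "starss2022", "tau2020sse_nigens",
    "tau2021sse_nigens", "tau2019uas", "tau2020uas_mobile",
    "tau2022uas_mobile", "tau2019sse", "tut2017se", "urbansed",
    "urbansound8k", "warblrb10k",
}


def initialize_dataset(dataset_id, data_home=None):
    """Hypothetical helper: validate the ID, then create the dataset object."""
    if dataset_id not in TABLE_IDS:
        raise ValueError(f"Unknown dataset ID: {dataset_id!r}")
    # Third-party package, imported lazily so the ID check works without it.
    import soundata
    return soundata.initialize(dataset_id, data_home=data_home)
```

With Soundata installed, `initialize_dataset("urbansound8k")` returns a dataset object whose `download()`, `validate()`, and `choice_clip()` methods fetch the data, check the files, and pick a random clip to inspect.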

Dataset ID             | Downloadable?          | Annotations    | Clips  | Hours | Tasks | Soundscapes | License
-----------------------+------------------------+----------------+--------+-------+-------+-------------+-------------
marco                  | audio ✅, annotations ✅ | Tags           | 26     | 0.3   |       |             | CC BY-NC 3.0
dcase23_task2          | audio ✅, annotations ✅ | Tags           | 174    | 21    |       |             | CC BY 4.0
dcase23_task4b         | audio ✅, annotations ✅ | Events         | 49     | 3.16  |       |             | CC BY-NC 3.0
dcase23_task6a         | audio ✅, annotations ✅ | Tags           | 6974   | 43.2  |       |             | CC BY 4.0
dcase23_task6b         | audio ✅, annotations ✅ | Tags           | 6974   | 43.2  |       |             | CC BY 4.0
dcase_bioacoustic      | audio ✅, annotations ✅ | Events         | 174    | 21    |       |             | CC BY 4.0
dcase_birdVox20k       | audio ✅, annotations ✅ | Tags           | 20,000 | 55.5  |       |             | CC BY 4.0
eigenscape (HOA 25 ch) | audio ✅, annotations ✅ | Tags           | 64     | 10.7  |       |             | CC BY 4.0
eigenscape_raw (32 ch) | audio ✅, annotations ✅ | Tags           | 64     | 10.7  |       |             | CC BY 4.0
esc50                  | audio ✅, annotations ✅ | Tags           | 2000   | 2.8   |       |             | CC BY-NC 3.0
freefield1010          | audio ✅, annotations ✅ | Tags           | 7690   | 21.3  |       |             | CC BY 4.0
fsd50k                 | audio ✅, annotations ✅ | Tags           | 51197  | 108.3 |       |             | CC BY 4.0
fsdnoisy18k            | audio ✅, annotations ✅ | Tags           | 18532  | 42.5  |       |             | CC BY 4.0
singapura              | audio ✅, annotations ✅ | Events         | 6547   | 18.2  |       |             | CC BY-SA 4.0
starss2022             | audio ✅, annotations ✅ | Spatial Events | 121    | 5     |       |             | MIT
tau2020sse_nigens      | audio ✅, annotations ✅ | Spatial Events | 800    | 15    |       |             | CC BY-NC 4.0
tau2021sse_nigens      | audio ✅, annotations ✅ | Spatial Events | 800    | 15    |       |             | CC BY-NC 4.0
tau2019uas             | audio ✅, annotations ✅ | Tags           | 22800  | 63.3  |       |             | Custom
tau2020uas_mobile      | audio ✅, annotations ✅ | Tags           | 34915  | 97    |       |             | Custom
tau2022uas_mobile      | audio ✅, annotations ✅ | Tags           | 349150 | 97    |       |             | Custom
tau2019sse             | audio ✅, annotations ✅ | Spatial Events | 500    | 8.3   |       |             | Custom
tut2017se              | audio ✅, annotations ✅ | Events         | 32     | 2.02  |       |             | Custom
urbansed               | audio ✅, annotations ✅ | Events         | 10000  | 27.8  |       |             | CC BY 4.0
urbansound8k           | audio ✅, annotations ✅ | Tags           | 8732   | 8.75  |       |             | CC BY-NC 4.0
warblrb10k             | audio ✅, annotations ✅ | Tags           | 10,000 | 28    |       |             | CC BY 4.0
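When comparing datasets, the Clips and Hours values can be combined into a rough mean clip length, which helps distinguish short isolated clips from long continuous recordings. The sketch below copies a handful of entries from the table above; the `Row` type and both helper functions are illustrative names and not part of Soundata.

```python
from typing import NamedTuple


class Row(NamedTuple):
    dataset_id: str
    annotation: str
    clips: int
    hours: float


# A few entries copied from the table on this page.
ROWS = [
    Row("esc50", "Tags", 2000, 2.8),
    Row("urbansound8k", "Tags", 8732, 8.75),
    Row("urbansed", "Events", 10000, 27.8),
    Row("tut2017se", "Events", 32, 2.02),
    Row("starss2022", "Spatial Events", 121, 5),
]


def mean_clip_seconds(row):
    """Average clip length in seconds, derived from total hours and clip count."""
    return row.hours * 3600 / row.clips


def with_annotation(rows, annotation):
    """Keep only the rows whose primary annotation type matches."""
    return [row for row in rows if row.annotation == annotation]


for row in with_annotation(ROWS, "Events"):
    print(f"{row.dataset_id}: {mean_clip_seconds(row):.1f} s per clip on average")
```

The result is only an average: it says nothing about how clip lengths are distributed within a dataset.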

Annotation Types

The table above lists annotation types as a guide for choosing appropriate datasets. Here we provide a rough guide to the types in this table, but we strongly recommend reading the dataset-specific documentation to ensure the data is as you expect. To see how these annotation types are implemented in Soundata, see Annotations.

Tags

One or more string labels with corresponding confidence values. Tags do not have start or end times, and span the full duration of the clip. Tags are used to represent annotations for:

  • Acoustic Scene Classification (ASC)

  • Sound Event Classification (SEC)

  • Sound Event Detection (SED) - weak labels

When every Tags annotation in a dataset contains exactly one label, it is typically a multi-class problem. When Tags annotations contain varying numbers of labels (including 0), it is typically a multi-label problem.
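The rule of thumb above can be sketched as a small check over the label lists of a dataset. This is an illustration only: the function name is hypothetical, labels are plain Python lists rather than Soundata annotation objects, and treating every case other than "exactly one label per clip" as multi-label is an assumption.

```python
def tagging_problem_type(tag_sets):
    """Guess the problem type from a list of per-clip label lists."""
    if not tag_sets:
        raise ValueError("need at least one Tags annotation")
    label_counts = {len(labels) for labels in tag_sets}
    # Exactly one label on every clip: typically multi-class.
    if label_counts == {1}:
        return "multi-class"
    # Varying numbers of labels (including 0): typically multi-label.
    return "multi-label"


print(tagging_problem_type([["dog_bark"], ["siren"], ["drilling"]]))
print(tagging_problem_type([["dog_bark", "siren"], [], ["drilling"]]))
```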

Events

Sound events with a start time, end time, label, and confidence. Events are used to represent annotations for:

  • Sound Event Detection (SED) - strong labels
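To make the difference between strong and weak labels concrete, the sketch below models an event with the four fields listed above and collapses a list of events into clip-level labels. The `Event` dataclass and `to_weak_labels` are hypothetical stand-ins, not Soundata's own classes, and keeping the highest confidence per label is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    start: float       # start time in seconds
    end: float         # end time in seconds
    label: str
    confidence: float


def to_weak_labels(events):
    """Drop the timing and keep the highest confidence seen for each label."""
    weak = {}
    for event in events:
        weak[event.label] = max(weak.get(event.label, 0.0), event.confidence)
    return weak


events = [
    Event(0.0, 1.5, "dog_bark", 1.0),
    Event(2.0, 4.0, "siren", 0.8),
    Event(5.0, 6.0, "dog_bark", 0.6),
]
print(to_weak_labels(events))
```

The result has the same shape as a Tags annotation: labels with confidences that apply to the whole clip.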

Spatial Events

Spatial Events represent annotations for applications such as spatial sound event detection and tracking. Like Events, they have a start time, end time, label, and confidence. They can also carry application-specific attributes such as geographical coordinates (latitude, longitude), altitude, direction (azimuth and elevation), and distance from reference points. Spatial Events are used to represent annotations for:

  • Sound Event Detection (SED) + Sound Event Localization (SEL)
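As an illustration of the direction and distance attributes, the sketch below stores them alongside the usual event fields and converts them to Cartesian coordinates relative to the recording device. The `SpatialEvent` dataclass is hypothetical, and the angle convention (azimuth measured counter-clockwise from the x-axis, elevation measured upward from the horizontal plane, both in degrees) is an assumption; datasets differ, so check the dataset-specific documentation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialEvent:
    start: float          # seconds
    end: float            # seconds
    label: str
    confidence: float
    azimuth_deg: float    # assumed: counter-clockwise from the x-axis
    elevation_deg: float  # assumed: upward from the horizontal plane
    distance_m: float     # distance from the recording device


def to_cartesian(event):
    """Convert the event's direction and distance to (x, y, z) in metres."""
    azimuth = math.radians(event.azimuth_deg)
    elevation = math.radians(event.elevation_deg)
    x = event.distance_m * math.cos(elevation) * math.cos(azimuth)
    y = event.distance_m * math.cos(elevation) * math.sin(azimuth)
    z = event.distance_m * math.sin(elevation)
    return x, y, z


footsteps = SpatialEvent(1.0, 3.5, "footsteps", 1.0, 90.0, 0.0, 2.0)
print(to_cartesian(footsteps))
```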

Use Cases

Tasks

SEL involves determining the spatial location from which a sound originates within an environment. It goes beyond detection and classification to include the position in space relative to the listener or recording device.
SED is concerned with identifying the presence and duration of sound events within an audio stream. It uses both weak labels (Tags) for presence and strong labels (Events) for temporal localization of sound events.
SEC categorizes sounds into predefined classes and involves analyzing audio to assign a category based on the type of sound event it contains, using Tags for the entire clip’s duration.
ASC classifies an entire audio stream into a scene category, characterizing the recording’s environment. Tags are used to indicate the single acoustic scene represented in the clip.
AC involves generating a textual description of the sound events and context within an audio clip. It is similar to image captioning but for audio content.

Soundscapes

Urban environments are characterized by a blend of sounds from traffic, human activity, construction, and sometimes nature. Recordings in urban areas are often used to study noise pollution, city planning, or to create soundscapes for multimedia productions.
The spectrum of environmental sounds includes all the background noises found in various habitats. These auditory elements can be as diverse as the whisper of foliage in woodlands, the gentle flow of water in brooks, or the fierce gusts of wind sweeping through arid landscapes.
Machine sounds refer to the audio signatures of mechanical devices, such as engines, factory machinery, household appliances, and office equipment. These sounds are crucial for monitoring equipment performance, diagnosing faults, and designing sound-aware applications.
Bioacoustic sounds are produced by biological organisms, like the vocalizations of animals and birds. Studying these sounds can provide insights into animal behavior, biodiversity, and ecosystem health.
Music sounds encompass the vast array of musical compositions, instruments, and the human voice as used in singing. These sounds are central to the entertainment industry, cultural studies, and music therapy.