Five unique human-annotated descriptions for every audio clip.
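
A minimal sketch of how the five-captions-per-clip property could be checked before using the data. It assumes the captions are in a CSV with a `file_name` column and five columns named `caption_1` to `caption_5`; these names and the sample rows are illustrative assumptions, so check them against the header of the file you actually downloaded.

```python
import csv
import io

# Assumed column names; adjust to match the real CSV header.
CAPTION_COLUMNS = [f"caption_{i}" for i in range(1, 6)]


def clips_without_five_unique_captions(csv_file):
    """Return file names whose row lacks five distinct, non-empty captions."""
    flagged = []
    for row in csv.DictReader(csv_file):
        captions = [(row.get(col) or "").strip() for col in CAPTION_COLUMNS]
        # Compare case-insensitively and ignore empty cells.
        unique = {c.lower() for c in captions if c}
        if len(unique) != len(CAPTION_COLUMNS):
            flagged.append(row["file_name"])
    return flagged


# Made-up sample data for illustration only (not taken from the dataset).
SAMPLE = """file_name,caption_1,caption_2,caption_3,caption_4,caption_5
clip_a.wav,rain on a roof,water hits metal,a steady downpour,drops fall fast,heavy rain outside
clip_b.wav,a dog barks,a dog barks,a car passes,birds sing,wind blows
"""

if __name__ == "__main__":
    print(clips_without_five_unique_captions(io.StringIO(SAMPLE)))
```

For a real file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.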

Cite the original paper: Drossos, K., Lipping, S., & Virtanen, T. (2020). "Clotho: an Audio Captioning Dataset." Proc. IEEE ICASSP, pp. 736-740.

If you are writing a technical report or paper using this data, ensure you include these standard sections: