'News' corpus

Description

The 'news' corpus is made up of a set of news readings recorded in a studio by professional speakers. It contains two different subcorpora:

The 'prosodic' corpus: a set of 36 news readings selected according to prosodic criteria.
The 'phonetic' corpus: a set of 36 news readings selected to offer phonetic coverage.

The Spanish and Catalan subcorpora contain versions of the same news texts (originally written in Spanish, then translated into Catalan), so they can be considered parallel corpora.

Speakers

8 speakers per language:

Contents

Each news text item has a unique ID, which indicates:

the language of the recording: Catalan ('ca') or Spanish ('sp')
the speaker ID, indicated with its Glissando ID
the subcorpus: prosodic ('prn') or phonetic ('phn')
a two-figure ID indicating the news text number in the corpus

For example:

sp_f11r_prn01: news text 1, prosodic corpus, speaker 11 (female, radio professional), Spanish
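
The ID scheme above can be split into its four fields with a short script. The following is a minimal Python sketch, not part of the corpus distribution: the names ITEM_ID, NewsItem and parse_item_id are hypothetical, and the speaker ID is treated as an opaque token because only one example of a Glissando ID ('f11r') is given here.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example 'sp_f11r_prn01'.
# Assumption: fields are separated by underscores and the speaker ID
# consists of lowercase letters and digits only.
ITEM_ID = re.compile(
    r"^(?P<language>ca|sp)_"
    r"(?P<speaker>[a-z0-9]+)_"
    r"(?P<subcorpus>prn|phn)"
    r"(?P<number>\d{2})$"
)


class NewsItem(NamedTuple):
    language: str   # 'ca' (Catalan) or 'sp' (Spanish)
    speaker: str    # Glissando speaker ID, kept as-is
    subcorpus: str  # 'prn' (prosodic) or 'phn' (phonetic)
    number: int     # news text number within the subcorpus


def parse_item_id(item_id: str) -> NewsItem:
    """Split a news item ID into its four fields."""
    match = ITEM_ID.match(item_id)
    if match is None:
        raise ValueError(f"not a valid news item ID: {item_id!r}")
    return NewsItem(
        match["language"],
        match["speaker"],
        match["subcorpus"],
        int(match["number"]),
    )


print(parse_item_id("sp_f11r_prn01"))
# NewsItem(language='sp', speaker='f11r', subcorpus='prn', number=1)
```

IDs that do not follow the scheme raise a ValueError, which makes it easy to spot unexpected file names when iterating over a directory.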

The following items are available for each news text:

The speech signal (wav files, mono, 16000 Hz sampling rate):
- input signal from the fixed microphone ('.fix.wav' files)
- input signal from the wireless (headset) microphone ('.wir.wav' files)
- input signal from the laryngograph, when available ('.lar.wav' files)
The orthographic transcription (txt files, UTF-8)
The time-aligned phonetic and prosodic annotation (Praat TextGrid format)
Raw F0 and intensity values (Praat format):
- intensity values obtained from the '.fix.wav' speech file ('.fix.Intensity' files)
- intensity values obtained from the '.wir.wav' speech file ('.wir.Intensity' files)
- F0 values obtained from the '.fix.wav' speech file ('.fix.Pitch' files)
- F0 values obtained from the '.wir.wav' speech file ('.wir.Pitch' files)
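
The file list above can be turned into a quick completeness check for a given item. The sketch below is illustrative only and rests on assumptions that this description does not confirm: that each file is named with the item ID followed by the suffix, that all files of an item sit in the same directory, and that the transcription and annotation files use the '.txt' and '.TextGrid' extensions. The function names are hypothetical.

```python
import wave
from pathlib import Path

# Suffixes taken from the list above. '.txt' and '.TextGrid' are assumed
# extensions; the description only names the formats.
SUFFIXES = [
    ".fix.wav", ".wir.wav", ".lar.wav",
    ".txt", ".TextGrid",
    ".fix.Intensity", ".wir.Intensity",
    ".fix.Pitch", ".wir.Pitch",
]


def expected_files(item_id, root="."):
    """Map each suffix to the path where the file is assumed to be."""
    return {suffix: Path(root) / f"{item_id}{suffix}" for suffix in SUFFIXES}


def available_files(item_id, root="."):
    """Keep only the expected files that actually exist on disk."""
    return {
        suffix: path
        for suffix, path in expected_files(item_id, root).items()
        if path.is_file()
    }


def is_mono_16k(wav_path):
    """Check that a wav file is mono with a 16000 Hz sampling rate."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000
```

For example, available_files("sp_f11r_prn01", "corpus_dir") would return the subset of the nine files found in a (hypothetical) directory 'corpus_dir'; a missing '.lar.wav' entry is expected for items recorded without the laryngograph.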