Table 1: Abbreviations for languages used in Table 2
| Language | Abbreviation | Language | Abbreviation |
|---|---|---|---|
| Arabic | A | Japanese | J |
| Chinese | C | Korean | K |
| Czech | Cze | Malaysian | M |
| Danish | D | Norwegian | N |
| Dutch | Dut | Portuguese | P |
| English | E | Russian | R |
| French | F | Spanish | S |
| German | G | Swedish | Swe |
| Greek | Gre | Thai | T |
| Indonesian | Ind | Vietnamese | V |
| Italian | I | | |
The actual table with information about the different databases is shown in Table 2.
Table 2: Overview of non-native Databases
| Corpus | Author | Available at | Languages | #Speakers | Native Language | #Utt. | Duration | Date | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| AMI [2] | | EU | E | | Dut and other | | 100h | | meeting recordings |
| ATR-Gruhn [3] | Gruhn | ATR | E | 96 | C G F J Ind | 15000 | | 2004 | proficiency rating |
| BAS Strange Corpus 1+10 [4] | | ELRA | G | 139 | 50 countries | 7500 | | 1998 | |
| Berkeley Restaurant [5] | | ICSI | E | 55 | G I H C F S J | 2500 | | 1994 | |
| Broadcast News [6] | | LDC | E | | | | | 1997 | |
| Cambridge-Witt [7] | Witt | U. Cambridge | E | 10 | J I K S | 1200 | | 1999 | |
| Cambridge-Ye [8] | Ye | U. Cambridge | E | 20 | C | 1600 | | 2005 | |
| Children News [9] | Tomokiyo | CMU | E | 62 | J C | 7500 | | 2000 | partly spontaneous |
| CLIPS-IMAG [10] | Tan | CLIPS-IMAG | F | 15 | C V | | 6h | 2006 | |
| CLSU [11] | | LDC | E | | 22 countries | 5000 | | 2007 | telephone, spontaneous |
| CMU [12] | | CMU | E | 64 | G | 452 | 0.9h | | not available |
| Cross Towns [13] | Schaden | U. Bochum | E F G I Cze Dut | 161 | E F G I S | 72000 | 133h | 2006 | city names |
| Duke-Arslan [14] | Arslan | Duke University | E | 93 | 15 countries | 2200 | | 1995 | partly telephone speech |
| ERJ [15] | Minematsu | U. Tokyo | E | 200 | J | 68000 | | 2002 | proficiency rating |
| Fischer [16] | | LDC | E | | many | | 200h | | telephone speech |
| Fitt [17] | Fitt | U. Edinburgh | F I N Gre | 10 | E | 700 | | 1995 | city names |
| Fraenki [18] | | U. Erlangen | E | 19 | G | 2148 | | | |
| Hispanic [19] | Byrne | | E | 22 | S | | 20h | 1998 | partly spontaneous |
| HLTC [20] | | HKUST | E | 44 | C | | 3h | 2010 | available on request |
| IBM-Fischer [21] | | IBM | E | 40 | S F G I | 2000 | | 2002 | digits |
| iCALL [22][23] | Chen | I2R, A*STAR | C | 305 | 24 countries | 90841 | 142h | 2015 | phonetic and tonal transcriptions (in Pinyin), proficiency ratings |
| ISLE [24] | Atwell | EU/ELDA | E | 46 | G I | 4000 | 18h | 2000 | |
| Jupiter [25] | Zue | MIT | E | unknown | unknown | 5146 | | 1999 | telephone speech |
| K-SEC [26] | Rhee | SiTEC | E | unknown | K | | | 2004 | |
| LDC WSJ1 [27] | | LDC | | 10 | | 800 | 1h | 1994 | |
| LeaP [28] | Gut | University of Münster | E G | 127 | 41 different ones | 73.941 words | 12h | 2003 | |
| MIST [29] | | ELRA | E F G | 75 | Dut | 2200 | | 1996 | |
| NATO HIWIRE [30] | | NATO | E | 81 | F Gre I S | 8100 | | 2007 | clean speech |
| NATO M-ATC [31] | Pigeon | NATO | E | 622 | F G I S | 9833 | 17h | 2007 | heavy background noise |
| NATO N4 [32] | | NATO | E | 115 | unknown | | 7.5h | 2006 | heavy background noise |
| Onomastica [33] | | | D Dut E F G Gre I N P S Swe | | | (121000) | | 1995 | only lexicon |
| PF-STAR [34] | | U. Erlangen | E | 57 | G | 4627 | 3.4h | 2005 | children speech |
| Sunstar [35] | | EU | E | 100 | G S I P D | 40000 | | 1992 | parliament speech |
| TC-STAR [36] | Heuvel | ELDA | E S | unknown | EU countries | | 13h | 2006 | multiple data sets |
| TED [37] | Lamel | ELDA | E | 40(188) | many | | 10h(47h) | 1994 | eurospeech 93 |
| TLTS [38] | | DARPA | A | | E | | 1h | 2004 | |
| Tokyo-Kikuko [39] | | U. Tokyo | J | 140 | 10 countries | 35000 | | 2004 | proficiency rating |
| Verbmobil [40] | | U. Munich | E | 44 | G | | 1.5h | 1994 | very spontaneous |
| VODIS [41] | | EU | F G | 178 | F G | 2500 | | 1998 | about car navigation |
| WP Arabic [42] | Rocca | LDC | A | 35 | E | 800 | 1h | 2002 | |
| WP Russian [43] | Rocca | LDC | R | 26 | E | 2500 | 2h | 2003 | |
| WP Spanish [44] | Morgan | LDC | S | | E | | | 2006 | |
| WSJ Spoke [45] | | | E | 10 | unknown | 800 | | 1993 | |
Legend
Some abbreviations for language names are used in the table of non-native databases; they are listed in Table 1. Table 2 gives the following information about each corpus: the name of the corpus; the author; the institution where the corpus can be obtained, or where at least further information should be available; the language that was actually spoken by the speakers; the number of speakers; the native language of the speakers; the total number of non-native utterances the corpus contains; the duration in hours of the non-native part; the date of the first public reference to the corpus; some free text highlighting special aspects of the database; and a reference to another publication. In most cases the reference is to the paper in which the original collectors describe the corpus. Where no such paper could be identified, a paper that uses the corpus is referenced instead.
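The column layout described above maps naturally onto a small record type. The following is a minimal sketch in Python, not part of the original overview: the names `CorpusEntry`, `ABBREVIATIONS`, and `expand_languages` are hypothetical, and only a subset of the abbreviations from Table 1 is included.

```python
from dataclasses import dataclass
from typing import List, Optional

# Subset of the abbreviations from Table 1 (illustrative, not exhaustive).
ABBREVIATIONS = {
    "C": "Chinese", "E": "English", "F": "French", "G": "German",
    "I": "Italian", "Ind": "Indonesian", "J": "Japanese",
}

@dataclass
class CorpusEntry:
    """One row of Table 2; the field names are hypothetical."""
    corpus: str
    author: Optional[str]
    available_at: Optional[str]
    languages: Optional[str]        # language(s) actually spoken, abbreviated
    speakers: Optional[str]
    native_language: Optional[str]  # abbreviations or free text ("50 countries")
    utterances: Optional[str]
    duration: Optional[str]
    date: Optional[str]
    remarks: Optional[str]

def expand_languages(codes: str) -> List[str]:
    """Expand space-separated abbreviations; codes not in the map are kept as-is."""
    return [ABBREVIATIONS.get(code, code) for code in codes.split()]

# The ATR-Gruhn row from Table 2 (the duration cell is blank there).
atr = CorpusEntry("ATR-Gruhn", "Gruhn", "ATR", "E", "96",
                  "C G F J Ind", "15000", None, "2004", "proficiency rating")
print(expand_languages(atr.native_language))
# ['Chinese', 'German', 'French', 'Japanese', 'Indonesian']
```

Numeric columns are kept as strings in this sketch because several cells in Table 2 are not plain numbers, for example `40(188)` or `73.941 words`.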
Some entries are left blank and others are marked as unknown. Blank entries refer to attributes whose value is simply not known. Unknown entries, however, indicate that the database itself provides no information about the attribute. For example, the Jupiter weather database[46] gives no information about the origin of the speakers, so its data would be less useful for verifying accent detection or similar issues.
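When the table is processed programmatically, the distinction between blank and unknown entries is easy to lose if both are collapsed into a single missing value. A minimal sketch in Python that keeps the two cases apart; the names `CellStatus` and `classify_cell` are hypothetical and used only for illustration.

```python
from enum import Enum

class CellStatus(Enum):
    VALUE = "value"      # the cell holds an actual value
    BLANK = "blank"      # the value is simply not known to the overview
    UNKNOWN = "unknown"  # the database itself gives no information

def classify_cell(cell: str) -> CellStatus:
    """Classify a raw table cell as a value, a blank, or an explicit 'unknown'."""
    text = cell.strip()
    if not text:
        return CellStatus.BLANK
    if text.lower() == "unknown":
        return CellStatus.UNKNOWN
    return CellStatus.VALUE

# Cells from the Jupiter row: speakers "unknown", duration blank, 5146 utterances.
print([classify_cell(c).value for c in ["unknown", "", "5146"]])
# ['unknown', 'blank', 'value']
```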
Where possible, the standard name of the corpus is used. Some of the smaller corpora, however, had no established name, so an identifier had to be created. In such cases, a combination of the institution and the collector of the database is used.
Where a database contains both native and non-native speech, only attributes of the non-native part of the corpus are listed. Most of the corpora are collections of read speech. If a corpus instead consists partly or completely of spontaneous utterances, this is mentioned in the Remarks column.