Collection of the corpus involved considerable work in what has been called corpus standardization or text preprocessing. Given the wide range of sources, the documents arrived in many different file types, character encodings, and formatting conventions: for example, EBCDIC, MacRoman, ISO, and UTF-8 character encodings, as well as HTML markup. In many cases unneeded material such as images, advertisements, or templatic information had to be stripped out, a process called document scrubbing.
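As an illustration, the following Perl sketch shows the kind of conversion pass involved in scrubbing; the command-line interface, encoding names, and entity list are illustrative assumptions rather than the scripts actually used for the corpus.

    #!/usr/bin/perl
    # A minimal scrubbing/conversion sketch: read a file in a known legacy
    # encoding, strip HTML/SGML markup and a few entities, and emit UTF-8.
    # The interface and the crude tag-stripping regex are assumptions,
    # not the project's actual scripts.
    use strict;
    use warnings;
    use Encode qw(decode);

    my ($file, $enc) = @ARGV;                 # e.g. story.html iso-8859-1
    die "usage: $0 FILE ENCODING\n" unless defined $enc;

    open my $in, '<:raw', $file or die "cannot open $file: $!";
    my $raw = do { local $/; <$in> };         # slurp the whole file
    close $in;

    my $text = decode($enc, $raw);            # 'iso-8859-1', 'MacRoman', 'cp1047' (EBCDIC), ...
    $text =~ s/<script\b.*?<\/script>//gis;   # drop scripts wholesale
    $text =~ s/<[^>]+>/ /g;                   # crude tag stripping
    $text =~ s/&eacute;/\x{00E9}/g;           # expand a few common entities
    $text =~ s/&nbsp;/ /g;
    $text =~ s/&amp;/&/g;

    binmode STDOUT, ':encoding(UTF-8)';
    print $text;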
Each type of transcription or text document was then processed so that its paragraphs, sentences, words, and characters were identified and encoded in a standard way to enable further processing, a process called tokenization.
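For instance, the identified units might be encoded with lightweight markup of the following sort; the <p>/<s>/<w> scheme and the naive sentence-splitting rule below are hypothetical illustrations, not the encoding actually adopted for the corpus.

    #!/usr/bin/perl
    # Sketch of encoding tokenized text "in a standard way": blank-line-
    # separated paragraphs are wrapped in <p>, sentences in <s>, words in
    # <w>. A real splitter must also handle abbreviations, ellipses, etc.
    use strict;
    use warnings;
    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    local $/;                                  # slurp all input
    my $text = <STDIN>;

    for my $para (split /\n\s*\n/, $text) {    # paragraphs = blank-line blocks
        print "<p>\n";
        # naive sentence split: ., !, or ? followed by whitespace
        for my $sent (split /(?<=[.!?])\s+/, $para) {
            my @words = split /\s+/, $sent;
            print "  <s>", join(' ', map { "<w>$_</w>" } @words), "</s>\n";
        }
        print "</p>\n";
    }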
Table 1 Composition of 23 million word French corpus

Sources                                          Approx. # of words

Spoken
  Conversations (3)                                       175,000
  Canadian Hansard (4)                                  3,750,000
  Misc. interviews/transcripts (5)                      3,020,000
  European Union parliamentary debates (6)              1,000,000
  Telephone conversations (7)                             855,000
  Theatre dialogue/monologue (8)                          470,000
  Film subtitles (9)                                    2,230,000
  TOTAL                                                11,500,000

Written
  Newswire stories (10)                                 3,000,000
  Newspaper stories (11)                                2,015,000
  Literature (fiction, non-fiction) (12)                4,734,000
  Popular science magazine articles (13)                  434,000
  Newsletters, tech reports, user manuals (14)          1,317,000
  TOTAL                                                11,500,000

GRAND TOTAL                                            23,000,000
3 The French portion of the C-ORAL-ROM corpus (Cresti & Moneglia 2005).
4 Aligned Hansards of the 36th Parliament of Canada; for more information consult http://www.isi.edu/natural-language/download/hansard/.
5 Miscellaneous transcripts of interviews with various business, political, artistic, and academic personalities mined from hundreds of Internet sites. Many were from media sites such as French television studios (e.g. www.tf1.fr and www.france2.fr), publishing houses (www.lonergan.fr), popular culture websites (e.g. www.evene.fr), and business information portals (e.g. http://www.journaldunet.com).
6 A small random sampling from the French portion of the Multilingual Corpora for Cooperation (MLCC) corpus. See resource W0023 at www.elda.fr.
7 Aligned transcribed training data from the ESTER Phase 2 evaluation campaign; downloaded from http://www.irisa.fr/metiss/guig/ester/.
8 A small random sampling of extracts from theatrical works posted at various sites including www.leproscenium.fr.
9 Selected portions of several film subtitles from Jörg Tiedemann’s OPUS corpus; downloaded from http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php.
10 A tiny random sampling of stories from the French GigaWord corpus; for more information see http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17.
11 A sampling from newspaper articles on the Internet from journalism sites throughout the French-speaking world (e.g. www.lemonde.fr, www.ledevoir.com).
12 Samples and complete short works of fiction and non-fiction from various publishing houses (e.g. www.edition-grasset.fr, www.lonergan.fr) and Web virtual libraries (e.g. www.gutenberg.org).
13 A variety of articles from popular science magazine sites on the Internet (e.g. www.pourlascience.com, www.larecherche.com).
14 A variety of technical report and newsletter articles including weather bulletins, user manuals, business newsletters, and banking correspondence. Some of these materials are sampled from the French portions of the European Corpus Initiative (see http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17).
The scrubbing and tokenization processes involved linguistic issues that had to be addressed, such as deciding how to break up words separated by hyphens (dis-moi vs. week-end) and apostrophes (l’homme vs. aujourd’hui). Some documents had accented upper-case letters whereas others did not, so the process of case folding – reducing capitalized words to their lower-case form – was also nontrivial. Many special symbols, including degree signs, ellipsis punctuation, currency symbols, bullets, and dots, also required standardization. To perform all of this work we used several file conversion programs as well as our own Perl scripts, Unix tools (e.g. make, awk, grep, sort, uniq, join, comm), and SGML/HTML/XML parsers.
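A minimal Perl sketch of these word-level decisions might look as follows; the exception lists are tiny illustrative stand-ins for the full lexicon such decisions actually require, and lc() handles accent-preserving case folding.

    #!/usr/bin/perl
    # Sketch of the word-level decisions discussed above: split clitic
    # hyphens and elided apostrophes but keep lexicalized forms whole,
    # then case-fold without losing accents. The exception lists are
    # illustrative assumptions, not the project's actual lexicon.
    use strict;
    use warnings;
    use utf8;                                  # source contains accented chars
    binmode STDOUT, ':encoding(UTF-8)';

    my %keep_hyphen     = map { $_ => 1 } qw(week-end rendez-vous);
    my %keep_apostrophe = map { $_ => 1 } ("aujourd'hui", "quelqu'un");

    sub tokenize_word {
        my ($w) = @_;
        $w =~ s/\x{2019}/'/g;                  # normalize curly apostrophes
        return ($w) if $keep_hyphen{lc $w} || $keep_apostrophe{lc $w};
        return split /(?<=')/, $w if $w =~ /'/;   # l'homme  -> l' + homme
        return split /(?=-)/,  $w if $w =~ /-/;   # dis-moi  -> dis + -moi
        return ($w);
    }

    # case folding: lc() on decoded strings maps É -> é, keeping accents
    my @tokens = map { lc } map { tokenize_word($_) }
                 split /\s+/, "Dis-moi si l'Étudiant arrive aujourd'hui";
    print "@tokens\n";    # dis -moi si l' étudiant arrive aujourd'hui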