Reading A Frequency Dictionary of French (2009) in full

Collecting the corpus required a great deal of what has been called corpus standardization or text preprocessing. Because the corpus draws on a wide range of sources, the documents came in many different file types, character encodings, and formatting conventions, including EBCDIC, MacRoman, ISO, UTF-8, and HTML. In many cases unneeded material such as images, advertisements, or templatic information had to be stripped out, a process called document scrubbing.
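The standardization and scrubbing steps described above can be illustrated with a short Python sketch. It is only an illustration, not the authors' tooling: the function names, the codec table, and in particular the choice of `cp500` as the EBCDIC variant are assumptions made for the example. It also assumes the encoding of each source is already known, since single-byte encodings such as MacRoman and ISO-8859-1 cannot be told apart reliably by trial decoding.

```python
import unicodedata
from html.parser import HTMLParser

# Assumed mapping from the encoding labels named in the text to Python codecs.
# "cp500" is one of several EBCDIC code pages; the source does not say which was used.
CODECS = {
    "EBCDIC": "cp500",
    "MACROMAN": "mac_roman",
    "ISO": "iso-8859-1",
    "UTF-8": "utf-8",
}


def decode_document(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known source encoding and normalize to Unicode NFC."""
    text = raw.decode(CODECS[source_encoding])
    # NFC makes "e" + combining acute and precomposed "é" compare equal.
    return unicodedata.normalize("NFC", text)


class _TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def scrub_html(html: str) -> str:
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

A real scrubbing pass would also need site-specific rules for advertisements and templatic material, which cannot be recognized from markup alone.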

Each type of transcription or text document was then processed so that paragraphs, sentences, words, and characters were identified and encoded in a standard way for further processing, a step called tokenization. Scrubbing and tokenization raised linguistic issues that had to be addressed, such as deciding on how to break up

Table 1 Composition of 23 million word French corpus

Spoken
Approx. # of words    Sources
       175,000        Conversations (3)
     3,750,000        Canadian Hansard (4)
     3,020,000        Misc. interviews/transcripts (5)
     1,000,000        European Union parliamentary debates (6)
       855,000        Telephone conversations (7)
       470,000        Theatre dialogue/monologue (8)
     2,230,000        Film subtitles (9)
    11,500,000        TOTAL

Written
Approx. # of words    Sources
     3,000,000        Newswire stories (10)
     2,015,000        Newspaper stories (11)
     4,734,000        Literature (fiction, non-fiction) (12)
       434,000        Popular science magazine articles (13)
     1,317,000        Newsletters, tech reports, user manuals (14)
    11,500,000        TOTAL

    23,000,000        GRAND TOTAL

3 The French portion of the C-ORAL-ROM corpus (Cresti & Moneglia 2005).

4 Aligned Hansards of the 36th Parliament of Canada; for more information consult http://www.isi.edu/natural-language/download/hansard/.

5 Miscellaneous transcripts of interviews with various business, political, artistic, and academic personalities mined from hundreds of Internet sites. Many were from media sites such as French television studios (e.g. www.tf1.fr and www.france2.fr), publishing houses (www.lonergan.fr), popular culture websites (e.g. www.evene.fr), and business information portals (e.g. http://www.journaldunet.com).

6 A small random sampling from the French portion of the Multilingual Corpora for Cooperation (MLCC) corpus. See resource W0023 at www.elda.fr.

7 Aligned transcribed training data from the ESTER Phase 2 evaluation campaign; downloaded from http://www.irisa.fr/metiss/guig/ester/.

8 A small random sampling of extracts from theatrical works posted at various sites including www.leproscenium.fr.

9 Selected portions of several film subtitles from Jörg Tiedemann’s OPUS corpus; downloaded from http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php.

10 A tiny random sampling of stories from the French GigaWord corpus; for more information see http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17.

11 A sampling from newspaper articles on the Internet from journalism sites throughout the French-speaking world (e.g. www.lemonde.fr, www.ledevoir.com).

12 Samples and complete short works of fiction and non-fiction from various publishing houses (e.g. www.edition-grasset.fr, www.lonergan.fr) and Web virtual libraries (e.g. www.gutenberg.org).

13 A variety of articles from popular science magazine sites on the Internet (e.g. www.pourlascience.com, www.larecherche.com, etc.).

14 A variety of technical report and newsletter articles including weather bulletins, user manuals, business newsletters, and banking correspondence. Some of these materials are sampled from the French portions of the European Corpus Initiative (see http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17).

words separated by hyphens (dis-moi vs. week-end) and apostrophes (l’homme vs. aujourd’hui). Some documents had accented uppercase letters whereas others did not, so case folding (reducing capitalized words to their lower-case form) was also nontrivial. Many special symbols, including degree signs, ellipsis punctuation, currency symbols, bullets, and dots, also required standardization. To perform all of this work we used several file conversion programs as well as our own Perl scripts, Unix tools (e.g. make, awk, grep, sort, uniq, join, comm), and SGML/HTML/XML parsers.
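The hyphen, apostrophe, and case-folding decisions mentioned above can be made concrete with a small Python sketch. It is written in Python for illustration rather than in the Perl the authors report using, and every name in it is hypothetical: the `LEXICALIZED` exception list, the `ENCLITICS` set, and the `SYMBOLS` table are tiny illustrative samples, not the lists actually applied to the corpus. The sketch keeps lexicalized forms such as week-end and aujourd'hui whole, splits elided forms such as l'homme, and peels pronouns off forms such as dis-moi.

```python
import re
import unicodedata

# Hypothetical exception list: forms kept whole despite a hyphen or apostrophe.
LEXICALIZED = {"week-end", "aujourd'hui"}

# Illustrative subset of pronouns that attach to a verb with a hyphen.
ENCLITICS = {"je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles",
             "moi", "toi", "le", "la", "les", "lui", "leur", "y", "en", "ce"}

# Assumed symbol standardization: typographic apostrophe and ellipsis character.
SYMBOLS = {"\u2019": "'", "\u2026": "..."}

# A chunk is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def fold(text: str) -> str:
    """Normalize to NFC, standardize symbols, and lowercase.

    Lowercasing handles accented capitals (ÉTÉ -> été). Restoring accents
    that were never typed (ETE) would need a lexicon and is not attempted.
    """
    text = unicodedata.normalize("NFC", text)
    for symbol, replacement in SYMBOLS.items():
        text = text.replace(symbol, replacement)
    return text.lower()


def split_chunk(chunk: str) -> list:
    """Split one folded chunk into tokens."""
    tokens = []
    # Elision: the part before each apostrophe becomes its own token (l').
    while chunk not in LEXICALIZED and "'" in chunk:
        head, _, chunk = chunk.partition("'")
        tokens.append(head + "'")
    if chunk in LEXICALIZED:
        return tokens + [chunk]
    # Hyphen: peel enclitic pronouns off the right edge (dis-moi -> dis, moi).
    parts = chunk.split("-")
    tail = []
    while len(parts) > 1 and parts[-1] in ENCLITICS:
        tail.insert(0, parts.pop())
    tokens.append("-".join(parts))
    return tokens + tail


def tokenize(text: str) -> list:
    """Fold the text, then split it into word tokens."""
    tokens = []
    for match in WORD.finditer(fold(text)):
        tokens.extend(split_chunk(match.group()))
    return tokens
```

For example, `tokenize("Dis-moi, l’homme part en week-end aujourd’hui.")` yields `dis`, `moi`, `l'`, `homme`, `part`, `en`, `week-end`, `aujourd'hui`. Hyphenated compounds that do not end in a listed pronoun, such as peut-être, are left whole.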
