User:Prosfilaes/Esperanto corpus

Introduction to the Corpus

This is a corpus of all the Esperanto material I found on Project Gutenberg on January 15th, 2011. I converted all of the text to lowercase. I replaced a trailing "as", "is", "os", "us", or "u" with "i", and removed a trailing "jn", "j" or "n". (This introduced a couple of errors in prepositions and the like; tamen, nun, kun, jen, en, etc. all lost their final n, and correlatives lost their trailing u.) Many texts are not independent by CFI standards; several books are translated by Edwin Grobe, several others are translations of Ibsen by Odd Tangerud, and a number are issues of the Esperantist. There is probably a little English or French left in, though I took care to strip the English parts of the Esperantist. Ultimately, this list must be used with a certain degree of care.
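The normalization described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function name normalize is hypothetical, and it assumes that the verb-ending rule is tried first and that at most one rule is applied per word.

```python
# Hypothetical sketch of the normalization; rule order is an assumption.
VERB_ENDINGS = ("as", "is", "os", "us", "u")   # replaced by "i"
PLURAL_CASE_ENDINGS = ("jn", "j", "n")         # removed; "jn" is tried first


def normalize(word):
    """Lowercase a word and collapse its grammatical ending."""
    word = word.lower()
    for ending in VERB_ENDINGS:
        if word.endswith(ending):
            return word[:-len(ending)] + "i"
    for ending in PLURAL_CASE_ENDINGS:
        if word.endswith(ending):
            return word[:-len(ending)]
    return word


print(normalize("Estas"))    # esti
print(normalize("hundojn"))  # hundo
print(normalize("tamen"))    # tame (one of the errors noted above)
print(normalize("kiu"))      # kii (a correlative losing its u)
```

Because the rules look only at the last letters of a word, function words such as tamen and kun are mangled in exactly the way described.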

There were 49,249 "words" found at least once, and 18,476 found at least twice. The numbered links are labeled by PG numbers, and link back to the book on PG's servers.
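Counts like these can be obtained with a simple frequency table. The sketch below is an assumption about the method, not the original script: the names count_words and vocabulary_sizes are hypothetical, the tokenizer simply treats every run of letters as a word, and it counts lowercased tokens without applying the ending normalization.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercased runs of letters (assumed tokenization)."""
    # [^\W\d_]+ matches Unicode letters, including ĉ, ĝ, ĥ, ĵ, ŝ and ŭ.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def vocabulary_sizes(counts):
    """Return (words seen at least once, words seen at least twice)."""
    at_least_once = len(counts)
    at_least_twice = sum(1 for c in counts.values() if c >= 2)
    return at_least_once, at_least_twice


counts = count_words("La hundo kaj la kato")
print(vocabulary_sizes(counts))  # (4, 1)
```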

The first few red links were errors, but the first real missing word found is terura, with 60 occurrences in our small corpus.