CORPORA: 1.9 billion - 45 million words each: free online access
The BYU Wikipedia corpus, which was released in early 2015, was created by Mark Davies (professor of linguistics at Brigham Young University). It contains 1.9 billion words in 4.4 million web pages, and you can search the entire corpus with the same type of queries as the other BYU corpora. More importantly, though, you can also quickly and easily create "virtual" corpora "on the fly" for any topic that you want, such as: biology, investments, Buddhism, psychology, cars, basketball.
Concordancers in ELT
This has enabled linguists to create and analyse huge corpora (collections of authentic language text) and to reassess the assumed rules regarding the way we use language and especially words. With the spread of the Internet, these corpora are now becoming available to any teacher or student with an Internet connection, opening up a vast resource for language learning.
Leeds collection of Internet corpora
The Internet corpora used here were developed using the same methodology as outlined in Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini (eds.), WaCky! Working papers on the Web as Corpus. Gedit, Bologna. Steps 2 and 3 above use customised versions of tools from Marco Baroni's BootCaT, which also has a very extensive description of installation requirements and tool functions. Have a look at them.
Words and phrases: frequency, genres, collocates, concordances, synonyms, and WordNet
Cambridge University Press
Language Research at Cambridge. Cambridge University Press is committed to language research: the investigation of written and spoken English in order to understand more about how we use language. Our research helps to inform and improve our English Language Teaching resources. All of our authors and editors have access to the language research facilities at Cambridge. Our language research features in most of our materials. In particular, we use it to:
Centre National de Ressources Textuelles et Lexicales
Frantext. Drawn from the Frantext database, the Frantext "copyright-free texts" corpus offers the research community a broad field of investigation, bringing together 500 works of French literature spanning the 18th to the 20th century. The computational processing of the textual data in TEI XML format was carried out by the ATILF laboratory. The search interface lets you make selections within the corpus by text genre, author, period…
Sentence Examples
explorationdecorpus.corpusecrits.huma-num
OPUS - an open source parallel corpus
Hansard Corpus: British Parliament, 1803-2005