What we learned from 5 million books

http://www.ted.com/talks/what_we_learned_from_5_million_books.html

Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space. [Figure: Global geocoded tone of all Summary of World Broadcasts content, January 1979–April 2011, mentioning “Bin Laden”. Credit: UIC] Recent literature suggests that computational analysis of large text archives can yield novel insights into the functioning of society, including predicting future economic events, says Kalev Leetaru, Assistant Director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts, and Social Science at the University of Illinois and Center Affiliate of the National Center for Supercomputing Applications. The emerging field of “Culturomics” seeks to explore broad cultural trends through the computerized analysis of vast digital book archives, offering novel insights into the functioning of human society. Books, however, represent the “digested history” of humanity, written with the benefit of hindsight.

Joël de Rosnay: Discovering Web 5.0. Joël de Rosnay is a French biologist, initially a specialist in the origins of life and in new technologies, and later in systems science and foresight. After three years of research and teaching at MIT, he became director of research applications at the Institut Pasteur, then director of foresight and evaluation at the Cité des sciences et de l’industrie at La Villette. He founded AgoraVox in May 2005 and currently heads a consulting firm. Joël de Rosnay, Doctor of Science, is Director of Foresight and Evaluation at the Cité des Sciences et de l’Industrie de la Villette.

Królicza Nora Google: The largest linguistic corpus of all time. When I was a student, in the late 1970s, I would never have dared to imagine, even in my wildest dreams, that the scientific community would one day have the means to analyze computerized text corpora of several hundred billion words. At the time, I marveled at the Brown Corpus, which contained the extraordinary quantity of one million words of American English and which, after being used to compile the American Heritage Dictionary, had been made fairly widely available to researchers. That corpus, despite a size that now seems trivial, enabled an impressive number of studies and contributed greatly to the rise of language technologies... I was fortunate to have access to the study before publication, and it left me somewhat dizzy...

Culturomics. External link: Culturomics.org, website by The Cultural Observatory at Harvard directed by Erez Lieberman Aiden and Jean-Baptiste Michel.

In 500 Billion Words, a New Window on Culture. The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian. The intended audience is scholarly, but a simple online tool allows anyone with a computer to plug in a string of up to five words and see a graph that charts the phrase’s use over time, a diversion that can quickly become as addictive as the habit-forming game Angry Birds. With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold. The lines eventually cross paths about 1986. The data set can be downloaded, and users can build their own search tools.
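The excerpt above describes per-year counts for a phrase and a chart in which two lines "cross paths". As a rough sketch of how such a comparison could be computed from a downloaded table of counts, the Python below normalizes yearly counts by a yearly total and looks for the first year in which one phrase overtakes another. The function names, the dictionary layout, and every number in the sample are illustrative assumptions, not the actual format or contents of the data set.

```python
def relative_frequency(counts, totals):
    """Share of all words in each year that belong to the phrase.

    counts: {year: occurrences of the phrase}   (hypothetical layout)
    totals: {year: total words counted that year}
    """
    return {year: counts.get(year, 0) / totals[year] for year in sorted(totals)}


def first_crossover(series_a, series_b):
    """First year in which series_a exceeds series_b after having been
    at or below it in some earlier year. Returns None if that never happens."""
    was_at_or_below = False
    for year in sorted(series_a):
        a = series_a[year]
        b = series_b.get(year, 0.0)
        if a <= b:
            was_at_or_below = True
        elif was_at_or_below:
            return year
    return None


# Invented toy values, used only to exercise the functions.
totals = {1900: 1000, 1901: 1000, 1902: 2000}
phrase_a = relative_frequency({1900: 1, 1901: 3, 1902: 10}, totals)
phrase_b = relative_frequency({1900: 4, 1901: 4, 1902: 8}, totals)
print(first_crossover(phrase_a, phrase_b))
```

Dividing by the yearly total matters because the number of books varies widely from year to year, so raw counts alone would mostly track how much was published.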

Culturomics research uses quarter-century of media coverage to forecast human behavior "Culturomics" is an emerging field of study into human culture that relies on the collection and analysis of large amounts of data. A previous culturomic research effort used Google's culturomic tool to examine a dataset made up of the text of about 5.2 million books to quantify cultural trends across seven languages and three centuries. Now a new research project has used a supercomputer to examine a dataset made up of a quarter-century of worldwide news coverage to forecast and visualize human behavior.

Our adventures in culturomics. Peter Aldhous, Jim Giles and MacGregor Campbell, reporters. Here in New Scientist's San Francisco bureau we can't resist an invitation to participate in an entirely new field of research.

Googlefight! by Avraham Roos. Googlefight.com. At first sight, googlefight seems like a total waste of time and (because of the fighting) even completely uneducational.

Googlewhack. A Googlewhack is a type of contest for finding a Google search query consisting of exactly two words without quotation marks, that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary. A Googlewhack is considered legitimate if both of the searched-for words appear in the result page.

English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU Now we show the letter frequencies by position within word. That is, the frequencies for just the first letter in each word, just the second letter, and so on. We also show frequencies for positions relative to the end of the word: "-1" means the last letter, "-2" means the second to last, and so on. We can see that the frequencies vary quite a bit; for example, "e" is uncommon as the first letter (4 times less frequent than elsewhere); similarly "n" is 3 times less common as the first letter than it is overall. The letter "e" makes a comeback as the most common last letter (and also very common at 3rd and 5th letter places).
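The positional counts described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the original study: the function name, the `max_pos` parameter, and the sample sentence are assumptions, and words are taken to be runs of the letters a to z after lowercasing.

```python
import re
from collections import Counter, defaultdict


def positional_letter_counts(text, max_pos=3):
    """Count letters by position within each word.

    Returns two mappings of position -> Counter of letters:
    from_start uses keys 1, 2, ... (first letter, second letter, ...);
    from_end uses keys -1, -2, ... (last letter, second to last, ...).
    """
    from_start = defaultdict(Counter)
    from_end = defaultdict(Counter)
    for word in re.findall(r"[a-z]+", text.lower()):
        # Positions counted from the start of the word.
        for i, ch in enumerate(word[:max_pos], start=1):
            from_start[i][ch] += 1
        # Positions counted from the end of the word.
        for i, ch in enumerate(reversed(word[-max_pos:]), start=1):
            from_end[-i][ch] += 1
    return from_start, from_end


sample = "the quick brown fox jumps over the lazy dog"
starts, ends = positional_letter_counts(sample)
print(starts[1].most_common(1))  # most common first letter in the sample
print(ends[-1].most_common(1))   # most common last letter in the sample
```

To compare positions as the text does ("e" being several times less frequent as a first letter than elsewhere), each counter would be divided by its own total, since raw counts differ between positions when words are shorter than `max_pos`.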

Ideas Illustrated: Visualizing English Word Origins. I have been reading a book on the development of the English language recently and I’ve become fascinated with the idea of word etymology, the study of words and their origins. It’s no secret that English is a great borrower of foreign words but I’m not enough of an expert to really understand what that means for my day-to-day use of the language. Simply reading about word history didn’t help me, so I decided that I really needed to see some examples. Using Douglas Harper’s online dictionary of etymology, I paired up words from various passages I found online with entries in the dictionary. For each word, I pulled out the first listed language of origin and then re-constructed the text with some additional HTML infrastructure.

Search engine data visualisations. I’ve decided I need a single place to put all of the search engine data visuals that I’ve been working on. The visuals are made up of thousands of actual queries put into search engines by UK users over the course of a year. This gives us an idea of ‘search demand’ which can/may/should equal actual, offline demand for a topic.

The Most Popular Words in the Most Viral Headlines. There is no one way to create viral content. So many different variables go into a viral post (timing, emotion, engagement, and so many others) that you cannot control. There is no viral blueprint. The greatest chance we have to understand viral content is to study the posts and places that do it best, figure out what worked for them, and try it for ourselves. Thanks to some incredible work by the team at Ripenn, we have access to headline analysis from four of the top viral sites on the web, who happen to be really good at headline writing.