
Scraping for Journalism: A Guide for Collecting Data

Photo by Dan Nguyen/ProPublica

Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We've written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of a moderately experienced programmer. These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice with no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites -- so you know what you're asking for if you end up hiring someone to do the technical work for you.

The tools

With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open source.

Google Refine (formerly known as Freebase Gridworks) -- a sophisticated application that makes data cleaning a snap.
Ruby -- the programming language we use the most at ProPublica.
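
As a taste of what "scraping" means in practice, here is a minimal sketch in Ruby, the guides' language of choice. It assumes the open-uri and nokogiri libraries and an entirely hypothetical page layout; the actual guides go much deeper.

    # A minimal scraping sketch, assuming the nokogiri gem and a hypothetical
    # page that lists payments in an HTML table. Not ProPublica's actual code.
    require "open-uri"
    require "nokogiri"
    require "csv"

    page = Nokogiri::HTML(URI.open("https://example.com/payments"))  # hypothetical URL
    CSV.open("payments.csv", "w") do |csv|
      page.css("table#payments tr").each do |row|                    # hypothetical selector
        csv << row.css("td").map { |cell| cell.text.strip }
      end
    end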

Data Extraction and Web Scraping

A key activity in web automation is the extraction of data from websites, also known as web scraping or screen scraping. Whether it is price lists, stock information, financial data or any other type of data, iMacros can extract it for you and either re-use it or store it in a file or database. iMacros can write extracted data to standard text files, including the comma-separated value (.csv) format, readable by spreadsheet packages.

The EXTRACT command

Data extraction is specified by an EXTRACT parameter on the TAG command:

    TAG POS=1 TYPE=SPAN ATTR=CLASS:bdytxt&&TXT:* EXTRACT=HTM

The syntax is the same as for the ordinary TAG command, with the type of extraction specified by the additional EXTRACT parameter. The Extraction Wizard can be used to automatically generate and test extractions.
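
For readers working in Ruby rather than iMacros, roughly the same extraction can be sketched with the nokogiri gem. This is our illustration of the idea, not part of the iMacros documentation, and the URL is a placeholder:

    # Rough Ruby/Nokogiri equivalent of the TAG ... EXTRACT=HTM command above:
    # grab the HTML of the first <span class="bdytxt"> on a page.
    require "open-uri"
    require "nokogiri"

    page = Nokogiri::HTML(URI.open("https://example.com/"))  # hypothetical URL
    span = page.at_css("span.bdytxt")    # first match, like TAG POS=1
    puts span.to_html if span            # EXTRACT=HTM: the element's HTML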

Jumper 2.0

ApexKB (formerly Jumper 2.0) is an open-source web application script for collaborative search and knowledge management, powered by a shared enterprise bookmarking engine that is a fork of KnowledgebasePublisher.[1] It was publicly announced on 29 September 2008.[2] A stable version of Jumper (version 2.0.1.1) was publicly released under the GNU General Public License and made available on SourceForge on 26 March 2009 as a free software download.[3] Jumper is Enterprise 2.0 software that empowers users to compile and share collaborative bookmarks by crowdsourcing their knowledge, experience and insights, using knowledge tags.

Features

Jumper 2.0 is enterprise web infrastructure for tagging and linking information resources.[5] It lets you search and share high-value content, media or data across remote locations, using knowledge tags to capture knowledge about the information in distributed storage. It collects these tags in a tag profile.

Password Haystacks: How Well Hidden is Your Needle?

Every password you use can be thought of as a needle hiding in a haystack. After all searches of common passwords and dictionaries have failed, an attacker must resort to a "brute force" search -- ultimately trying every possible combination of letters, numbers and then symbols until the combination you chose is discovered. If every possible password is tried, sooner or later yours will be found. The question is: will that be too soon, or late enough? This interactive brute-force search space calculator allows you to experiment with password length and composition to develop an accurate and quantified sense of the safety of using passwords that can only be found through exhaustive search. (The Haystack Calculator has been viewed 8,794,660 times since its publication.)

IMPORTANT: it is NOT a "password strength meter." Since it could easily be confused for one, it is very important to understand what it is, and what it isn't.
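
The arithmetic behind the calculator is easy to reproduce: for an alphabet of A characters and passwords of up to L characters, an exhaustive search must try A + A^2 + ... + A^L combinations. A small Ruby sketch of that count (our own illustration, not GRC's code):

    # Brute-force search space for passwords up to max_len characters drawn
    # from an alphabet of alphabet_size symbols. Illustrative only.
    def search_space(alphabet_size, max_len)
      (1..max_len).sum { |n| alphabet_size**n }
    end

    # e.g. lowercase letters only (26) vs. all 95 printable ASCII characters
    puts search_space(26, 8)   # => 217180147158
    puts search_space(95, 8)   # => 6704780954517120

Note how adding symbol types grows the haystack far faster than adding a character or two of length alone does not; it is the combination of length and alphabet size in A^L that matters.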

Data journalism: a vector of meaning and profit

Faced with an avalanche of information, data-mining techniques make it possible to extract meaning from databases. Trust is becoming the scarce resource that creates value, and the media can seize it. This post takes up ideas first developed with Mirko Lorenz and Geoff McGhee in an article entitled Media Companies Must Become Trusted Data Hubs, and presented at the re:publica XI conference. Every day we produce two to three exabytes of data (one exabyte is a million terabytes). To condense all the information produced into something digestible for the end user, it must be summarized by a factor of 100 billion. To make sense of this hyper-abundance of content, news professionals must adopt new techniques. Equipped with the right tools, making masses of data talk becomes possible. All information is data, and some initiatives are already moving in this direction.

Journalists' Toolkit, Chapter 1: Using Google Refine to Clean Messy Data

Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a "power tool for working with messy data" but could very well be advertised as a "remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning." Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and for journalists well versed in Access and Excel, Refine can greatly reduce the time spent on the most tedious part of data management.

Other reasons why you should try Google Refine:

It's free.
It works in any browser and uses a point-and-click interface similar to Google Docs.
Despite the Google moniker, it works offline; there's no requirement to send anything across the Internet.
There's a host of convenient features, such as an undo function and a way to visualize your data's characteristics.

Photo by daniel.gene
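
To see the sort of tedium Refine automates, compare a hand-rolled cleanup in Ruby -- our illustration with made-up data, not part of the chapter; Refine's "cluster and edit" feature does the same with a few clicks:

    # Hand-rolled data cleaning: trim whitespace and normalize case so variant
    # spellings collapse to one value. The company names are made up examples.
    names = ["Pfizer ", "pfizer", "PFIZER Inc.", " Eli Lilly", "Eli  Lilly"]

    cleaned = names.map do |n|
      n.strip.squeeze(" ").downcase.sub(/\s+inc\.?$/, "")  # trim, collapse spaces, drop suffix
    end
    puts cleaned.uniq   # => pfizer, eli lilly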

How To Prepare Yourself For Online Education

Online education refers to any learning process that is partially or completely delivered or supported using computer technology. The concept of online education is not new: psychology professor Sidney Pressey developed a mechanical teaching machine in the 1920s, and many computer-based training (CBT) applications were developed in the 1980s to exploit the evolution of the personal computer. The question here is how accredited online universities can provide you with the best online degree. To arrive at the best online university for you, first ask yourself what discipline you want; that narrows the field and increases your chances of finding the right college online. Once you have a short list of options, you can start looking for colleges online that offer that course of study. You will also need to investigate how long the online university has been in existence, as well as the type of resources it offers.

Understanding user-agent strings (Internet Explorer)
Updated: July 2013

Here we discuss the user-agent string, which identifies your browser and provides certain system details to the servers hosting the websites you visit. We'll also learn how to view your user-agent string, understand the tokens used by recent versions of Windows Internet Explorer, and understand the registry keys that affect the user-agent string.

Introduction

When you visit a webpage, your browser sends the user-agent string to the server hosting the site that you are visiting. Because certain non-Microsoft software adds details to the user-agent string, it's important to understand it.

Understanding the user-agent string

When you request a webpage, your browser sends a number of headers to the server hosting the site that you're visiting. These headers occur during a negotiation process that helps the browser and the hosting server determine the best way to provide the requested information.
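
The article's examples are IE-specific, but you can watch the same negotiation from any HTTP client. A small Ruby sketch (ours, not Microsoft's) that sends a custom user-agent string and prints the headers the client transmits:

    # Send a request with an explicit user-agent string and print the headers.
    # The UA value is just an example token set; the URL is a placeholder.
    require "net/http"
    require "uri"

    uri = URI("https://example.com/")
    req = Net::HTTP::Get.new(uri)
    req["User-Agent"] = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)"

    req.each_header { |name, value| puts "#{name}: #{value}" }   # headers to be sent
    res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
    puts res.code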

DATA: Without adding context, a journalist with data can be dangerous

If you believe the predictions, 2011 will be the year when journalists have more access to data than ever before. Of course, much of the data will also be accessible to the public in general, but I suspect more people will be exposed to data via journalism than will actively seek it out themselves. And with that comes a responsibility to make sure that journalists present the full picture with a set of data. In other words, add some context. The old phrase about lies, damned lies and statistics can come true if one set of data is taken in isolation. Paul Bradshaw touched on this when looking at a story in November which 'revealed' that Birmingham had more CCTV cameras than any other council area. So the challenge for 2011 isn't just making use of all the data that's available; it's making use of it responsibly, linking data together to come up with a true picture. If journalists don't do this, then there will be people who do it for them, post-publication. ... This data took 10 minutes to compile.

Eight Tools for Effective Explanation

Explainers often have lofty goals in their subject matter, but we know that different people have different styles of learning. Explainers utilize many tools to break down complicated subjects beyond just a block of text, and we've collected eight of the best. Some of them are visual, interactive, or entertaining, but all of them help users easily digest intricate topics.

Infographics: infographics are visual representations of data.
Animation: whether you use After Effects, Flash, or good old paper and pencil, animation can be one of the most successful tools for visual explanation.
Mapping: as Google Maps and its competitors continue to evolve, and location-specific data becomes more readily available, the applications for mapping are growing.
Timelines: just as mapping allows you to relate data between locations, timelines allow you to plot out distinct moments in time.
Music: like animation, music is a primarily entertaining medium, but it can also encourage the retention of information.

Chapter 2: Reading Data from Flash Sites

Flash applications often prevent the direct copying of the data they display, but we can instead use the raw data files sent to the web browser. Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine the raw data files that are sent to your web browser, without worrying about how the data is visually displayed. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser's traffic is a basic technique you should use when first examining a database-backed website.

Background

In September 2008, drug company Cephalon pleaded guilty to a misdemeanor charge and settled a civil lawsuit involving allegations of fraudulent marketing of its drugs. Cephalon's report is not downloadable, and the site disables the mouse's right-click function, which typically brings up a pop-up menu with the option to save the webpage or inspect its source code.
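
In code, the trick amounts to skipping the Flash layer entirely: find the data file's URL in your browser's network panel, then download it directly. A minimal Ruby sketch -- the URL below is a placeholder, not the actual Recovery.gov file:

    # Once the browser's network traffic reveals the raw file behind a Flash
    # widget, fetch that file directly. The URL here is hypothetical.
    require "open-uri"

    data_url = "https://example.gov/map/data.txt"   # placeholder for the real data file
    File.write("raw_data.txt", URI.open(data_url).read)
    puts File.read("raw_data.txt").lines.first(5)   # peek at the first few lines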

Google Launches Global Online Science Fair [Video]

For years, employees at Google have suggested a project near and dear to their nerd hearts: a Google-led science fair. "It's come up over and over and over again," says Cristin Frodella, a senior product marketing manager in education at Google. After all, many a Googler has fond childhood memories of explaining the genius of his or her biology experiment to passersby in a school gym. (Frodella and her best friend trained hamsters to ask for food by ringing a bell.) Today those Googlers, and budding scientists worldwide, should be ecstatic. This is a far cry from your typical local science fair: Google's partners include National Geographic, CERN, Scientific American, and Lego. Vint Cerf gave a brief history of the Internet, telling students that breakthroughs of that magnitude don't "just happen." William Kamkwamba, a self-taught scientist from Malawi, talked about the direct impact that science can have, not just on society at large, but on an individual community.

Recommended Search Engines (The Library)

Google alone is not always sufficient. Not everything on the Web is fully searchable in Google; overlap studies show that more than 80% of the pages in a major search engine's database exist only in that database. For this reason, getting a "second opinion" can be worth your time, and for this purpose we recommend Yahoo! Some common techniques will work in any search engine. You may also wish to consult "What Makes a Search Engine Good?"

How do search engines work?

Search engines do not really search the World Wide Web directly. Search engine databases are selected and built by computer robot programs called spiders. If a web page is never linked from any other page, search engine spiders cannot find it. After spiders find pages, they pass them on to another computer program for "indexing." Many web pages are excluded from most search engines by policy.
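
To make the spider/indexer distinction concrete, here is a toy crawler in Ruby -- our illustration, assuming the nokogiri gem and a placeholder seed URL; real spiders also obey robots.txt and crawl politely:

    # A toy "spider": fetch a page, record something about it (the "indexing"
    # step), then follow its links. Pages linked from nowhere are never found.
    require "open-uri"
    require "nokogiri"
    require "set"

    queue = ["https://example.com/"]   # hypothetical seed page
    seen  = Set.new
    index = {}

    while (url = queue.shift) && seen.size < 10   # small cap for the demo
      next if seen.include?(url)
      seen << url
      begin
        page = Nokogiri::HTML(URI.open(url).read)
      rescue StandardError
        next                                      # unreachable page: skip it
      end
      index[url] = page.title                     # "indexing": record the title
      page.css("a[href]").each do |a|             # "spidering": follow links
        link = URI.join(url, a["href"]).to_s rescue next
        queue << link if link.start_with?("http")
      end
    end

    index.each { |u, t| puts "#{u} -> #{t}" }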

The Necessity of Data Journalism in the New Digital Community

This is the second post in a series from Nicholas White, the co-founder and CEO of The Daily Dot. It used to be that to be a good reporter, all you had to do was get drunk with the right people. Sure, it helped if you could string a few words together, but what was really important was that when news broke, you could get the right person on the phone and get the skinny. Or when something scandalous was going down somewhere, someone would pick up the phone and call you. Increasingly today, in selecting and training reporters, the industry seems to focus on the stringing-words-together part. That's not how we're building our newsroom at The Daily Dot. One: our very first newsroom hire, after our executive editor, was Grant Robertson, who's not only a reporter and an editor but also a programmer. We found it necessary to push early in this direction because of our unique coverage area, and we're in the fortunate position of being able to build our newsroom from scratch. How do we report on that?
