Coding for Journalists 101 : A four-part series
Photo by Nico Cavallotto on Flickr. Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles, with far more practical and up-to-date examples and projects than what you'll find here. I'm only keeping this old walkthrough up as a historical reference. So check it out: The Bastards Book of Ruby. -Dan. Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, describing what I did for our Dollars for Docs project. So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. As the tutorials are aimed at people who aren't experienced in programming, the code is pretty verbose, pedantic, and in some cases a little inefficient.

An Introduction to Compassionate Screen Scraping. Screen scraping is the art of programmatically extracting data from websites. If you think it's useful: it is. If you think it's difficult: it isn't. We're going to be doing this tutorial in Python, and will use the httplib2 and BeautifulSoup libraries to make things as easy as possible. Websites crash. For my blog, the error reports I get are all generated by overzealous webcrawlers from search engines (perhaps the most ubiquitous species of screen scraper). This brings us to my single rule for socially responsible screen scraping: screen scraper traffic should be indistinguishable from human traffic. Cache fervently. Now, armed with those guidelines, let's get started screen scraping. Setup Libraries: First we need to install the httplib2 and BeautifulSoup libraries:

sudo easy_install BeautifulSoup
sudo easy_install httplib2

If you don't have easy_install installed, then you'll need to download the libraries from their project pages at httplib2 and BeautifulSoup. Choosing a Scraping Target
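The excerpt's recipe, as a minimal sketch: fetch a page through httplib2's disk cache and parse it with BeautifulSoup. The URL and the '.cache' directory name are illustrative assumptions, and the import uses bs4, the modern successor to the BeautifulSoup 3 package the commands above install:

import httplib2
from bs4 import BeautifulSoup

# Passing a directory name switches on httplib2's disk cache, so repeated
# runs re-read pages from '.cache' instead of hitting the site again --
# "cache fervently" in practice, and kinder to the server.
http = httplib2.Http('.cache')

# Illustrative URL; substitute the page you actually want to scrape.
response, content = http.request('http://example.com/')

soup = BeautifulSoup(content, 'html.parser')
# List every hyperlink on the page.
for link in soup.find_all('a', href=True):
    print(link['href'])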

Data journalism pt1: Finding data (draft – comments invited). The following is a draft from a book about online journalism that I've been working on. I'd really appreciate any additions or comments you can make – particularly around sources of data and legal considerations. The first stage in data journalism is sourcing the data itself. Often you will be seeking out data based on a particular question or hypothesis (for a good guide to forming a journalistic hypothesis see Mark Hunter's free ebook Story-Based Inquiry (2010)). On other occasions, it may be that the release or discovery of data itself kicks off your investigation. There is a range of sources available to the data journalist, both online and offline, public and hidden: national and local government; bodies that monitor organisations (such as regulators or consumer bodies); scientific and academic institutions; health organisations; charities and pressure groups; business; and the media itself. Private companies and charities. Regulators, researchers and the media. Live data

Network Graph - Fusion Tables Help. Current limitations include: the visualization will only show relationships from the first 100,000 rows in a table (a filter can include rows from 100,001 or beyond, but the graph will still not display them); Internet Explorer 8 and below are not supported. Each row of a table represents one relationship in the graph; in the help example, the network graph shows each row as a line connecting a person and a dog. To create a Network Graph in the New look: [+] > Add chart, then click the Network Graph button. To create one in Classic: Experiment > Network Graph. By default, the first two text columns will be selected as the source of nodes: Node column 1 and Node column 2. Adjust the Network Graph's display: select a number column to act as a weight factor for line length. Interact with the Network Graph: "camera" zoom means nodes become bigger but not more or less numerous. Good to know: multiple relationships between two nodes are summed into a thicker line. Try it yourself!
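To make the row-per-relationship idea concrete, here is a small Python sketch (my own illustration with invented person-dog rows, not part of Fusion Tables): aggregating duplicate rows into weights is essentially what the chart does when it draws repeated relationships as one thicker line.

from collections import Counter

# Each table row is one relationship -- one edge in the graph.
rows = [
    ("Alice", "Rex"),
    ("Alice", "Rex"),   # repeated relationship between the same two nodes...
    ("Bob", "Fido"),
]

# ...which the network graph sums into a single, thicker line.
for (person, dog), weight in Counter(rows).items():
    print(person, "--", dog, "weight:", weight)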

Scraping for Journalism: A Guide for Collecting Data. Photo by Dan Nguyen/ProPublica. Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We've written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of the moderately experienced programmer; these recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping websites -- so you know what you're asking for if you end up hiring someone to do the technical work for you. The tools: with the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source. Google Refine (formerly known as Freebase Gridworks) – a sophisticated application that makes data cleaning a snap. Ruby – the programming language we use the most at ProPublica.

How to make your infographics accessible and SEO friendly at the same time. Infographics are everywhere. Some good - some bad. But most creators don't stop to think how to make sure search engines can understand their infographic - or how people who can't see pictures can consume them (maybe because they rely on screen readers or have chosen not to download images to their mobile phone). The trick to making infographics accessible and SEO friendly is to ensure: they're chopped into relevant sections (i.e. not one big image); text is text (you should be able to select it with a mouse); and if anything has to be shown as an image, you set appropriate ALT text (the flipside of this is that, if the image doesn't add any information, you DON'T set ALT text - I'll explain this below). Making an infographic accessible: there's lots of infographics out there. Also I should point out that I'm a crap HTML coder, so if anyone can improve on this, do let me know. Separate images and text: as it stands, that bottom left bit is just part of an enormous image. OK, you're thinking. What now?
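A rough sketch of the markup pattern being described – my own illustrative HTML with made-up content, not the article's code. Each section of the infographic becomes real, selectable text plus an image, with ALT text only where the image actually carries information:

<!-- One section of the infographic, rather than one giant image. -->
<section>
  <h2>Smartphone ownership, 2010</h2>
  <!-- The chart conveys data, so it gets descriptive ALT text. -->
  <img src="ownership-chart.png"
       alt="Bar chart: smartphone ownership rising steadily through 2010">
  <!-- The key point is real text: selectable, indexable, readable aloud. -->
  <p>Ownership rose steadily over the year.</p>
  <!-- A decorative flourish adds no information: empty alt, so screen
       readers skip it. -->
  <img src="decorative-swirl.png" alt="">
</section>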

guides/data-bulletproofing.md at master · propublica/guides

7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don't Know – Fell in Love with Data. (In my last post I introduced the idea of regularly posting research material in this blog as a way to bridge the gap between researchers and practitioners. Some people kindly replied to my call for feedback and the general feeling seems to be: "cool go on! rock it! we need it!". Ok, thanks guys, your encouragement is very much needed. I love you all.) Even if I am definitely not a veteran of infovis research (far from it), I started reading my first papers around the year 2000 and since then I've never stopped. The papers I chose come from the very early days of infovis, are foundational, are cited over and over, and I like them a lot. Of course this doesn't mean these are the only ones you should read if you want to dig into this matter. Advice: in order to really appreciate them, remember they have all been written during the '90s (some even in the '80s!). Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. Please don't tell me you don't know this one!

Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I. A couple of weeks ago, I came across Gephi, a desktop application for visualising networks. And quite by chance, a day or two after, I was asked about any tools I knew of that could visualise and help analyse social network activity around an OU course… which I take as a reasonable justification for exploring exactly what Gephi can do :-) So, after a few false starts, here's what I've learned so far… First up, we need to get some graph data – the netvizz – facebook to gephi post suggests that the netvizz facebook app can be used to grab a copy of your Facebook network in a format that Gephi understands, so I installed the app, downloaded my network file, and then uninstalled the app… (can't be too careful ;-) Once Gephi is launched (and updated, if it's a new download – you'll see an updates prompt in the status bar along the bottom of the Gephi window, right-hand side), Open… the network file you downloaded. You can also generate views of the graph that show information about the network. Like this:
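For reference, the network file netvizz produced at the time was, as far as I know, a plain-text GDF file, which Gephi opens directly (an assumption worth checking against your own download). A minimal hand-made example of the format, with invented names rather than real Facebook data:

nodedef>name VARCHAR,label VARCHAR
a1,Alice
b2,Bob
c3,Carol
edgedef>node1 VARCHAR,node2 VARCHAR
a1,b2
b2,c3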

dataist blog: An inspiring case for journalists learning to code | Dan Nguyen pronounced fast is danwin. About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke: I haven't looked back at it, because I'm sure I'll just spend the next few hours cringing. For example, what a dumb idea it was to put everything from "What is HTML" to actual Ruby scraping code all in one gigantic, badly formatted post. The series of articles has gotten a fair number of hits, but I don't know how many people were able to stumble through it. Mapping of the Ratata blogging network by Jens Finnäs of dataist.wordpress.com. I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs's example. ProPublica's Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn't as useful as it should be. In fact, just knowing to avoid taking notes like this:

Statistics: Making Sense of Data. About the Course: We live in a world where data are increasingly available, in ever larger quantities, and are increasingly expected to form the basis for decisions by governments, businesses, and other organizations, as well as by individuals in their daily lives. To cope effectively, every informed citizen must be statistically literate. This course will provide an intuitive introduction to applied statistical reasoning, introducing fundamental statistical skills and acquainting students with the full process of inquiry and evaluation used in investigations in a wide range of fields. Course Syllabus: A first look at data (Weeks 1-2): summary statistics and graphical displays for a single categorical or quantitative variable and for relationships between two variables. Collecting data (Week 2): sampling. Probability (Week 3): probability models, the normal distribution, the Law of Large Numbers, the Central Limit Theorem, sampling distributions. Recommended Background. Course Format
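Two of the syllabus topics lend themselves to a quick, self-contained illustration. A minimal Python sketch (my own, not course material): draw many samples from a skewed population and watch their means behave as the Law of Large Numbers and the Central Limit Theorem predict.

import random
import statistics

random.seed(42)      # reproducible illustration
sample_size = 50     # observations per sample
num_samples = 2000   # how many sample means to collect

# Exponential(1) is a skewed population whose true mean is 1.0.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

# Law of Large Numbers: the average of the sample means sits near 1.0.
# Central Limit Theorem: the means themselves are roughly bell-shaped,
# with spread shrinking like 1/sqrt(sample_size).
print("mean of sample means:", round(statistics.mean(sample_means), 3))
print("stdev of sample means:", round(statistics.stdev(sample_means), 3))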

Data journalism: 22 key links « Simon Rogers. Data journalism and other curiosities. Filed under Data journalism, Data visualisation, How to guides. I have been teaching basic data journalism for a while now – and these links are so useful I keep them with me every time. Free and simple viz tools. Timelines. Google Fusion. Useful Links. Useful mapping tools. Anyone can do it. About Simon Rogers: Data journalist, writer, speaker.
