Scraping for Journalism: A Guide for Collecting Data

Photo by Dan Nguyen/ProPublica

Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We've written a series of how-to guides explaining how we collected the data. Most of the techniques are within the reach of a moderately experienced programmer; the most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly's disclosure site.

These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice with no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites, so you know what you're asking for if you end up hiring someone to do the technical work for you.

The tools

With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.

Ruby – The programming language we use the most at ProPublica.
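The guides walk through fetching pages and saving the extracted rows in a structured format. A minimal sketch of that scrape-and-save pattern in Ruby is below; note that the sample HTML, file name, and column names are invented for illustration, and the guides themselves parse real pages with Nokogiri rather than a regex.

```ruby
require 'csv'

# Stand-in for a fetched disclosure page; in practice you would
# download it with open-uri and parse it with Nokogiri.
SAMPLE_HTML = <<~HTML
  <table>
    <tr><td>Dr. Smith</td><td>$1,500</td></tr>
    <tr><td>Dr. Jones</td><td>$250</td></tr>
  </table>
HTML

# Pull [name, payment] pairs out of the table rows.
def extract_rows(html)
  html.scan(%r{<tr><td>(.*?)</td><td>(.*?)</td></tr>})
end

# Write the scraped rows to a CSV file (hypothetical file name).
rows = extract_rows(SAMPLE_HTML)
CSV.open("payments.csv", "w") do |csv|
  csv << ["doctor", "payment"]
  rows.each { |row| csv << row }
end
```

The fetch, extract, and save steps are deliberately separated: when a site changes its markup, only the extraction step needs rewriting.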
ACM KDD Cup

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners. Below are links to the descriptions of all past tasks.

KDD Cup 2010: Student performance evaluation
KDD Cup 2009: Customer relationship prediction
KDD Cup 2008: Breast cancer
KDD Cup 2007: Consumer recommendations
KDD Cup 2006: Pulmonary embolisms detection from image data
KDD Cup 2005: Internet user search query categorization
KDD Cup 2004: Particle physics; plus protein homology prediction
KDD Cup 2003: Network mining and usage log analysis
KDD Cup 2002: BioMed document; plus gene role classification
KDD Cup 2001: Molecular bioactivity; plus protein locale prediction
KDD Cup 2000: Online retailer website clickstream analysis
KDD Cup 1999: Computer network intrusion detection
KDD Cup 1998: Direct marketing for profit optimization
Data Miners Blog

Scraping a site in Ruby for dummies (or almost)

# encoding: UTF-8
require 'open-uri'
require 'nokogiri'
require 'csv'

# Strip useless characters (newlines and runs of whitespace,
# including non-breaking spaces) from a string
def clean(str)
  str.strip.gsub("\n", ' ').gsub(/[[:space:]]+/, ' ')
end

# the decision types

# we will write into this CSV file
CSV.open("conseil_constitutionel.csv", "w") do |csv|
  # the header row
  csv << ["Année", "Numéro", "Date", "N°", "Type", "Intitulé", "Décision", "URL"]
  # the entry point (the URL is truncated in the original)
  main_url = "
  # on this page, grab every link inside the #articlesArchives div;
  # each one points to a page listing the decisions
  Nokogiri::HTML(open(main_url)).search('#articlesArchives a').each do |a|
    # the link text is the year
    year = a.inner_text
    Nokogiri::XML(open(url_decision), nil, 'UTF-8').search('#articles li').each do |decision|
      if index_id
How to use LinkedIn for data miners

After the article How to use twitter for data miners, let me offer some advice on using LinkedIn. First, you may already know that your LinkedIn account can be linked to display your tweets (see this link). Continue by adding the right keywords to your summary, so that other data miners can find you easily. Examples of terms are data mining, predictive analytics, knowledge discovery and machine learning. Continue by searching for other people with the same interests (use the same keywords as above). The next step is to participate in data mining groups, such as:

ACM SIGKDD
Advanced Business Analytics, Data Mining and Predictive Modeling
AnalyticBridge
Business Analytics
CRISP-DM
Customers DNA
Data Miners
Data Mining Technology
Data Mining, Statistics, and Data Visualization
Machine Learning Connection
Open Source Data Mining
SmartData Collective
Datamining Twitter

On its own, Twitter builds an image for companies; very few are aware of this fact. When a big surprise happens, it is too late: a corporation suddenly sees a facet of its business, most often a looming or developing crisis, flare up on Twitter. As always when a corporation is involved, there is money to be made by converting the problem into an opportunity: social network intelligence is poised to become a big business.

In theory, when it comes to assessing the social media presence of a brand, Facebook is the place to go. By comparison, Twitter more swiftly reflects the mood of users of a product or service. Datamining Twitter is not trivial. Companies such as DataSift (launched last month) exploit the Twitter fire hose by relying on the 40-plus metadata fields included in a post.

[Image: a tweet is a rich trove of data. The image is a tear-down of a much larger one (here, on Krikorian's blog) showing the depth of metadata associated with a tweet.]

Well, think about it next time you tweet from a Starbucks.
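To make the metadata point concrete, here is a sketch in Ruby that picks the brand-relevant fields out of a single tweet payload. The field names follow Twitter's API conventions, but this object is heavily trimmed and every value is invented for illustration; a real fire-hose tweet carries far more fields.

```ruby
require 'json'

# Hypothetical, trimmed-down tweet payload. The real object carries
# 40-plus metadata fields: ids, timestamps, geo, place, client source,
# user profile data, and more.
raw = <<~JSON
  {
    "text": "Trying the new roast",
    "created_at": "Mon Apr 11 08:15:00 +0000 2011",
    "source": "Twitter for iPhone",
    "geo": {"coordinates": [47.61, -122.33]},
    "place": {"full_name": "Seattle, WA"},
    "user": {"screen_name": "coffeefan", "followers_count": 212}
  }
JSON

tweet = JSON.parse(raw)

# A brand watcher cares about where the tweet came from and how far
# it reaches, not just what the text says.
location = tweet.dig("place", "full_name")
reach    = tweet.dig("user", "followers_count")
puts "#{tweet['user']['screen_name']} (#{reach} followers) tweeted from #{location}"
```

Even this toy example shows why the metadata, not the 140 characters of text, is what companies like DataSift sell access to.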