Where can I find large datasets open to the public?
Datasets for Data Mining, Analytics and Knowledge Discovery
See also: Data repositories. AssetMacro, historical data of Macroeconomic Indicators and Market Data.
Public Data Sets
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use. Last Modified: Mar 17, 2014 17:51 PM GMT
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth. Last Modified: Nov 12, 2013 13:27 PM GMT
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available. Last Modified: Oct 8, 2013 14:38 PM GMT
Human Microbiome Project Data Set. Last Modified: Sep 26, 2013 17:58 PM GMT
The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available. Last Modified: Jul 18, 2012 16:34 PM GMT
Publicly Available Big Data Sets :: Hadoop Illuminated
Public data sets on Amazon AWS
Amazon provides the following data sets: ENSEMBL Annotated Genome data, US Census data, UniGene, Freebase dump. Data transfer is 'free' within the Amazon ecosystem (within the same zone). AWS data sets
InfoChimps
InfoChimps has a data marketplace with a wide variety of data sets. InfoChimps marketplace
Comprehensive Knowledge Archive Network
Open source data portal platform; data sets available on datahub.io from ckan.org
Stanford network data collection
Open Flights
Crowd-sourced flight data
Flight arrival data
Data Sets
The Pew Research Center's Internet Project is pleased to offer scholars access to raw data sets from our research. All uses of this data should reference the Pew Research Center as the source of the data and acknowledge that Pew Research bears no responsibility for interpretations presented or conclusions reached based on analysis of the data. Our data sets are made available as single compressed archive files (.zip file). Pew Research is interested in learning about other ways that scholars use our data.
January 2014 – 25th Anniversary of the Web (Omnibus): This survey contains questions about internet usage, cell and smartphone ownership, and Americans’ views about the role of the internet in their lives.
January 2014 – E-reading and Gadgets (Omnibus): This omnibus survey contains questions about reading, e-reading, and various electronic devices.
October 2013 – Pictorial Activities (Omnibus)
July 2013 – Anonymity (Omnibus): This omnibus survey contains questions about anonymity online.
How to access 100M time series in R in under 60 seconds
DataMarket, a portal that provides access to more than 14,000 data sets from various public and private sector organizations, has more than 100 million time series available for download and analysis. (Check out this presentation for more info about DataMarket.) And now with the new package rdatamarket, it's trivially easy to import those time series into R for charting, analysis, or anything. Here's what you need to do:
1. Register an account on DataMarket.com (it's free)
2. Install the rdatamarket package in R with install.packages("rdatamarket")
3. Browse DataMarket.com for a time series of interest (I found this series on unemployment)
4. Copy the URL of the page you're on (the short URL works too)
5. Use the dmseries function with the URL to extract the time series as a zoo object
With this package, you can go from finding interesting data on DataMarket to working with it in R in less than a minute.
IT Operations Analytics
In the fields of information technology and systems management, IT Operations Analytics (ITOA) is an approach or method applied to application software designed to retrieve, analyze and report data for IT operations. ITOA has been described as applying big data analytics to large datasets where IT operations can extract unique business insights.[1][2] In its Hype Cycle Report, Gartner rated the business impact of ITOA as being ‘high’, meaning that its use will see businesses enjoy significantly increased revenue or cost saving opportunities.[3] By 2017, Gartner predicts that 15% of enterprises will use IT operations analytics technologies to deliver intelligence for both business execution and IT operations.[2]
Due to the mainstream embrace of cloud computing and the increasing desire for businesses to adopt more Big Data practices, the ITOA industry has grown significantly since 2010.
Machine Learning Repository
The 70 Online Databases that Define Our Planet
Back in April, we looked at an ambitious European plan to simulate the entire planet. The idea is to exploit the huge amounts of data generated by financial markets, health records, social media and climate monitoring to model the planet’s climate, societies and economy. The vision is that a system like this can help to understand and predict crises before they occur so that governments can take appropriate measures in advance. There are numerous challenges here. Today, we get a grand tour of this challenge from Dirk Helbing and Stefano Balietti at the Swiss Federal Institute of Technology in Zurich. It turns out that there are already numerous sources of data that could provide the necessary fuel to power Helbing’s Earth Simulator. While good data from social sciences experiments has been hard to come by in the past, researchers are currently swamped by it thanks to a new generation of lab experiments, web experiments and the study of massive multi-player on-line games. Where’s George?
Data Visualisation: What's the big deal? | Career and Hiring Insights | Aquent
The concept of using pictures to understand complex information, especially data, has been around for a very long time, centuries in fact. One of the most cited examples of statistical graphics is Napoleon’s invasion of Russia mapped by Charles Minard. The maps showed the size of the army and the path of Napoleon’s retreat from Moscow. It also included detailed information like temperature and time scales, providing the audience with an in-depth understanding of the event. However, as with most things, it’s technology that has truly allowed data visualisation to take the stage and get noticed. It’s no surprise that with big data there’s potential for BIG opportunity (someone pass me the shot glass), but many corporates are genuinely challenged when it comes to: understanding the data they have; finding value in it; getting the wider business to buy in and just GET IT!!! So how do you tackle this? How do you get people to comprehend this information quickly? One word: INSIGHT.