Machine Learning Repository
Datasets for Data Mining and Data Science
See also: Data repositories
AssetMacro: historical data of macroeconomic indicators and market data.
Awesome Public Datasets on GitHub, curated by caesar0301.
AWS (Amazon Web Services) Public Data Sets: a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
BigML: big list of public data sources.
OpeNER - Webservices
Input Tools
This collection of components is used to start OpeNER pipelines. For now, only a language identifier is available.
Language Identifier
The language identifier receives plain text and outputs the language of the input text. The identified language can be passed to the OpeNER modules that require a language parameter. More information about the webservice can be found at its endpoint.
Basics
These components are the start of each OpeNER pipeline.
Tokenizer
The tokenizer receives plain text and a language parameter as input. More information about the webservice can be found at its endpoint.
POS Tagger
Part-of-speech tagging means identifying whether each word is a noun, a verb, etc. More information about the webservice can be found at its endpoint.
Tree Tagger
This tool implements a wrapper for TreeTagger, so that the tagger can be applied to KAF files and return its result in KAF format as well.
NER/NED/Co-reference
Coreference
Graphs
Please contact Christian Sommer for comments and questions, or if you have other data sets. Last update: April 2010.
Used for shortest path queries. DIMACS means 9th DIMACS Implementation Challenge - Shortest Paths.
DBLP graph: The DBLP Computer Science Bibliography; co-author graph, largest connected component.
Web graph: WebGraph by the Laboratory for Web Algorithmics; link graph interpreted as undirected graph (in which case it is already connected).
Router topology: CAIDA's Router-Level Topology Measurements. "The [...] data file holds link directions corresponding to the traceroute directions." Second file (itdk0304_rlinks_undirected), interpreted as undirected graph, largest connected component.
Citation graph: KDD competition, citation graph of the hep-th portion of the arXiv; hep-th citations tarball, interpreted as undirected graph, largest connected component.
Database of Interacting Proteins
BioGRID
DIMACS: format copied from DIMACS.
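Several entries above are described as "interpreted as undirected graph, largest connected component". The following is a minimal sketch of that preprocessing step, not the page author's actual code: it assumes the graph is available as a list of (source, target) pairs, and the function name and sample links are hypothetical.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component.

    Each (u, v) pair is treated as an undirected edge, so link
    direction is ignored. Nodes without any edge do not appear.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects one whole component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical directed links; nodes 4 and 5 form a separate, smaller component.
directed_links = [(1, 2), (3, 2), (3, 1), (4, 5)]
lcc = largest_connected_component(directed_links)
kept_links = [(u, v) for u, v in directed_links if u in lcc and v in lcc]
print(sorted(lcc), kept_links)
```

Restricting the edge list to the returned node set gives a connected graph, which is convenient for shortest path queries because every pair of remaining nodes is then reachable.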
StatLib---Datasets Archive
If you have an interesting dataset, or a collection of data from a book, please consider submitting the data. To submit a dataset, please see the submission guidelines. Some of the entries are shar archives. The datasets archive currently contains:
NIST Statistical Reference Datasets (StRD): A pointer to a NIST site that contains reference datasets for the objective evaluation of the computational accuracy of statistical software.
agresti: Contains data from "An Introduction to Categorical Data Analysis," by Alan Agresti, John Wiley, 1996, plus SAS code for various analyses.
Aldrich_Nelson.zip: This data is used in the following book: Aldrich, J. and Nelson, F. (1984) "Linear Probability, Logit and Probit Models".
alr: This file contains data from Applied Linear Regression, 2nd Edition, by Sanford Weisberg, John Wiley, 1985 (sandy@umnstat.stat.umn.edu) (36808 bytes)
analcatdata: A collection of the data sets used in the book "Analyzing Categorical Data," by Jeffrey S. Andrews Arsenic arsenic.zip
50 Resources for Getting the Most Out of Google Analytics
Google Analytics is a very useful free tool for tracking site statistics. For most users, however, it never becomes more than a pretty interface with interesting graphs. The resources below will help anyone, from beginners to those who have been using Google Analytics for some time, learn how to get the most out of this tool.
For Beginners
The following list of links will help you get started with Google Analytics, from setup to understanding the data it presents. How to Use Google Analytics for Beginners – Mahalo’s how-to guide for beginners.
Tips & Tricks
If you’re already fairly familiar with Google Analytics and you’re ready to dig deeper into the information it makes available, this list of tips & tricks is for you.
Plugins, Hacks & Additions
Want to learn how to get even more out of Google Analytics by extending it with third-party plugins, additions and hacks?
stop
| A Spanish stop word list. Comments begin with vertical bar.
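The comment convention described above can be handled with a short parser. This is a minimal sketch: it assumes each line holds zero or more whitespace-separated words, optionally followed by a `|` comment. The function name and the three sample entries are illustrative and are not taken from the file itself.

```python
def parse_stop_words(text):
    """Return the stop words in `text`, ignoring '|' comments."""
    words = []
    for line in text.splitlines():
        # Everything from the first vertical bar to the end of the line is a comment.
        content = line.split("|", 1)[0]
        # A line may hold several words, one word, or nothing at all.
        words.extend(content.split())
    return words


# Illustrative sample in the same layout as the list header above.
sample = """\
 | A Spanish stop word list. Comments begin with vertical bar.
de             |  from, of
la             |  the, her
que            |  who, that
"""

stop_words = set(parse_stop_words(sample))
tokens = "la casa de que hablo".split()
filtered = [token for token in tokens if token not in stop_words]
print(filtered)
```

Storing the words in a set keeps the membership test cheap when filtering long token streams.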
Data + Design
Running your own study to collect data is not the only or best way to start your data analysis. Using someone else’s dataset and sharing your own data are on the rise, and both have helped advance much of the recent research. Using external data offers several benefits:
Where to Find External Data
All those benefits sound great!
Public Data
Once you have a better idea of what you’re looking for in an external dataset, you can start your search at one of the many public data sources available to you, thanks to the open content and access movement that has been gaining traction on the Internet. If you decide to use a search engine (like Google) to look for datasets, keep in mind that you’ll only find things that are indexed by the search engine. If you’re not sure what to do with a particular type of data, try browsing through the Information is Beautiful awards for inspiration.
Non-Public Data
Of course, not all data is public.
Assessing External Data
Using External Data
Public Data Sets on AWS
Click here for the detailed list of available data sets. Here are some examples of popular Public Data Sets:
NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
1000 Genomes Project: A detailed map of human genetic variation
Google Books Ngrams: A data set containing Google Books n-gram corpuses
US Census Data: US demographic data from 1980, 1990, and 2000 US Censuses
Freebase Data Dump: A data dump of all the current facts and assertions in the Freebase system, an open database covering millions of topics
The data sets are hosted in two possible formats: Amazon Elastic Block Store (Amazon EBS) snapshots and/or Amazon Simple Storage Service (Amazon S3) buckets. If you have any questions or want to participate in our Public Data Sets community, please visit our Public Data Sets forum.
Common Google Universal Analytics Mistakes that kill your Analysis & Conversions
I have audited hundreds of web analytics accounts and profiles, and each account/view had at least one or two issues that seriously stood in the way of getting optimum results from my analysis. I have put all of these issues into five broad categories:
Directional Issues
Data Collection Issues
Data Integration Issues
Data Interpretation Issues
Data Reporting Issues
These are the most common mistakes that kill your analysis, reporting and conversions. To get optimum results from your analysis of Universal Analytics reports, you must aim to find and fix as many of these issues as possible. Failing to do so will almost always result in inaccurate analysis, interpretation and reporting.
1. These issues are not associated with Google Universal Analytics or any other analytics software you use. They are commonly found in analysts themselves and are reflected in the way they set up Google Analytics accounts, advanced segments, conversion segments, filters and custom reports.