
OpenIntro


Mining of Massive Datasets
The book has now been published by Cambridge University Press. The publisher is offering a 20% discount to anyone who buys the hard copy here. By agreement with the publisher, you can still download it free from this page.
--- Jure Leskovec, Anand Rajaraman (@anand_raj), and Jeff Ullman
Download Version 2.1
The following is the second edition of the book, which we expect to be published soon. There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Version 2.1 adds Section 10.5 on finding overlapping communities in social graphs.
Download the Latest Book (511 pages, approximately 3MB)
Download chapters of the book:
Download Version 1.0
The following materials are equivalent to the published book, with errata corrected to July 4, 2012.
Download the Book as Published (340 pages, approximately 2MB)
Gradiance Support
Other Stuff
Jure's Materials from the most recent CS246.

Data Science Wars: Python vs. R
As I frequently travel in data science circles, I’m hearing more and more about a new kind of tech war: Python vs. R. I’ve lived through many tech wars in the past, e.g. Windows vs. Linux, iPhone vs. While R has traditionally been the programming language of choice for data scientists, some believe it is ceding ground to Python.
R is Too Complex
The most frequently stated argument I’ve heard is that Python is general purpose and comparatively easy to learn, whereas R remains a somewhat complex programming environment to master. When I first learned R, I did not find it particularly complex; it was a lot easier for me to learn R than C++ or Java with their mammoth frameworks.
R Isn’t Really a Language
Another argument says that part of the reason people struggle to learn R is that it’s not really a language.
Python is More Approachable
Some feel that Python is more approachable. Remember, R is a very old statistical environment that has an incredible global following.

Interactive Statistical Calculation Pages

Detecting multicollinearity using variance inflation factors | STAT 501 - Regression Methods
Okay, now that we know the effects that multicollinearity can have on our regression analyses and subsequent conclusions, how do we tell when it exists? That is, how can we tell if multicollinearity is present in our data? Some of the common methods used for detecting multicollinearity include: The analysis exhibits the signs of multicollinearity, such as coefficient estimates that vary from model to model. Looking at correlations only among pairs of predictors, however, is limiting.
What is a variance inflation factor?
As the name suggests, a variance inflation factor (VIF) quantifies how much the variance is inflated. Let's be a little more concrete. It can be shown that the variance of the estimated coefficient b_k is: Note that we add the subscript "min" in order to denote that it is the smallest the variance can be. Let's consider such a model with correlated predictors: How much larger?
An example the matrix plot of BP, Dur, Pulse, and Stress:
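The equations referred to in this excerpt did not survive extraction. As a hedged sketch, the standard textbook form of the result is given below; the symbols \sigma^2 (error variance), \bar{x}_k (mean of the k-th predictor), and R_k^2 (the R-squared from regressing the k-th predictor on the remaining predictors) are assumed notation and may differ from the original page.

```latex
% Assumed notation, not taken from the excerpt:
%   \sigma^2  = error variance
%   x_{ik}    = value of predictor k for observation i
%   R_k^2     = R-squared from regressing predictor k on the other predictors

% Model with x_k as the only predictor: the variance of b_k is at its minimum.
\[
\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}
\]

% Model with correlated predictors: the same variance, inflated by a factor.
\[
\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}
                    \times \frac{1}{1 - R_k^2}
\]

% The ratio of the two is the variance inflation factor.
\[
\mathrm{VIF}_k = \frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}}
               = \frac{1}{1 - R_k^2}
\]

% Worked example: if the other predictors explain 90% of the variation in x_k,
\[
R_k^2 = 0.9 \quad\Longrightarrow\quad \mathrm{VIF}_k = \frac{1}{1 - 0.9} = 10
\]
```

Under these assumptions, a predictor that is uncorrelated with the others has R_k^2 = 0 and therefore a VIF of 1, meaning no inflation at all.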

Resources for Statistical Computing
Other Resources for Help with Statistical Computing
The primary mission of the IDRE Statistical Consulting Group is to support UCLA researchers in statistical computing using statistical packages like SAS, Stata, SPSS, HLM, MLwiN, Mplus and so forth. We provide this support through our web pages, our walk-in consulting services, classes and seminars, and email consulting. Below, we provide a list of commonly used statistical software packages along with sources of support, including newsgroups/mailing lists, web pages provided by the vendors, and the vendor's technical support email address.
Other lists
news:sci.stat.consult - General issues in statistics.
The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.

Crime data exploration in R using ggplot2 - Active Analytics
Introduction
The purpose of this blog post is to outline some exploratory plots using crime data, available from the data.gov.uk website, and the ggplot2 package in R. The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great-looking plots and impressive flexibility have made it popular among R coders. Though this blog post has been created for crime data, the principles can be extended to analysis of many different data sets. Before I begin there are two items to cover: 1. 2.
The Data
The data used in this plotting tutorial was from the data.gov.uk website.
    # We load some packages
    # Our plotting tool
    require(ggplot2)
    # For arranging the plots
    require(gridExtra)
    # For manipulating the plot scales
    require(scales)
    # For generating our svg files
    require(grDevices)
    # Read character columns as factors
    options("stringsAsFactors" = TRUE)
    # Path to the folder holding the data csv
    path <- "C:\\
    btpData <- read.csv(file = paste(path, "BTP-Dec-2012.csv", sep = ""), header = TRUE)
The dimensions of the table ...

Top 100 R packages for 2013 (Jan-May)!
What are the top 100 (most downloaded) R packages in 2013? Thanks to RStudio's recent release of its “0-cloud” CRAN log files (which do not include downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for the months of January to May)! Relying on the nice code that Felix Schonbrodt recently wrote for tracking package downloads, I have updated my installr R package with functions that enable the user to easily download and visualize the popularity of R packages over time.
Top 8 most downloaded R packages – downloads over time
Let’s first have a look at the number of downloads per day over these 5 months for the top 8 most downloaded packages (click the image for a larger version): We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having far fewer downloads than other days.
“Family tree” of the top 100 most downloaded R packages
Such analysis can (and should!)
R code

swirl - Instructors
swirl is a platform for teaching R programming and data science. However, an educational platform is only as good as the content it delivers to students. Although we have contributed some content ourselves, swirl is designed in such a way that you can create your own interactive content and share it freely with students in your classroom or around the world. The swirlify R package provides a comprehensive toolbox for swirl instructors.
Step 1: Get R
In order to run swirl and swirlify, you must have R 3.0.2 or later installed on your computer. If you need to install R, you can do so here. For help installing R, check out one of the following videos (courtesy of Roger Peng at Johns Hopkins Biostatistics):
Step 2 (recommended): Get RStudio
In addition to R, it’s highly recommended that you install RStudio, which will make your experience with R much more enjoyable. If you need to install RStudio, you can do so here.
Step 3: Install swirl and swirlify
Step 4: Start swirlify
To create a new lesson:
