Google Scholar: Part 2

library(RCurl)
library(XML)

get_google_scholar_df <- function(u) {
  html <- getURL(u)
  doc <- htmlParse(html)
  # Apply an XPath expression to each 'gs_r' result node in turn,
  # padding with NA where a node has no match
  GS_xpathSApply <- function(doc, path, FUN) {
    path.base <- "/html/body/div[@class='gs_r']"
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub(path.base,
      paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }
  df <- data.frame(
    footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
    title = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
    type = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
    publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_a']", xmlValue),
    stringsAsFactors = FALSE)
  # Drop the footer column once the other fields have been extracted
  df <- df[, -1]
  return(df)
}
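A minimal usage sketch (the search URL below is a hypothetical example, not one from the post):

u <- "http://scholar.google.com/scholar?q=reproducible+research"  # hypothetical Scholar query URL
df <- get_google_scholar_df(u)
head(df)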

Creating beautiful maps with R Spanish R user and solar energy lecturer Oscar Perpiñán Lamigueiro has written a detailed three-part guide to creating beautiful maps and choropleths (maps color-coded with regional data) using the R language. Motivated by the desire to recreate a graphic from the New York Times, Oscar describes how he creates similar high-quality maps using R. In Part 1, Oscar grabbed voting data from the Spanish general elections and shapefiles (detailed map coordinates) from Spain's official statistics institute (INE) and set to work in R, using the maptools package and lattice graphics to create a choropleth of voting patterns in Spain. In Part 2, he uses the raster package to import pixel-based environmental data from NASA and combines population density with land-cover classes in a single map. Finally, in Part 3, Oscar uses the pxR package to read map data in the PC-Axis format and the gridSVG package to create an interactive SVG map.

Interactive HTML presentation with R, googleVis, knitr, pandoc and slidy Tonight I will give a talk at the Cambridge R user group about googleVis. Following my good experience with knitr and RStudio to create interactive reports, I thought that I should try to create the slides in the same way as well. Christopher Gandrud's recent post reminded me of deck.js, a JavaScript library for interactive HTML slides, which I have used in the past, but as Christopher experienced, it is currently not that straightforward to use with R and knitr. Thus, I decided to try slidy in combination with knitr and pandoc, and it worked nicely. I used RStudio again to edit my Rmd-file and knitr to generate the Markdown md-file, and then converted it with pandoc:

pandoc -s -S -i -t slidy --mathjax Cambridge_R_googleVis_with_knitr_and_RStudio_May_2012.md -o Cambridge_R_googleVis_with_knitr_and_RStudio_May_2012.html

Et voilà, here is the result.
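The same workflow can also be driven entirely from R; a minimal sketch, assuming the Rmd file name matches the md file above and that pandoc is available on the PATH:

library(knitr)
# Render the R Markdown source to plain Markdown
knit("Cambridge_R_googleVis_with_knitr_and_RStudio_May_2012.Rmd")
# Shell out to pandoc to convert the Markdown into a slidy HTML presentation
system(paste("pandoc -s -S -i -t slidy --mathjax",
             "Cambridge_R_googleVis_with_knitr_and_RStudio_May_2012.md",
             "-o Cambridge_R_googleVis_with_knitr_and_RStudio_May_2012.html"))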

Web Scraping Google URLs Google slightly changed the HTML code it uses for hyperlinks on search pages last Thursday, thus causing one of my scripts to stop working. Thankfully, this is easily solved in R thanks to the XML package and the power and simplicity of XPath expressions. Lovely jubbly!

Maps with R (III) In my previous posts (1 and 2) I wrote about maps with complex legends but without any kind of interactivity. In this post I show how to produce an SVG file with interactive functionality using the gridSVG package. As an example, I use a dataset about population from the Spanish Instituto Nacional de Estadística (INE). This organisation publishes information using the PC-Axis format, which can be imported into R with the pxR package; the dataset is available at the INE webpage. Let's start by loading the packages. Then I read the px file and make some changes to get the datWide data.frame. Next, a suitable shapefile is read (see the first post of this series for details). The shapefile and the data.frame are combined by matching the PROV variable of the shapefile with the code variable of the data.frame. A final step is needed before calling spplot; a condensed sketch of the drawing and export steps follows.
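A minimal sketch of that final step, assuming the merged SpatialPolygonsDataFrame is called espMerged and the population column is called pop (both names are placeholders, not taken from the post); gridSVG's grid.export() (gridToSVG() in older versions of the package) writes the grid scene produced by spplot out as an SVG file, which can then be annotated with interactive behaviour:

library(sp)
library(gridSVG)

# Draw the choropleth with spplot (a lattice/grid-based plot)
p <- spplot(espMerged, "pop")
print(p)

# Export the current grid scene as an SVG file
grid.export("map.svg")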

Making interactive slides with Org mode and googleVis in R There's been a lot of justifiable excitement in the R community about Yihui Xie's great work, most recently the incorporation of his knitr package into the RStudio software. Knitr is seen, justifiably, as a worthy successor to Sweave for dynamic, beautiful report generation. It is all that, but as an Org mode user, I already have something better than Sweave for both reproducible research and literate programming, which works with more than 30 different computer languages, not just R. But then Markus Gesmann wrote an interesting blog post about using knitr and the googleVis package to produce interactive HTML presentations by converting the knitr-produced markdown to Slidy, and I wanted to do the same in Org mode. Org mode can easily export to HTML, and there are several documented options for creating slide shows using HTML export or a variant of it. Instead, I'm using org-slidy, which exports to Slidy, the same format Markus used. Any R code source blocks can be written as usual.
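A minimal sketch of the kind of R source block that produces an embeddable googleVis chart (Fruits is a demo data frame shipped with googleVis; printing with tag = "chart" emits only the HTML/JavaScript needed to embed the chart in the exported slides):

library(googleVis)
# Build an interactive motion chart from the demo Fruits data
M <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year")
# Emit only the chart markup so it can be embedded in an HTML slide deck
print(M, tag = "chart")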

Yahoo Search Page via XPath

library(RCurl)
library(XML)

get_yahoo_search_df <- function(u) {
  # Apply an XPath expression to each result node, padding empty matches with NA;
  # FUN2, if supplied, post-processes the raw matches
  xpathSNullApply <- function(doc, path.base, path, FUN, FUN2 = NULL) {
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub(path.base,
      paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
    if(!is.null(FUN2)) xx <- lapply(xx, FUN2)
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }
  html <- getURL(u, followlocation = TRUE)
  doc <- htmlParse(html)
  path.base <- "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li"
  df <- data.frame(
    title = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a", xmlValue),
    stringsAsFactors = FALSE)
  free(doc)
  return(df)
}

# u is a Yahoo search-results URL (see the usage sketch below)
df <- get_yahoo_search_df(u)
t(df[1:5, ])
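The excerpt does not show how u was built; a hypothetical construction, assuming Yahoo's standard search-URL pattern (the query string is an arbitrary example):

# Hypothetical: build a Yahoo search-results URL for a query
u <- paste0("http://search.yahoo.com/search?p=", URLencode("r xpath scraping", reserved = TRUE))
df <- get_yahoo_search_df(u)
t(df[1:5, ])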

Maps with R (I) This is the first post of a short series to show some code I have learnt to produce maps with R. Some time ago I found an infographic from The New York Times and I wondered how a multivariate choropleth map could be produced with R. Here is the code I have arranged to show the results of the last Spanish general elections in a similar fashion. A few packages are needed. Let's start with the data (thanks to Emilio Torres, who "massaged" the original dataset). Each region of the map will represent the percentage of votes obtained by the predominant political option. The Spanish administrative boundaries are available as shapefiles at the INE webpage (~70 Mb) (edited following a question from Sandra). Then we shift the coordinates of the islands and construct a new object binding the shifted islands with the peninsula. The last step before drawing the map is to link the data with the polygons, and then we can draw the map; a condensed sketch of these steps follows.
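A condensed sketch of that workflow; the shapefile name, the province codes used to pick out the islands, the shift offset, and the votes/prov objects are all placeholders rather than the values used in the post:

library(sp)
library(maptools)
library(lattice)

# Read the INE province boundaries (placeholder file name)
esp <- readShapePoly("spain_provinces.shp")

# Split off the island provinces and shift them closer to the peninsula (placeholder codes and offset)
islands   <- esp[esp$PROV %in% c("35", "38"), ]
peninsula <- esp[!(esp$PROV %in% c("35", "38")), ]
islandsShifted <- elide(islands, shift = c(5, 7))
espNew <- spRbind(peninsula, islandsShifted)

# Link the election results to the polygons by province code and draw the map
espNew$votes <- votes[match(espNew$PROV, prov)]
spplot(espNew, "votes")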

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 – Part III

setkey(baa.hp.daily.flights, Year, Month, DayofMonth, Origin)

baa.hp.daily.flights.delayed <- baa.hp[DepDelay > 15,
  list(DelayedFlights = length(DepDelay),
       WeatherDelayed = length(WeatherDelay[WeatherDelay > 0]),
       AvgDelayMins = round(sum(DepDelay, na.rm = TRUE)/length(DepDelay), digits = 2),
       CarrierCaused = round(sum(CarrierDelay, na.rm = TRUE)/sum(DepDelay, na.rm = TRUE), digits = 2),
       WeatherCaused = round(sum(WeatherDelay, na.rm = TRUE)/sum(DepDelay, na.rm = TRUE), digits = 2),
       NASCaused = round(sum(NASDelay, na.rm = TRUE)/sum(DepDelay, na.rm = TRUE), digits = 2),
       SecurityCaused = round(sum(SecurityDelay, na.rm = TRUE)/sum(DepDelay, na.rm = TRUE), digits = 2),
       LateAircraftCaused = round(sum(LateAircraftDelay, na.rm = TRUE)/sum(DepDelay, na.rm = TRUE), digits = 2)),
  by = list(Year, Month, DayofMonth, Origin)]

setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)

# Merge the two data.tables; the remainder of this statement (which also builds a date string
# with sep = "-" and format "%Y-%m-%d" before merging with the weather data) is truncated in this excerpt
# baa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,
#   list(Airport = Origin, ...
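The aggregation above follows data.table's DT[i, j, by] idiom: filter rows in i, compute named summaries in j, and group with by. A tiny self-contained illustration of the pattern (toy data, not the airline dataset):

library(data.table)

# Toy flight table: origin airport and departure delay in minutes
dt <- data.table(Origin   = c("JFK", "JFK", "ORD", "ORD", "ORD"),
                 DepDelay = c(5, 40, 20, NA, 90))

# For delayed flights only (DepDelay > 15), count them and average the delay per airport
dt[DepDelay > 15,
   list(DelayedFlights = .N,
        AvgDelayMins   = round(mean(DepDelay, na.rm = TRUE), 2)),
   by = Origin]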

Google+ via XPath Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn't bother me too much, but I've been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post of a status update page. I was originally going to do one for Facebook, but this just seemed more interesting, so maybe I'll leave that for next week if I get time. Anyway, here's how it works (full code link at the end of the post): you simply supply the function with a Google+ post page URL and it scrapes whatever information it can from each post on the page. It doesn't load more data after the initial set because I don't really understand how to do that. The "load more" control appears in the page source as <span role="button" title="Load more posts" tabindex="0">More</span>, but how one would use that from a scraper is beyond me. The full code can be found here.

Maps with R (II) In my last post I described how to produce a multivariate choropleth map with R. Now I will show how to create a map from two raster files: one of them is a categorical (factor) variable which will group the values of the other. Thus, once again, I will superpose several groups in the same map. First, let's load the packages. Next, I define the geographical extent to be analyzed (approximately India and China). The first raster file is the population density of our planet, available at the NEO-NASA webpage (choose the Geo-TIFF floating option, ~25 Mb). The second raster file is the land cover classification (also available at the NEO-NASA webpage); the codes of the classification are described here. EDIT: Following a question from a user of rasterVis, I include some lines of code to display this qualitative variable in the map. A histogram shows the distribution of the population density in each land class. Everything is ready for the map (a condensed sketch of the ingredients follows). And that's all.
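A condensed sketch of those ingredients, assuming the two NEO GeoTIFFs have been downloaded as pop_density.tif and land_cover.tif (placeholder file names) and using the raster and rasterVis packages:

library(raster)
library(rasterVis)

# Approximate extent covering India and China (placeholder bounding box)
ext <- extent(65, 135, 5, 55)

# Population density and land-cover rasters, cropped to the region of interest
pop  <- crop(raster("pop_density.tif"), ext)
land <- crop(raster("land_cover.tif"), ext)

# Treat the land-cover codes as a categorical (factor) layer
land <- ratify(land)

# Distribution of population density within each land-cover class
histogram(~ values(pop) | factor(values(land)), xlab = "Population density")

# Map of the population density
levelplot(pop)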

How to Make HTML5 Slides with knitr One week ago I made an early announcement about the markdown support in the knitr package and RStudio, and now version 0.5 of knitr is on CRAN, so I'm back to show you how I made the HTML5 slides. For those who are not familiar with markdown, you may read the traditional documentation, but RStudio has a quicker reference (see below). The problem with markdown is that the original invention seems to be too simple, so quite a few variants were derived later (e.g. to support tables); that is another story, and you do not need to worry much about it. Before you get started, make sure your knitr version is at least 0.5:

# install.packages(c('knitr', 'XML', 'RCurl'))
update.packages(ask = FALSE)
packageVersion('knitr') >= 0.5

Editor: RStudio. You need to install the RStudio preview version to use its new features on markdown support. The MD button in the toolbar shows a quick reference of the markdown syntax, which I believe you can learn in 3 minutes. Converter: Pandoc.

GScholarScraper function with XPath Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which, when given a search string, will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation. This was of interest to me because around the same time I posted an independent Google Scholar scraper function, get_google_scholar_df(), which does a similar job to the scraping part of Kay's function but uses XPath (whereas he had used regular expressions). My function works as follows: given a Google Scholar URL, it extracts as much information as it can from each search result on the page into different columns of a data frame. In the comments of his blog post I figured it'd be fun to hack his function to provide an XPath alternative, GScholarXScraper; swapping in the XPath-based scraper is pretty much everything I added. The resulting word cloud is shown in the post, along with the one produced when the 'description' field is used instead of the 'title' field. Not bad. The full code is linked in the original post.
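A minimal sketch of the word-cloud step applied to the scraped titles, using the tm and wordcloud packages; this illustrates the general idea rather than reproducing Kay's or the hacked GScholarXScraper code, and assumes df is a data frame returned by get_google_scholar_df():

library(tm)
library(wordcloud)

# Tokenise the scraped titles and drop stopwords and very short tokens
words <- unlist(strsplit(tolower(na.omit(df$title)), "\\W+"))
words <- words[!words %in% stopwords("en") & nchar(words) > 2]

# Plot the most frequent words, sized by frequency
freq <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 50, random.order = FALSE)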

Great Maps with ggplot2 The above map (and this one) were produced using R and ggplot2 and serve to demonstrate just how sophisticated R visualisations can be. We are used to seeing similar maps produced with conventional GIS platforms or software such as Processing, but I hadn't yet seen one from the R community (feel free to suggest some in the comments). The map contains three layers: buildings, water and the journey segments. The most challenging aspect was to change the standard line ends in geom_segment from "butt" to "round" so that the lines appear continuous rather than showing "cracks"; see below. I am grateful to Hadley and the rest of the ggplot2 Google Group for the solution.

# if your map data is a shapefile, use maptools
library(maptools)
gpclibPermit()

# wrapper around GeomSegment2 (defined in the full post), a modified GeomSegment
# that draws rounded line ends
geom_segment2 <- function(mapping = NULL, data = NULL, stat = "identity",
                          position = "identity", arrow = NULL, ...) {
  GeomSegment2$new(mapping = mapping, data = data, stat = stat,
                   position = position, arrow = arrow, ...)
}

# load spatial data (elided)
# create base plot: plon1 (elided)
# create final plot: plon2 (elided)
plon2
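For comparison, a minimal sketch of the same layered idea with current ggplot2, where geom_segment() accepts a lineend argument directly (an alternative to the geom_segment2 workaround above); the data frames buildings, water and journeys and their columns are placeholders, not the post's data:

library(ggplot2)

# buildings, water: fortified polygon data frames; journeys: one row per trip segment
# (all three are placeholders standing in for the post's data)
ggplot() +
  geom_polygon(data = buildings, aes(long, lat, group = group), fill = "grey30") +
  geom_polygon(data = water,     aes(long, lat, group = group), fill = "steelblue") +
  geom_segment(data = journeys,
               aes(x = x0, y = y0, xend = x1, yend = y1),
               lineend = "round", colour = "white", alpha = 0.2) +
  coord_equal() +
  theme_void()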
