htmlToText(): Extracting Text from HTML via XPath
Converting HTML to plain text usually involves stripping out the HTML tags whilst preserving the most basic of formatting. I wrote a function to do this, which works as follows (the code can be found on GitHub):
The above uses an XPath approach to achieve its goal. Another approach would be to use a regular expression.
Regular Expressions
One approach is to use a smart regular expression which matches anything between "<" and ">" if it looks like a tag and rips it out, e.g. I got the regular expression in "pattern" in the code above from a quick Google search, which turned up this webpage from 2004. I'm still learning regex and I must confess to finding this one slightly intimidating. This approach would require building ever more sophisticated regular expressions, or filtering through a series of different regular expressions, to get the desired result once these complications are taken into account.
XPath
Another approach is to use XPath. It returned only three lines.
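As a minimal sketch of the XPath approach (assuming the RCurl and XML packages; this is a simplification, not the full htmlToText() from GitHub):

library(RCurl)
library(XML)

htmlToText_sketch <- function(u) {
  html <- getURL(u, followlocation = TRUE)  # fetch the raw page source
  doc <- htmlParse(html, asText = TRUE)     # build a parse tree, tolerant of malformed HTML
  # "//text()" selects every text node, i.e. everything that is not markup;
  # the real function also filters out <script> and <style> content
  txt <- xpathSApply(doc, "//text()", xmlValue)
  paste(txt, collapse = " ")
}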
Video Boosts Brand Engagement, Site Visits Many marketers have moved past a direct-response-centric model for online display advertising, recognizing that despite low clickthrough rates, banner ads also have a branding effect. And research suggests that adding rich media or video to those banner ads can improve both types of response—increasing the likelihood users will click the ads as well as boosting the lingering brand awareness that results from viewing. Ad solution provider MediaMind found that web users in North America who were exposed to a campaign that included rich media display ads were nearly three times as likely as those who saw only standard banners to end up at a marketer’s website—either by clicking on the ad directly or by navigating to the site at a later date. Those exposed to banners that included online video were about 5.6 times as likely to visit a marketer’s site as those exposed to standard banners. The branding effect was smaller, but still evident.
GScholarXScraper: Hacking the GScholarScraper function with XPath
Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which, when given a search string, will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation. This was of interest to me because around the same time I had posted an independent Google Scholar scraper function, get_google_scholar_df(), which does a similar job to the scraping part of Kay's function using XPath (whereas he had used Regular Expressions). My function works as follows: given a Google Scholar URL, it extracts as much information as it can from each search result on the page into different columns of a data frame. In the comments of his blog post I mentioned that it would be fun to hack his function to provide an XPath alternative, GScholarXScraper. I think that's pretty much everything I added.
[image: word cloud built from the 'title' field]
The following is produced if we look at the 'description' field instead of the 'title' field:
[image: word cloud built from the 'description' field]
Not bad. Code:
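The original code is not reproduced in this excerpt, but as a rough sketch of the general idea (not Kay's function or mine), titles scraped by get_google_scholar_df() can be fed straight into the wordcloud package:

library(wordcloud)
# assumes get_google_scholar_df() from the companion post is already defined;
# the query URL is only an example
df <- get_google_scholar_df("http://scholar.google.com/scholar?q=data+mining")
words <- unlist(strsplit(tolower(df$title), "[^a-z]+"))  # crude tokenisation
words <- words[nchar(words) > 3]                         # drop very short tokens
freq <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.integer(freq), min.freq = 2)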
The Birthday Problem May 16, 2012 Many of you have seen The Birthday Problem: given a group of n people, what is the probability that at least two of them share a birthday? Here we are concerned only with birth day and month (not year). The solution assumes that a person is equally likely to be born on any of the 365 days in the year, thus ignoring leap years. Let P(n) = the probability that someone shares a birthday in a group of n people and let Q(n) = the probability that everyone has unique birthdays. Then P(n) = 1 – Q(n) = 1 – (365*364*…*(365-n+1))/365^n. [plot: P(n) against n] P(60) = 0.9941, so in a room with 60 people you are almost certain to have at least two people who share a birthday! The key assumption is that all birth dates are equally likely, which real birth data contradicts; since a uniform distribution is exactly what minimises the chance of a shared birthday, non-uniform birth dates will, of course, change our answer above, making a match slightly more likely. On a side note: how likely are people to be born on different birth dates? The image below suggests that babies are induced on December 27-30 for a tax break. [image: frequency of births by calendar date]
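These numbers are easy to verify; a small R sketch (not part of the original post):

p_shared <- function(n) 1 - prod((365 - 0:(n - 1)) / 365)  # P(n) = 1 - Q(n)
p_shared(23)  # about 0.507: better-than-even odds with only 23 people
p_shared(60)  # about 0.994, the figure quoted above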
Web Scraping Google Scholar: Part 2 (Complete Success)
library(RCurl)
library(XML)

get_google_scholar_df <- function(u) {
  html <- getURL(u)
  doc <- htmlParse(html)

  # Helper: apply an XPath expression to each 'gs_r' result node in turn,
  # returning NA for results where the requested node is missing.
  GS_xpathSApply <- function(doc, path, FUN) {
    path.base <- "/html/body/div[@class='gs_r']"
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub(path.base,
      paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  df <- data.frame(
    footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
    title = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
    type = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
    publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_a']", xmlValue),
    stringsAsFactors = FALSE)
  df <- df[, -1]  # drop the first (footer) column
  return(df)
}
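Usage is then a one-liner (the query URL is only an example, and Google Scholar's markup changes over time, so the hard-coded XPaths may need updating):

u <- "http://scholar.google.com/scholar?q=data+mining"
df <- get_google_scholar_df(u)
head(df$title)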
Duncan Williamson: box and whisker diagrams: getting Microsoft Excel to plot them for you Box and Whisker Diagrams: The Microsoft Excel 2003 Solution Introduction See also pages two and three in this series for more up-to-date versions of this page. The updated versions are also explained in full in Chapter Seven of my Excel book, The Excel Project (Kindle and paperback versions). This page takes us a stage further down the road of analysing the dispersion or variability of data sets. We used the standard deviation as a measure that allowed us to compare data sets and say that if one data set had a standard deviation of, say, 3 and the other had a standard deviation of, say, 1.2, then we could conclude that the first data set was more dispersed than the second. The problem with the standard deviation, however, is that many people find it an abstract idea, and because of that they find it difficult both to calculate and to interpret. Let x1, x2, …, xn be a set of n measurements … arranged in increasing (or decreasing) order. Your Turn
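The page itself demonstrates everything in Excel 2003; purely as a cross-check, here is a minimal R sketch of the same five-number summary on a small made-up data set:

x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)  # hypothetical measurements
sort(x)      # the measurements arranged in increasing order
fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
boxplot(x)   # the box and whisker diagram itself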
untitled Abstract The idea here is to provide simple examples of how to get started with processing XML in R, using some reasonably straightforward "flat" XML files and not worrying about efficiency. Here is an example of a simple XML file containing grades for students on three different tests: <?xml version="1.0" ?> We might want to turn this into a data frame in R with a row for each student and four variables: the name and the scores on the three tests. Since this is a small file, let's not worry about efficiency in any way. doc = xmlRoot(xmlTreeParse("generic_file.xml")) We use xmlRoot() to get the top-level node of the tree rather than holding onto the general document information, since we won't need it. Since the structure of this file is just a list of elements under the root node, we need only process each of those nodes and turn them into something we want. So a function to do the initial processing of an individual <GRADES> node might be: function(node) xmlSApply(node, xmlValue)
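Putting the pieces together, a minimal sketch (the element layout is an assumption, since the example file is not reproduced above: one <GRADES> node per student, with a name child followed by three test-score children):

library(XML)
doc <- xmlRoot(xmlTreeParse("generic_file.xml"))
# one named character vector of child values per student
rows <- xmlSApply(doc, function(node) xmlSApply(node, xmlValue))
# transpose so students are rows and the four variables are columns
df <- as.data.frame(t(rows), stringsAsFactors = FALSE)
df[-1] <- lapply(df[-1], as.numeric)  # scores to numeric; the name stays character
df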
Statistics vs Data Science vs BI As someone who trained as a statistician, I've always struggled with that title. I love the rigor and insight that Statistics brings to data analysis, but let's face it: Statistics — the name — has always had a bit of a branding problem. Telling someone I was a statistician was more likely to conjure up images of me counting runs at a baseball (or cricket) game than of me pursuing serious science. And the image of what Statistics ideally is about — collaborative, interactive, applied, fun — was too often subsumed by the stereotypical image — isolated, actuarial, ivory tower, report driven. (And hey, even actuaries can be fun sometimes.) That's why I'm a fan of the term "data scientist" — it embodies everything that Statistics always should be, without the baggage and tradition of the term "statistician". On the other hand, I have no qualms about making a competitive comparison between Data Science and Business Intelligence: Kalido: Data Scientist: Your Must-Have Business Investment NOW
A Short Introduction to the XML package for R To parse an XML document, you can use xmlInternalTreeParse() or xmlTreeParse() (with useInternalNodes specified as TRUE or FALSE) or xmlEventParse(). If you are dealing with HTML content, which is frequently malformed (e.g. nodes not terminated, attributes not quoted), you can use htmlTreeParse(). You can give these functions the name of a file, a URL (HTTP or FTP) or XML text that you have previously created or read from a file. If you are working with small to moderately sized XML files, it is easiest to use xmlInternalTreeParse() to first read the XML tree into memory: doc = xmlInternalTreeParse("Install/Web/index.html.in") Then you can traverse the tree looking for the information you want and putting it into different forms. Many people find recursion confusing, and when coupled with the need for non-local variables and mutable state, a different approach can be welcome; XPath lets you declare what you want instead. For example, to collect every link in the document: src = xpathApply(doc, "//a[@href]", xmlGetAttr, "href")
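A self-contained version of the snippet above (the URL is only an example):

library(XML)
doc <- htmlTreeParse("http://www.omegahat.org/", useInternalNodes = TRUE)
# collect the value of every href attribute on every <a> that has one
src <- unlist(xpathApply(doc, "//a[@href]", xmlGetAttr, "href"))
head(src)
free(doc)  # release the C-level document when finished with it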
Memory Management in the XML Package Duncan Temple Lang University of California at Davis Department of Statistics Abstract We describe some of the complexities of memory management in the XML package. In R, garbage collection just works as the user expects: when an object is no longer referenced, it is available to be cleaned up and the memory reused. Another more complex situation is when we have two or more R objects that reference a shared value, e.g. x = a; y = a. R's computational model is that when this happens, the assignment to x and y causes the right hand side (the value of a) to be copied. How does this relate to the XML package? doc = xmlParse("<foo><bar>Some text</bar></foo>", asText = TRUE) Then we use some mechanism to get references to nodes within that document, e.g. nodes = getNodeSet(doc, "//bar") Now, if we remove doc, or simply return nodes from a function that has doc as a local variable, we have a problem: the nodes point into memory owned by the document. Now, while we have references to the individual nodes, all is well. There is a yet more complicated issue.
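A small sketch of the situation being described (nodes returned by getNodeSet() are pointers into C-level memory owned by doc):

library(XML)
doc <- xmlParse("<foo><bar>Some text</bar></foo>", asText = TRUE)
nodes <- getNodeSet(doc, "//bar")
xmlValue(nodes[[1]])  # "Some text" while doc is alive
rm(doc); gc()         # dropping doc is the risky step discussed above;
# whether nodes[[1]] remains valid depends on the package's reference
# counting of the underlying document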
The XML package. It's crantastic! Tools for parsing and generating XML within R and S-Plus. This package provides many approaches for both reading and creating XML (and HTML) documents (including DTDs), both local and accessible via HTTP or FTP. It also offers access to an XPath "interpreter". Maintainer: Duncan Temple Lang. Author(s): Duncan Temple Lang (duncan@r-project.org). License: BSD. Latest version: XML_3.96-1.1, released 10 months ago (28 previous versions).
How can I transform XML data into a data.frame? | TecHerald.com
I am trying to learn the R XML package. I am trying to create a sample data.frame from the books.xml XML data file. This is what I have:
library(XML)
books <- "
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)
None of these xpathSApply attempts gets me anywhere near what I intend.
Shane said: In general, I would suggest trying the xmlToDataFrame() function, but I think that will be quite difficult here because the data isn't well structured to begin with. I recommend working with xmlToList(books) instead. One of the problems is that there are several authors per book, so you have to decide how to handle that when structuring your data frame. Here is a complete example (excluding the author):
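The complete example is cut off in this excerpt; below is a hedged sketch of what the xmlToList() route might look like, assuming books.xml has the usual catalog shape (a series of <book> nodes with <title>, <genre> and <price> children plus one or more <author>s):

library(XML)
books <- xmlParse("books.xml")
lst <- xmlToList(books)  # one list element per <book>
# drop the repeatable author field (and any attributes) before flattening
flat <- lapply(lst, function(b) unlist(b[setdiff(names(b), c("author", ".attrs"))]))
df <- as.data.frame(do.call(rbind, flat), stringsAsFactors = FALSE)
df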