The Endeavour — The blog of John D. Cook
I help people make decisions in the face of uncertainty. Sounds interesting. I’m a data scientist. Not sure what that means, but it sounds cool. I study machine learning. I’m into big data. Even though each of these descriptions makes a different impression, they’re all essentially the same thing. There are distinctions. “Decision-making under uncertainty” emphasizes that you never have complete data, and yet you need to make decisions anyway. “Data science” stresses that there is more to the process of making inferences than what falls under the traditional heading of “statistics.” Despite the hype around the term data science, it’s growing on me. Machine learning, like decision theory, emphasizes the ultimate goal of doing something with data rather than creating an accurate model of the process that generates the data. “Big data” is a big can of worms. Bayesian statistics is much older than what is now sometimes called “classical” statistics.

Fishing in the Bay » Blog Archive » Why I am in favour of logging
A colleague recently brought me some alternative fits he had done for a paper he was writing. The alternative fits looked very strange but had been strongly suggested by a referee. He was fitting a regression model to inter-country trade data and trying to explain patterns in terms of various measures of cultural fit. The referee was pointing to some papers in econometrics that had argued about the relative merits of multiplicative regression models fitted on the direct scale, rather than on the log-scale. The referee wanted a direct fit on the basis that the random errors may be more normal and additive on the direct scale. One of the papers he was pointing to (linked in the original post) contains the unequivocal recommendation: "Overall, except under very special circumstances, estimation based on the log-linear model cannot be recommended." Sounds like complete bollocks to me. Why is the log-transform better? Leverage effects can be huge on the direct scale. So there you have it.
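The leverage remark can be made concrete with a small sketch. The predictor values below are made up for illustration (they are not the colleague's trade data), and the helper name `leverages` is hypothetical. It computes the hat-matrix diagonal for a straight-line fit with an intercept, h_i = 1/n + (x_i - mean)^2 / Sxx, once on the direct scale and once on the log scale.

```python
import math

def leverages(xs):
    """Hat-matrix diagonal for a one-predictor regression with intercept."""
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1.0 / n + (x - mean) ** 2 / sxx for x in xs]

# Hypothetical, heavily right-skewed predictor (assumed data, for illustration only).
trade = [1, 2, 5, 10, 20, 50, 100, 200, 500, 10000]

direct = leverages(trade)                          # fit on the direct scale
logged = leverages([math.log(x) for x in trade])   # fit on the log scale

print("max leverage, direct scale:", max(direct))
print("max leverage, log scale:   ", max(logged))
```

With data this skewed, the largest observation has leverage close to 1 on the direct scale, so it very nearly fixes the fitted line by itself; on the log scale its leverage is much smaller and the other points get a say.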

Math ∩ Programming | A place for elegant solutions

Social Science Statistics Blog
28 April 2013
App Stats: Roberts, Stewart, and Tingley on "Topic models for open ended survey responses with applications to experiments"
We hope you can join us this Wednesday, May 1, 2013 for the Applied Statistics Workshop. Molly Roberts, Brandon Stewart, and Dustin Tingley, all from the Department of Government at Harvard University, will give a presentation entitled "Topic models for open ended survey responses with applications to experiments". A light lunch will be served at 12 pm and the talk will begin at 12.15.
"Topic models for open ended survey responses with applications to experiments"
Molly Roberts, Brandon Stewart, and Dustin Tingley
Government Department, Harvard University
CGIS K354 (1737 Cambridge St.)
Abstract: Despite broad use of surveys and survey experiments by political science, the vast majority of survey analysis deals with responses to options along a scale or from pre-established categories.
Posted by Konstantin Kashin at 11:25 PM

Philadelphia Software Developer
“Postgres for Developers” – Notes from PGConf NYC 2014
April 8th, 2014 — Code Examples
I saw a talk by one of the core Postgres developers, which showed a bunch of interesting tricks to handle business rules in Postgres-specific SQL.
Example 1: Array Aggregation
“array_agg” can be used to combine rows, which sort of resembles a pivot table operation (this is the same set of values that would be passed as arguments to other aggregation functions). If you use the above table as a common table expression, you can also rename the columns in the with block.
Example 2: Named Window Functions
I’m not sure yet whether this is just syntactic sugar or has real value, but you can set up named “windows.” By way of explanation, a lot of times when you start using aggregate functions (min, max, array_agg, etc.), you end up using window functions, which resemble the following: These allow you to calculate aggregate functions (like min/max) without combining all the rows.
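The SQL from the talk did not survive in these notes, so the following is only a rough Python analogue of the two ideas, not the speaker's examples. The `staff` table, its columns, and both helper names are hypothetical. The SQL in the comments uses `array_agg` and the `WINDOW` clause, which are standard Postgres features, against that assumed table.

```python
from collections import defaultdict

# Hypothetical rows of an assumed table staff(dept, name, salary).
rows = [
    ("eng", "ann", 120),
    ("eng", "bob", 100),
    ("ops", "cy", 90),
    ("ops", "dee", 95),
]

# Analogue of:
#   SELECT dept, array_agg(name) FROM staff GROUP BY dept;
# One output entry per group; the member rows are collapsed into a list.
# (Here names keep input order; Postgres needs an ORDER BY inside
# array_agg to guarantee any particular order.)
def array_agg_by_dept(rows):
    grouped = defaultdict(list)
    for dept, name, _salary in rows:
        grouped[dept].append(name)
    return dict(grouped)

# Analogue of a named window:
#   SELECT name, salary, min(salary) OVER w
#   FROM staff
#   WINDOW w AS (PARTITION BY dept);
# Every input row is kept; the aggregate is attached to each row.
def min_salary_over_dept(rows):
    lowest = {}
    for dept, _name, salary in rows:
        lowest[dept] = min(salary, lowest.get(dept, salary))
    return [(name, salary, lowest[dept]) for dept, name, salary in rows]

collapsed = array_agg_by_dept(rows)
windowed = min_salary_over_dept(rows)
print(collapsed)
print(windowed)
```

The contrast to notice is the row count: the grouped version returns one entry per department, while the windowed version returns as many rows as it was given.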

R-statistics blog

Statistics for a changing world: Google Public Data Explorer in Labs
Last year, we released a public data search feature that enables people to quickly find useful statistics in search. More recently, we expanded this service to include information from the World Bank, such as population data for every region in the world. More and more public agencies, non-profits and other organizations are looking for ways to open up their data and expand global access to this kind of information. We want to help keep that momentum going, so today we're sharing a snapshot of some of the most popular public data search topics on Google. We're also launching the Google Public Data Explorer, an experimental visualization tool in Google Labs.
Popular public data topics on Google
We know people want to be able to find reliable data and statistics on a variety of subjects. You can read the complete list at this link (PDF), but here's the top 20 to get you started: You'll notice some interesting entries in the list. Animated charts can bring data to life.

Understanding Shakespeare / Approaches

A guide to querying 'references' in the Content API | Open Platform
We have recently extended the ways that you can search our Content API to include queries with 'references'. You can query the API with an ISBN number, and see articles about the corresponding book, or by a MusicBrainz ID, and see articles about the artist or composer. Here are some answers to frequently asked questions about this feature.

What is the 'show-references' parameter?
The show-references= parameter has been added on content search, tag search and item endpoints.
show-references=isbn => display ISBN references where available.
show-references=musicbrainz,isbn => display MusicBrainz and ISBN references where available.
show-references=all => display all available references.

What is the 'reference-type' parameter?
The reference-type= parameter has been added on content search, tag search and item endpoints.
reference-type=musicbrainz,isbn => return content which has both an ISBN and MusicBrainz identifier associated.

What is the 'reference=' parameter? No. No. search?
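As a rough sketch of how the two documented parameters combine into a query string: the base URL below is a placeholder, not the real endpoint, and the helper name `build_search_url` is hypothetical. Only the parameter names `show-references` and `reference-type` and their comma-separated values come from the FAQ above; any other required parameters (such as authentication) are not covered.

```python
from urllib.parse import urlencode

# Placeholder base URL (assumption); substitute the real search endpoint.
BASE_URL = "https://api.example.com/search"

def build_search_url(show_references=None, reference_type=None):
    """Build a search URL using the reference parameters described above."""
    params = {}
    if show_references:
        # e.g. ["musicbrainz", "isbn"] or ["all"]
        params["show-references"] = ",".join(show_references)
    if reference_type:
        # Restrict results to content carrying all of these reference types.
        params["reference-type"] = ",".join(reference_type)
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return BASE_URL + "?" + urlencode(params, safe=",")

url = build_search_url(
    show_references=["musicbrainz", "isbn"],
    reference_type=["isbn"],
)
print(url)
```

Passing lists and joining them in one place keeps the comma-separated convention out of calling code.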
