ELC 015: Don't Fear The Data
Collecting, analyzing and using big data has become embedded in how governments, corporations and institutions work. It’s now an expected, though often resented, part of our culture. In this episode, I interview Ellen Wagner, Ph.D., who helps us understand big data and how it can be leveraged to improve learning and development as well as higher education. Ellen is Partner and Senior Analyst for Sage Road Solutions.
What big data is and isn’t
Why the learning industry should be paying attention to big data
Decisions we can make with big data
Problems we can prevent
Using data to follow and get ahead of the trends
Why we push back on measurement and evaluation of our work
The benefits of being (partially) data driven
Patterns that people are finding in learning analytics
How to cooperate and share de-identified data
Difference between inferential and predictive statistics
Experience API
TIME: 30 minutes
How to Become a Data Scientist
These days you can get a degree in data science and show a diploma that certifies your credentials. But these programs are relatively new so, with all due respect, if you only recently got your degree you are still a beginner. Those of us who use this title today most likely came from combined backgrounds in business, hard science, computer science, operations research, and statistics. What you call yourself is one thing, but what your employer or client is looking for can be quite a different kettle of fish. A lot has been written about data scientists being as elusive as unicorns. Not being a unicorn, I’d say this sets the bar pretty high. All of this confusion over what we’re called and what we actually do can make you downright schizophrenic.
Four Types of Data Scientists
The information here comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. There are 40 pages of good analysis there, so this will be only the highest-level summary.
Small Pond Science | Research in a teaching institution
District Data Labs - How to Transition from Excel to R
An Intro to R for Microsoft Excel Users
Tony Ojeda
In today's increasingly data-driven world, business people are constantly talking about how they want more powerful and flexible analytical tools, but are usually intimidated by the programming knowledge these tools require and the learning curve they must overcome just to be able to reproduce what they already know how to do in the programs they've become accustomed to using. If you're an Excel user and you're scared of diving into R, you're in luck. Excited? Quick note before we do: There are usually multiple ways to do everything in R.
The Basics
Let's start with the basics. You'll also want to install and load the ggplot2 library, which not only contains the data set we want to use but will also come in handy when we get to creating charts and graphs later.
    install.packages("ggplot2")
    install.packages("dplyr")
    library(ggplot2)
    library(dplyr)
OK, so let's take an initial look at the data.
Summaries
Big Data Timeline Interesting interactive timeline featuring a number of "big data" milestones since 1932. There's way too much emphasis on BI, ERP and SAP, but still, it contains lots of interesting history when you filter out these references. Big data, back in 1940 Here are some highlights: July 1997 - The Problem of Big Data. Check the presentation. Other links
Myths and Legends about Big Data / VimpelCom (Beeline) company blog
One of our clusters for pilot tasks (Data node: 18 servers /2 CPUs, 12 Cores, 64GB RAM/, 12 Disks, 3 TB, SATA; HP DL380g)
Q: What is Big Data in general?
Everyone knows that it means processing huge volumes of data. But, for example, working with an Oracle database of 20 gigabytes or 4 petabytes is not yet Big Data; it is simply a high-load database.
Q: So what is the key difference between Big Data and "ordinary" high-load systems?
The ability to build flexible queries.
Q: Where does this new load come from?
Q: Is there an example of such a task?
Q: And how is it solved?
Q: So let's just scale them, and the problem will be solved?
Q: So what do we get in the end?
Q: But that is monstrously slow, isn't it?
Short queries with a small number of joins.
Q: What are some well-known examples of Big Data in use?
Q: Then why is everyone at conferences talking about Big Data?
Q: So one of the goals of Big Data is the ability to get away from long project cycles?
Q: Are there examples of already solved problems where this was visible?
Q: What is the structure of the platform?
The Tree of Life
refsmmat
Data Science Dictionary
We created a data science dictionary in 2012, and we are still adding keywords. It is also in our Wiley book (better English, recent update). Here we share with you another similar dictionary, from BigDataProjects.org. Here are the first few entries, from the Techniques section (there are two sections: techniques and technologies):
A/B testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e., changes) will improve a given objective variable, e.g., marketing response rate. Big data enables huge numbers of tests to be executed and analyzed, ensuring that groups are of sufficient size to detect meaningful (i.e., statistically significant) differences between the control and treatment groups. When more than one variable is simultaneously manipulated in the treatment, the multivariate generalization of this technique, which applies statistical modeling, is often called “A/B/N” testing.
Association rule learning:
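To make the A/B testing entry above concrete, here is a minimal sketch in Python of one common way to check whether a control group and a single treatment group differ by more than chance: a pooled two-proportion z-test. The function name and the visitor and response counts are hypothetical, chosen only for illustration; the dictionary entry does not prescribe any particular test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test (illustrative helper, not from the source).

    conv_a, n_a: responders and group size for the control group.
    conv_b, n_b: responders and group size for the treatment group.
    Returns (z, two-sided p-value).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled response rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 200 of 10,000 respond in control, 260 of 10,000 in treatment.
    z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A small p-value suggests the difference in response rates is unlikely to be due to chance alone. With many test groups, as in the multivariate case the entry mentions, a correction for multiple comparisons would normally be needed as well.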
Data Science Cheat Sheet
I will update this article regularly. An old version can be found here and has many interesting links. None of the material presented here is in the old version.
1. A laptop is the ideal device. Even if you work heavily on the cloud (AWS, or in my case, access to a few remote servers mostly to store data, receive data from clients and backups), your laptop is your core device to connect to all external services (via the Internet).
2. Once you have installed Cygwin, you can type commands or execute programs in the Cygwin console.
Figure 1: Cygwin (Linux) console on Windows laptop
You can open multiple Cygwin windows on your screen(s). To connect to an external server for file transfers, I use the Windows FileZilla freeware rather than the command-line ftp offered by Cygwin. You can run commands in the background using the & operator.
    $ notepad VR3.txt &
A few more things about files
Other extensions include
Files are not stored exactly the same way in Windows and UNIX.
File management
3. Examples
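One well-known difference behind the remark that files are not stored the same way in Windows and UNIX is the line ending: Windows text files conventionally end each line with CR LF (`\r\n`), while UNIX uses LF (`\n`) alone. The sketch below, in Python, shows one way to convert between the two. It assumes this is the difference meant; the function names are hypothetical and not from the article.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows-style CR LF line endings to UNIX-style LF."""
    return data.replace(b"\r\n", b"\n")


def to_windows_newlines(data: bytes) -> bytes:
    """Convert line endings to Windows-style CR LF.

    Normalizing to LF first avoids turning an existing CR LF into CR CR LF.
    """
    return to_unix_newlines(data).replace(b"\n", b"\r\n")


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\r\n"
    # repr() makes the otherwise invisible line-ending bytes visible.
    print(repr(to_unix_newlines(sample)))
```

Working on bytes rather than text keeps the conversion independent of the file's character encoding.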
Dynamic Ecology | Multa novit vulpes