5 lessons we learned about data science in 2013 - VentureBeat Most people know what marketing executives do every day. They try to catch people’s attention through email, ads, tweets, and press releases. As for data scientists, well, their work is not nearly as well understood. That’s been changing this year as companies slowly loosen up about letting their hard-won data scientists talk about their work. This year, VentureBeat has learned a lot about these fawned-over specimens. Data scientists should be creative This point became clear as Jeremy Howard, the former president of data science competition-holder Kaggle, spoke with fellow luminaries in the field at VentureBeat’s 2013 DataBeat/Data Science Summit event a few weeks ago. Choose a business problem and then the tools, not the other way around What’s coming in 2014
Data Mining Course - Data Science, Big Data Analytics Course content and objectives DATA MINING - DATA SCIENCE Data Mining DATA MINING, shorthand for "Knowledge Discovery in Databases" (KDD; "Extraction de Connaissances à partir de Données" in French), is a very fashionable field. Reading the various documents that try, with varying success, to define exactly what data mining is, one might conclude that we have in fact been practicing it for more than 30 years under the names of data analysis and exploratory statistics. And that would not be entirely wrong. In reality, it is not that simple. Data mining brings several new elements that are far from negligible: (1) analysis techniques that are not part of statisticians' culture, coming from machine learning (artificial intelligence), pattern recognition, and databases; (2) knowledge extraction is integrated into the company's organizational structure. Target audience
The Data Engineering Ecosystem: An Interactive Map By David Drummond (Insight Data Engineering Program Director) and John Joo (Insight Data Engineering and Data Science Program Director), March 6, 2015. Companies, non-profit organizations, and governments are all starting to realize the huge value that data can provide to customers, decision makers, and concerned citizens. What is often neglected is the amount of engineering required to make that data accessible. Simply using SQL is no longer an option for large, unstructured, or real-time data. Building a system that makes data usable becomes a monumental challenge for data engineers. There is no plug-and-play solution that solves every use case. Insight Data Engineering Fellows face these same questions when they begin working on their data pipelines. Of course, there are more tools than we can possibly cover in a single chart, and many of them cannot be strictly categorized.
Learning the meaning behind words Today computers aren't very good at understanding human language, and that forces people to do a lot of the heavy lifting: for example, speaking "searchese" to find information online, or slogging through lengthy forms to book a trip. Computers should understand natural language better, so people can interact with them more easily and get on with the interesting parts of life. While state-of-the-art technology is still a ways from this goal, we’re making significant progress using the latest machine learning and natural language processing techniques. Deep learning has markedly improved speech recognition and image classification. For example, we’ve shown that computers can learn to recognize cats (and many other objects) just by observing a large number of images, without being trained explicitly on what a cat looks like. Now we apply neural networks to understanding words by having them “read” vast quantities of text on the web.
MapReduce and Spark | Cloudera VISION About a week ago, I posted an article on Cloudera’s strategy on SQL in the Apache Hadoop ecosystem. In the article, I argued that a special-purpose distributed query processing engine will perform better than one that translates work into a general-purpose MapReduce framework, even if MapReduce is improved to trim latency and improve throughput. Notwithstanding that bet, we simultaneously believe that the ecosystem needs a high-performance alternative to the current MapReduce implementation. In this piece, I want to walk through the history, the current status and the short- and long-term future of the Hadoop platform, concentrating especially on MapReduce. Where We Came From The earliest instance of the architecture at Google combined flexible, scalable storage with a single processing framework — MapReduce — to handle a wide variety of processing and analytic workloads. Where We Are Enter Spark The leading candidate for “successor to MapReduce” today is Apache Spark. The Near Future
Course materials -- Data Mining and Data Science This page lists the materials used for my courses in Machine Learning, Data Mining, and Data Science in the Department of Computer Science and Statistics (Département Informatique et Statistique, DIS) at Université Lyon 2, mainly in the second-year Master's program Statistique et Informatique pour la Science des donnéEs (SISE), a data science program, in the context of statistical data processing and making use of big data. I pay close attention to the strong synergy between computer science and statistics in this degree; these are the essential pillars of the data scientist profession. Note that most of these are slides printed to PDF, and therefore not very formal; they mainly emphasize the main thread of the field under study and list the key points. This page is of course open to all statisticians, data miners, and data scientists, students or not, from Université Lyon 2 or elsewhere. We thank you in advance. Ricco Rakotomalala – Université Lyon 2
Active learning, almost black magic | Larsblog I've written Duke, an engine for figuring out which records represent the same thing. It works fine, but people find it difficult to configure correctly, which is not so strange. Getting the configurations right requires estimating probabilities and choosing between comparators like Levenshtein, Jaro-Winkler, and Dice coefficient. Can we get the computer to do something people cannot? I implemented a genetic algorithm that can set up a good configuration automatically. But that leaves us with a bootstrapping problem. So, what to do? Then I came across a paper where Axel Ngonga described how to solve this problem with active learning. How to pick those pairs? What's fascinating is that this almost ridiculously simple solution actually works. To make this a little more concrete, let's look at some real-world examples. Linking countries Let's start with the linking countries example. This example is almost too easy, though, so let's try something a little harder. Cityhotels.com
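As an aside on the comparators named in the Duke excerpt above, here is a minimal Python sketch of two of them, Levenshtein and Dice coefficient. It is not Duke's own code, and the function names are hypothetical. Two choices here are assumptions rather than anything stated in the excerpt: scaling the edit distance into a 0-to-1 similarity by dividing by the longer string's length, and computing Dice over whitespace-separated tokens.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical.
    The scaling by the longer length is an assumption of this sketch."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over lower-cased whitespace tokens:
    2 * |A & B| / (|A| + |B|). Token-based comparison is an assumption."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(dice_coefficient("Hotel Bristol Oslo", "Bristol Hotel"))  # 0.8
```

The two measures react differently to the same data: the edit distance penalizes reordered words heavily, while the token-based Dice score ignores word order entirely. That difference gives a feel for why choosing a comparator for each field by hand is hard.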