Use H2O and data.table to build models on large data sets in R. Introduction. Last week, I wrote an introductory article on the data.table package. It was meant to give you a head start in becoming familiar with the package's short, unique syntax. The next obvious step is to focus on modeling, which is what we will do in this post.
With data.table, you no longer need to worry so much about your machine's cores (or, to some extent, its RAM). At least, I used to think of myself as a crippled R user when faced with large data sets. Last week, I received an email saying: "Okay, I get it. data.table empowers us to do data exploration and manipulation." I'm sure there are many R users trapped in a similar situation. For practical understanding, I've taken the data set from a previously held competition and tried to improve the score using 4 different machine learning algorithms (with H2O) and feature engineering (with data.table).
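As a rough sketch of the workflow this post follows, the snippet below reads a file with data.table, adds a grouped feature, and trains one H2O model. The file name, column names, and model settings are placeholders for illustration only and do not correspond to the actual competition data.

library(data.table)
library(h2o)

# fast file reading with data.table (file name is hypothetical)
train <- fread("train.csv")

# data.table's DT[i, j, by] form: add a grouped feature in one call
# 'value' and 'group' are placeholder column names
train[, mean_value_by_group := mean(value), by = group]

# start a local H2O cluster on all available cores and move the data into it
h2o.init(nthreads = -1)
train_h2o <- as.h2o(train)

# one of several algorithms one could try, e.g. a gradient boosting machine
# 'target' is a placeholder for the response column
fit <- h2o.gbm(y = "target",
               x = setdiff(names(train), "target"),
               training_frame = train_h2o,
               ntrees = 100)

The rest of this post fills in the actual data set, the feature engineering, and the other algorithms.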
Table of Contents: What is H2O? Note: Consider this article a starter's guide to model building with data.table and H2O. What is H2O? PMML and Hadoop. A reference to Hadoop implies huge amounts of data. The intent, of course, is to derive insights from that data that will help businesses stay competitive. "Scoring" the data is a common exercise in determining, for example, customer churn, fraud detection, risk mitigation, and so on. It is one of the slowest analytics activities, especially when very large data sets are involved. There are various fast scoring products on the market, but they are very specialized and/or provided by a single vendor, usually requiring the entire scoring process to be done with that vendor's tool set.
This poses a problem for those who build their scoring models using tools other than those of the scoring engine vendor. There is a more democratic way of doing scoring. Hive makes it possible for large datasets stored in Hadoop-compatible systems to be easily analyzed. Once deployed in UPPI, predictive models expressed in PMML turn into SQL functions. Stinger and Tez: A primer - Big Data Analytics News. What is Stinger? The Stinger initiative aims to redesign Hive to make it what people want today: Hive is currently used for large batch jobs and works great in that sense, but people also want interactive queries, and Hive is too slow for that today. So a big driver is performance, with the aim of making Hive 100x faster. Additionally, Hive doesn't support all of the native SQL features needed for analytics (e.g. date calculations, windowing), so another goal is to add these in order to connect Hive to a BI stack and put it in the hands of SQL power users. Stinger also covers work in HDFS and YARN. This work is done as open source by Hortonworks, but also by SAP, Microsoft, and Facebook.
Why build upon Hive rather than build a new system? Hive is already out there, in use, and working at scale, and scale is hard to get right. Also, from a vision perspective, users need only one SQL tool, not a fragmented toolkit. Why is SQL compatibility so important? SQL is the English of the data world. Stinger is not focusing only on SQL. Apache Hadoop YARN: Present and Future. Hadoop Tutorial: Intro to HDFS. Upgrading Hortonworks HDP from 2.0 to 2.1 | Arne Weitzel. Today I upgraded my personal HDP cluster from version 2.0 to version 2.1. The cluster runs entirely in a CentOS 6 VM on my notebook, so it consists of a single node hosting the namenode, the datanode, and all other services.
Besides this, I have a second Linux VM hosting SAP Data Services 4.2 with connectivity to the HDP cluster. The HDP installation on that machine can be considered a kind of Hadoop client, and the HDP software on that VM needed to be upgraded as well. Given these resources, I use these VMs only for evaluating Hadoop functionality. For those who are in a similar situation and want to upgrade a test HDP 2.0 cluster, I'd like to share my experiences with this upgrade. Hortonworks has documented two upgrade approaches; I used the second option, with Ambari. If you follow all the documented steps thoroughly, everything should work fine. Section 1.10: Upgrading Ambari Server to 1.6.1 -> Upgrade the Nagios add-ons package. Post-Upgrade Issues. Connecting SAP DataServices to Hadoop: HDFS vs.... SAP DataServices (DS) supports two ways of accessing data on a Hadoop cluster: HDFS: DS reads HDFS files directly from Hadoop.
In DataServices you need to create HDFS file formats in order to use this setup. Depending on your dataflow, DataServices might read the HDFS file directly into the DS engine and handle the further processing there. If your dataflow contains more logic that can be pushed down to Hadoop, DS may instead generate a Pig script. The Pig script will then not only read the HDFS file but also handle the other transformations, aggregations, etc. from your dataflow. The latter scenario is usually preferred for large amounts of data, because the Hadoop cluster can then provide the processing power of many Hadoop nodes on inexpensive commodity hardware.
The pushdown of dataflow logic to Pig/Hadoop is similar to the pushdown to relational database systems. It is difficult to say which approach is better to use in DataServices: HDFS files/Pig, or Hive? Connecting SAP DataServices to Hadoop Hive. Connecting SAP DataServices to Hadoop Hive is not as simple as, for example, connecting to a relational database. In this post I want to share my experiences on how to connect DataServices (DS) to Hive.
The DS engine cannot connect to Hive directly. Instead, you need to configure a Hive adapter from the DS management console, which will actually manage the connection to Hive. In the rest of this post I will assume the following setup: DS is not installed on a node in the Hadoop cluster, but it has access to the cluster; the DS server runs on a Unix server (I think such a setup is typical in most cases); and the Hadoop cluster is a Hortonworks Data Platform (HDP) 2.x distribution. Roughly, there are two approaches for installing Hadoop on the DS server. The DS jobserver needs some Hadoop environment settings. Test the Hive connection. Configure the Hive adapter. 3.2 Setup of the DS datastore.
DataServices Text Analysis and Hadoop - the Det... I have already used the text analysis feature of SAP DataServices in various projects (the transform in DataServices is called Text Data Processing, or TDP for short). Usually, the TDP transform runs in the DataServices engine, meaning that DataServices first loads the source text into its own memory and then runs the text analysis on its own server/engines. The text sources are usually unstructured text or binary files such as Word, Excel, or PDF files. If these files reside on a Hadoop cluster as HDFS files, DataServices can also push down the TDP transform as a MapReduce job to the Hadoop cluster. Running the text analysis within Hadoop (that is, as MapReduce jobs) can be an appealing approach if the total volume of source files is large and the Hadoop cluster has enough resources. Quality assurance of the text analysis process: doing manual spot checks by reviewing the text sources.
Anyway, in all these scenarios the text documents are only used as a source. CSV files: Business Intelligence: An Integration of Apache. Last week was very busy for attendees at SAP TechEd Las Vegas, so fortunately SAP has made recordings of some sessions available. Today I watched An Integration of Apache Hadoop, SAP HANA, and SAP BusinessObjects (session EA204) with SAP's Anthony Waite. Text Analysis came up a few times last week, plus I am familiar with the Data Services Text Analysis features. First, a review of how text mining fits in. Figure 1: Source: SAP. Figure 1 shows we have "lots of unstructured data", with 80% of data being unstructured. Unstructured data gets messy; think of an MS Word file: can you run that through your system/process?
The hot topic is analyzing social networks for sentiment. Customer preferences, for example, can be mined. Figure 2: Source: SAP. Anthony said you typically don't go into your BI tool and search for unstructured information; it is challenging and intensive to process and analyze. Figure 3: Source: SAP. HDFS, the Hadoop Distributed File System, is the essence. Figure 4: Source: SAP.
Don't use Hadoop - your data isn't that big - Chris Stucchio. "So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophyte: I know the concepts, I've written code, but never at scale. The next question they asked me was, "Could you use Hadoop to do a simple group by and sum?" They handed me a flash drive with all 600MB of their data on it (not a sample, everything). Hadoop is limiting. Scala-ish pseudocode: collection.flatMap( (k,v) => F(k,v) ).groupBy( _._1 ).map( _.reduce( (k,v) => G(k,v) ) ) SQL-ish pseudocode: SELECT G(...) Or, as I explained a couple of years ago: Goal: count the number of books in the library. Map: You count up the odd-numbered shelves, I count up the even-numbered shelves. Reduce: We all get together and add up our individual counts. The only thing you are permitted to touch is F(k,v) and G(k,v), except of course for performance optimizations (usually not the fun kind!). But my data is hundreds of megabytes! Too big for Excel is not "Big Data". P.S.
To Hadoop or Not to Hadoop? Hadoop is very popular, but it is not a solution for all Big Data cases. Here are the questions to ask to determine whether Hadoop is right for your problem. Guest blog by Anand Krishnaswamy, ThoughtWorks, Oct 4, 2013. Hadoop is often positioned as the one framework your organization needs to solve nearly all its problems. Mention "Big Data" or "Analytics" and pat comes the reply: Hadoop! Hadoop, however, was purpose-built for a clear set of problems; for others it is, at best, a poor fit or, even worse, a mistake. While data transformation (or, broadly, ETL operations) benefits significantly from a Hadoop setup, if your organization's needs fall into any of the following categories, Hadoop might be a misfit. 1.
While many businesses like to believe that they have a Big Data dataset, it is often not the case. Ask yourself: do I have several terabytes of data or more? 2. When submitting jobs, Hadoop's minimum latency is about a minute. What are the user expectations around response time? Busting 10 myths about Hadoop. Although Hadoop and related technologies have been with us for more than five years now, most BI professionals and their business counterparts still harbor a few misconceptions about Hadoop and related technologies such as MapReduce that need to be corrected.
The following list of 10 facts will clarify what Hadoop is and does relative to BI/DW, as well as the business and technology situations in which Hadoop-based business intelligence (BI), data warehousing (DW), data integration (DI), and analytics can be useful. Fact No. 1: Hadoop consists of multiple products. We talk about Hadoop as if it were one monolithic thing, but it is actually a family of open-source products and technologies overseen by the Apache Software Foundation (ASF).
(Some Hadoop products are also available via vendor distributions; more on that later.) The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. Ad-hoc query on Hadoop (TDWI/Infobright). GettingStarted - Apache Hive. Table of Contents. Installation and Configuration. You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from it.
Requirements: Java 1.7. Note: Hive versions 1.2 onward require Java 1.7 or newer; Hive versions 0.14 to 1.1 work with Java 1.6 as well. Installing Hive from a Stable Release. Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases). Next you need to unpack the tarball:

$ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:

$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source. The Hive GIT repository for the most recent Hive code is located here: git clone (the master branch).
As of 0.13, Hive is built using Apache Maven. Compile Hive on master. Compile Hive on branch-1, or: