
nathanmarz/storm – Storm, distributed and fault-tolerant realtime computation

So, you want to build a recommendation engine? At PredictiveIntent, we had a lot of enquiries from people at companies who were unsure whether to build their own recommendation engine, plug in a lightweight recommendations solution, or dedicate some time to implementing "personalisation" properly. Our advice usually consists of three main points:

- Focus on your goals – will spending too much time building a recommendation engine take your development cycle off track?
- The importance of technology – throwing a few lines of JavaScript code on a site and manually uploading data feeds might be sufficient for the time being, but it will restrict you from innovating with recommendations.
- Don't underestimate performance – can you support 99.95% uptime with multiple redundancy systems, 60-millisecond response times, peak loads of more than 100 transactions per second, and more?

However, there are many different variations, and they fall into two main camps: Recommendations and Personalisation.

Building a recommendation engine, foursquare style
Mar 22nd

Last summer, foursquare's employee count had grown a bit beyond our office capacity (as we surged towards 20 employees), and we had people sitting in whatever open space we could find. We were split between floors, parked on folding tables, and crammed into couches and loveseats. In one of those seats, @anoopr was playing around with building a map showing interesting places, which we called "Explore." After that initial discussion, we quickly set up an API endpoint for Explore and started adding and tweaking features. With the results we were seeing, we could already sense that Explore was going to become something awesome.

[Image: our mobile web test client]

At this point, it was time to build some personalization into the algorithm. One of the hardest parts of building this was determining what the algorithm should do. While we're keeping the new "cold start" algorithm as part of our secret sauce, we wanted to give you a closer look into the data that fed the ranking.

What's next?

Introducing Cascalog: a Clojure-based query language for Hadoop

I'm very excited to be releasing Cascalog as open source today. Cascalog is a Clojure-based query language for Hadoop inspired by Datalog.

Highlights:
- Simple – functions, filters, and aggregators all use the same syntax.

OK, let's jump into Cascalog and see what it's all about!

Basic queries. First, let's start the REPL and load the playground:

    lein repl
    user=> (use 'cascalog.playground) (bootstrap)

This will import everything we need to run the examples.

    user=> (?<- (stdout) [?person] (age ?person 25))

This query can be read as "Find all ?person whose age is 25." OK, let's try something more involved:

    user=> (?<- (stdout) [?person] (age ?person ?age) (< ?age 30))

That's pretty simple too. Let's run that query again, but this time include the ages of the people in the results:

    user=> (?<- (stdout) [?person ?age] (age ?person ?age) (< ?age 30))

All we had to do was add the ?age variable to the vector of output variables. Let's do another query and find all the male people that Emily follows:

    user=> (?<- (stdout) [?person] (follows "emily" ?person) (gender ?person "m"))

You may not have noticed, but there's actually a join happening in this query: the follows and gender datasets are joined on ?person.

Structure of a query. Let's look at the structure of a query in more detail:

    user=> (?<- (stdout) [?person] (age ?person 25))

The query operator we've been using is ?<-. Predicates such as (age ?person ?age), (< ?age 30), and (* 4 ?age) all share the same syntax.

JAGS - Just Another Gibbs Sampler

Database Access with Hadoop

Editor's note (added Nov. 9, 2013): Valuable data in an organization is often stored in relational database systems. To access that data, you could use external APIs as detailed in this blog post, or you could use Apache Sqoop, an open source tool (packaged inside CDH) that allows users to import data from a relational database into Apache Hadoop for further processing. Sqoop can also export those results back to the database for consumption by other clients.

Apache Hadoop's strength is that it enables ad-hoc analysis of unstructured or semi-structured data. This blog post explains how DBInputFormat works and provides an example of using DBInputFormat to import data into HDFS.

DBInputFormat and JDBC. First we'll cover how DBInputFormat interacts with databases. Reading tables with DBInputFormat: DBInputFormat is an InputFormat class that allows you to read data from a database over JDBC. Configuring the job: to use the DBInputFormat, you'll need to configure your job, as in the sketch below. Retrieving the data
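
Here is a minimal, hedged sketch of both steps using the newer org.apache.hadoop.mapreduce API (the post itself may use the older JobConf-based API); the employees table, its name and salary columns, and the JDBC connection settings are hypothetical placeholders:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // Each row read by DBInputFormat is deserialized into a class that
    // implements DBWritable (for JDBC) and Writable (for Hadoop serialization).
    public class EmployeeRecord implements Writable, DBWritable {
        private String name;   // hypothetical column
        private long salary;   // hypothetical column

        // Called by DBInputFormat with one row of the JDBC result set.
        public void readFields(ResultSet rs) throws SQLException {
            name = rs.getString("name");
            salary = rs.getLong("salary");
        }

        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setString(1, name);
            stmt.setLong(2, salary);
        }

        public void readFields(DataInput in) throws IOException {
            name = in.readUTF();
            salary = in.readLong();
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(name);
            out.writeLong(salary);
        }

        public static Job configureJob() throws IOException {
            Configuration conf = new Configuration();
            // JDBC driver class, connection URL, and credentials are placeholders.
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://localhost/mydb", "user", "password");
            Job job = Job.getInstance(conf, "db-import");
            job.setInputFormatClass(DBInputFormat.class);
            // Read the name and salary columns of the employees table;
            // orderBy gives the input splits a stable ordering.
            DBInputFormat.setInput(job, EmployeeRecord.class,
                    "employees", null /* conditions */, "name" /* orderBy */,
                    "name", "salary");
            return job;
        }
    }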

Doubly linked list

[Image: a doubly linked list whose nodes contain three fields – an integer value, a link to the next node, and a link to the previous node.]

The two node links allow traversal of the list in either direction. While adding or removing a node in a doubly linked list requires changing more links than the same operations on a singly linked list, the operations are simpler and potentially more efficient (for nodes other than first nodes) because there is no need to keep track of the previous node during traversal, nor to traverse the list to find the previous node so that its link can be modified.

Nomenclature and implementation

Basic algorithms

Open doubly-linked lists

    record DoublyLinkedNode {
        prev   // A reference to the previous node
        next   // A reference to the next node
        data   // Data or a reference to data
    }

    record DoublyLinkedList {
        DoublyLinkedNode firstNode   // points to first node of list
        DoublyLinkedNode lastNode    // points to last node of list
    }

Traversing the list
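
To make the traversal section concrete, here is a small Java sketch of the same structure with traversal in both directions; the class names and the System.out printing are illustrative choices, not part of the original article:

    // Node holds an integer value plus links to both neighbours,
    // mirroring the DoublyLinkedNode record above.
    class Node {
        int value;
        Node prev;
        Node next;

        Node(int value) {
            this.value = value;
        }
    }

    class DoublyLinkedList {
        Node firstNode;   // head of the list
        Node lastNode;    // tail of the list

        // Forward traversal: start at the head and follow next links.
        void traverseForward() {
            for (Node node = firstNode; node != null; node = node.next) {
                System.out.println(node.value);
            }
        }

        // Backward traversal: start at the tail and follow prev links.
        void traverseBackward() {
            for (Node node = lastNode; node != null; node = node.prev) {
                System.out.println(node.value);
            }
        }
    }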

MapReduce Patterns, Algorithms, and Use Cases « Highly Scalable Blog

In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting.

MapReduce Framework

Counting and Summing. Problem statement: there are a number of documents, where each document is a set of terms. Solution: let's start with something really simple – have the Mapper emit a counter of 1 for every term occurrence and have the Reducer sum them. The obvious disadvantage of this approach is the high number of dummy counters emitted by the Mapper. In order to accumulate counters not only for one document but for all documents processed by one Mapper node, it is possible to leverage Combiners, as in the sketch below. Applications: log analysis, data querying.

Collating. Problem statement: there is a set of items and some function of one item. The solution is straightforward. Applications: inverted indexes, ETL.

Distributed Task Execution
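
Returning to the Counting and Summing pattern above, here is a minimal Hadoop sketch of the Combiner-based variant; the class names are mine, not the article's. The same Reducer is registered as the Combiner, so partial sums are computed on each Mapper node before the shuffle:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TermCount {

        // Emits (term, 1) for every term occurrence: the "dummy counters"
        // the article mentions.
        public static class TermMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text term = new Text();

            @Override
            protected void map(LongWritable offset, Text document, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(document.toString());
                while (tokens.hasMoreTokens()) {
                    term.set(tokens.nextToken());
                    context.write(term, ONE);
                }
            }
        }

        // Sums the counters for each term. Registered both as the Combiner
        // (pre-aggregation per Mapper node) and as the final Reducer.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(term, new IntWritable(sum));
            }
        }

        // In the driver:
        //   job.setMapperClass(TermMapper.class);
        //   job.setCombinerClass(SumReducer.class);  // per-node pre-aggregation
        //   job.setReducerClass(SumReducer.class);
    }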
