K-Means Clustering - Apache Mahout

K Means Clustering with Tf-idf Weights | Blog | Jonathan Zong Unsupervised learning algorithms in machine learning impose structure on unlabeled datasets. In Prof. Andrew Ng's inaugural ml-class from the pre-Coursera days, the first unsupervised learning algorithm introduced was k-means, which I implemented in Octave for programming exercise 7. Now, after the fact but with a fresh perspective and more experience, I will revisit the k-means algorithm in Java to implement text clustering. K-means is an algorithm designed to find coherent groups of data, a.k.a. clusters. Tf-idf Weighting Before being able to run k-means on a set of text documents, the documents have to be represented as mutually comparable vectors. Cosine Similarity Now that we're equipped with a numerical model with which to compare our data, we can represent each document as a vector of terms using a global ordering of each unique term found throughout all of the documents, making sure first to clean the input. k-means
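The pipeline described in the excerpt above (tf-idf weighting, cosine similarity, then k-means) can be sketched as follows. This is a minimal illustration in Python rather than the Java of the original post, and it is not the author's implementation: the tf-idf variant (`tf * log(N / df)`), the deterministic choice of the first `k` documents as initial centroids, the sample documents, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Simple stand-in for input cleaning: lowercase, keep alphabetic runs only.
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def tfidf_vectors(docs):
    # Represent each document over a global, sorted ordering of unique terms.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in vocab}  # assumed idf variant
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kmeans(vectors, k, max_iterations=20):
    # Assumption: seed centroids with the first k vectors so runs are reproducible.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each document goes to its most similar centroid.
        new_assignments = [
            max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no document changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Hypothetical sample documents, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat and a kitten sat",
    "markets and stock prices rose",
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, 2)
print(labels)
```

Note that this sketch assigns documents by cosine similarity rather than Euclidean distance, and that a fixed seeding is used only to keep the example deterministic; random or k-means++ initialization is the more common choice in practice.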

Introducing Apache Mahout Scalable, commercial-friendly machine learning for building intelligent applications Grant Ingersoll Published on September 08, 2009 Increasingly, the success of companies and individuals in the information age depends on how quickly and efficiently they turn vast amounts of data into actionable information. Whether it's for processing hundreds or thousands of personal e-mail messages a day or divining user intent from petabytes of weblogs, the need for tools that can organize and enhance data has never been greater. Therein lies the premise and the promise of the field of machine learning and the project this article introduces: Apache Mahout (see Related topics). Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous experiences. After giving a brief overview of machine-learning concepts, I'll introduce you to the Apache Mahout project's features, history, and goals. Machine learning 101 Features

Lecture 6: Collaborative Filtering / Information Extraction Tao Yang's Lecture ExpertRank: Ranking system for Ask.com. See US Patent Application 7028026 by Tao Yang, Wei Wang, and Apostolos Gerasoulis. Retrieve documents from inverted file. Cluster documents by content and by link structure. Apply a hub/authority analysis to each cluster. Required Reading: Chakrabarti, sec 4.5. Evaluating collaborative filtering recommender systems by Jonathan Herlocker, Joseph Konstan, Loren Terveen, and John Riedl, ACM Transactions on Information Systems, vol. 22, no. 1, 2004, pp. 5-53. Unsupervised Named-Entity Extraction from the Web. Additional Reading Amazon.com Recommendations: Item-to-Item Collaborative Filtering by Greg Linden, Brent Smith, and Jeremy York, IEEE Internet Computing, January-February 2003. Collaborative Filtering Example: Terms and Documents We say that document D is relevant to query term T if D contains T. Example: Personal preferences General issues in either of these: 1.

Clustering Snippets With Carrot2 | Index Data We’ve been investigating ways we might add result clustering to our metasearch tools. Here’s a short introduction to the topic and to an open source platform for experimenting in this area. Clustering Using a search interface that just takes some keywords often leads to miscommunication. To aid the user in narrowing results to just those applicable to the context they’re thinking about, a good deal of work has been done in the area of “clustering” searches. One common way to represent a document, both for searching and data mining, is the vector space model. This kind of bag-of-words model is very useful for separating documents into groups. Another differentiator among clustering algorithms is when the clustering happens, before or after search. Similarly, we can leverage another part of the search system: snippet generation. Carrot2 Suffix Tree Clustering (STC) is one of the first feasible snippet-based document clustering algorithms, proposed in 1998 by Zamir and Etzioni. Lingo

hunch Part I slides (Powerpoint) Introduction Part II.a slides (Powerpoint) Tree Ensembles Part II.b slides (Powerpoint) Graphical models Part III slides (Summary + GPU learning + Terascale linear learning) This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Presenters Ron Bekkerman is a senior research scientist at LinkedIn, where he develops machine learning and data mining algorithms to enhance LinkedIn products. Misha Bilenko is a researcher in the Machine Learning and Intelligence group at Microsoft Research, which he joined in 2006 after receiving his PhD from the University of Texas at Austin. John Langford is a senior researcher at Yahoo!

Geeking with Greg Using REST to Invoke the API - Custom Search The JSON/Atom Custom Search API lets you develop websites and applications to retrieve and display search results from Google Custom Search programmatically. With this API, you can use RESTful requests to get either web search or image search results in JSON or Atom format. Data format The JSON/Atom Custom Search API can return results in one of two formats. There are also two external documents that are helpful resources for using this API: Google WebSearch Protocol (XML): The JSON/Atom Custom Search API provides a subset of the functionality provided by the XML API, but it instead returns data in JSON or Atom format. OpenSearch 1.1 Specification: This API uses the OpenSearch specification to describe the search engine and provide data regarding the results. Prerequisites Search engine ID By calling the API, the user issues requests against an existing instance of a Custom Search Engine. API key The JSON/Atom Custom Search API requires the use of an API key. Pricing
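As a rough illustration of the RESTful request described above, the snippet below only builds a request URL and does not send it. The endpoint and the `key`, `cx`, and `q` parameter names are taken from Google's public Custom Search documentation as best recalled and should be checked against the current reference; the helper name `build_search_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint for the Custom Search JSON API; verify against current docs.
BASE_URL = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, engine_id, query):
    # key: the API key, cx: the search engine ID, q: the search terms.
    params = {"key": api_key, "cx": engine_id, "q": query}
    return BASE_URL + "?" + urlencode(params)

# Placeholder values; substitute a real API key and search engine ID.
url = build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "k-means clustering")
print(url)
```

Issuing a GET request to a URL of this shape would require a valid API key and an existing Custom Search Engine, as the prerequisites above note.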

Learning From Data MOOC - The Lectures Taught by Feynman Prize winner Professor Yaser Abu-Mostafa. The fundamental concepts and techniques are explained in detail. The focus of the lectures is real understanding, not just "knowing." Lectures use incremental viewgraphs (2853 in total) to simulate the pace of blackboard teaching. The Learning Problem - Introduction; supervised, unsupervised, and reinforcement learning. Is Learning Feasible? The Linear Model I - Linear classification and linear regression. Error and Noise - The principled choice of error measures. Training versus Testing - The difference between training and testing in mathematical terms. Theory of Generalization - How an infinite model can learn from a finite sample. The VC Dimension - A measure of what it takes a model to learn. Bias-Variance Tradeoff - Breaking down the learning performance into competing quantities. The Linear Model II - More about linear models. Neural Networks - A biologically inspired model. Validation - Taking a peek out of sample.

database - How to create my own recommendation engine Free Search API Are you looking for an alternative to Google Web Search API (deprecated), Yahoo Boss (commercial) or Bing Web Search API (commercial)? Try our FREE Web Search API! Prohibitive search infrastructure costs and high-priced search APIs are market entry barriers for innovative services and start-ups. The dramatic cost advantage of our unique p2p technology allows us to provide a Free Search API. With 1 million free queries per month we provide three orders of magnitude more than the incumbents do. An open platform, enabling innovation, competition & diversity in search! Build your own mobile news & search app, news clipping, trend monitoring, competitive intelligence, reputation management, brand monitoring, search engine optimization, plagiarism detection, alternative search engine, research project and more! Web Search More than 2 billion pages indexed. News Search News articles from newspapers, magazines and blogs. Trending News Trending news, grouped by topic. API Key Parameter Return Values

Learning From Data - Online Course (MOOC) A real Caltech course, not a watered-down version on YouTube & iTunes. Free, introductory Machine Learning online course (MOOC) taught by Caltech Professor Yaser Abu-Mostafa. Lectures recorded from a live broadcast, including Q&A. Prerequisites: basic probability, matrices, and calculus. 8 homework sets and a final exam. Discussion forum for participants. Topic-by-topic video library for easy review. Outline This is an introductory course in machine learning (ML) that covers the basic theory, algorithms, and applications. What is learning? Live Lectures This course was broadcast live from the lecture hall at Caltech in April and May 2012. The Learning Problem - Introduction; supervised, unsupervised, and reinforcement learning. Is Learning Feasible? The Linear Model I - Linear classification and linear regression. Error and Noise - The principled choice of error measures. Training versus Testing - The difference between training and testing in mathematical terms.
