Using REST to Invoke the API - Custom Search

The JSON/Atom Custom Search API lets you develop websites and applications that retrieve and display search results from Google Custom Search programmatically. With this API, you can use RESTful requests to get web search or image search results in JSON or Atom format.

Data format
The JSON/Atom Custom Search API can return results in one of two formats: JSON or Atom. Two external documents are also helpful resources for using this API:
- Google WebSearch Protocol (XML): the JSON/Atom Custom Search API provides a subset of the functionality of the XML API, but returns data in JSON or Atom format instead.
- OpenSearch 1.1 Specification: this API uses the OpenSearch specification to describe the search engine and provide metadata about the results.

Prerequisites
Search engine ID: calls to the API issue requests against an existing instance of a Custom Search Engine, identified by its search engine ID.
API key: the JSON/Atom Custom Search API requires the use of an API key.

Pricing
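Putting the prerequisites together, a minimal request in Python using the requests library might look like the sketch below. The key and cx values are placeholders for your own API key and search engine ID; the endpoint and parameter names follow the API's documented query parameters.

    import requests

    # Placeholders: substitute your own API key and search engine ID
    API_KEY = "YOUR_API_KEY"
    ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"

    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,    # API key (required)
            "cx": ENGINE_ID,   # custom search engine ID (required)
            "q": "lectures",   # the search query
            "alt": "json",     # response format: json (default) or atom
        },
    )
    for item in response.json().get("items", []):
        print(item["title"], item["link"])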
Solving Today's Biggest Problems Requires an Entirely New Approach to Data

Old Data: inaccessible to most; requires specialized skills in math and computer science; dashboards and charts; operational and business intelligence.
New Data: available to all users, whether a business user, scientist, researcher or domain expert; topological networks that show hidden insights; breakthrough outcomes.
The Missing Tools of Open Data

Some reflections on open data tooling, begun while preparing my talk for the event "L'OpenData et nous, et nous, et nous ?", focused more on the developer's point of view and on what would be technically interesting to build.

A "GoogHub" for data
Decentralization requires a centralized index: whether it is Google for the Web of documents or GitHub for DVCS repositories, there has to be a place where you can search across ever more numerous sources. A service is needed to index the Web of data, report on the versioning and freshness of datasets, and perhaps even act as a proxy for part of that data. Ideally, in a Web of linked data, such an index would be less useful because it would suffice to follow the links, but the fact is that today we have open data that is not very linked.

Frameworks for exploiting the data
A monetization platform
K Means Clustering with Tf-idf Weights | Blog | Jonathan Zong

Unsupervised learning algorithms in machine learning impose structure on unlabeled datasets. In Prof. Andrew Ng's inaugural ml-class from the pre-Coursera days, the first unsupervised learning algorithm introduced was k-means, which I implemented in Octave for programming exercise 7. K-means is an algorithm designed to find coherent groups of data, a.k.a. clusters.

Tf-idf Weighting
Before being able to run k-means on a set of text documents, the documents have to be represented as mutually comparable vectors.

Cosine Similarity
Now that we're equipped with a numerical model with which to compare our data, we can represent each document as a vector of terms, using a global ordering of each unique term found throughout all of the documents, making sure first to clean the input.

k-means
In a general sense, k-means clustering works by assigning data points to a cluster centroid, and then moving those cluster centroids to better fit the clusters themselves.
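The post's own implementation was written in Octave for the ml-class exercise. Purely as an illustration of the pipeline it describes (tf-idf term vectors, then k-means), here is a short sketch using scikit-learn rather than the author's code; the toy documents and k = 2 are arbitrary choices.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = [
        "k-means groups similar documents together",
        "tf-idf weighting down-weights very common terms",
        "clustering finds coherent groups in unlabeled data",
        "common terms carry little information about a document",
    ]

    # Represent each document as a tf-idf weighted term vector.
    # TfidfVectorizer L2-normalizes by default, so Euclidean k-means on
    # these vectors behaves consistently with cosine similarity.
    vectors = TfidfVectorizer().fit_transform(documents)

    # Cluster the vectors into k = 2 groups
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    print(km.labels_)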
Opendata & Quality

For quite a while now I have been browsing and observing what gets published online under the name "open data". Of course these are data; of course they are made available; of course there is often a metadata sheet, more or less complete; and there are even portals organizing themselves to catalogue them. In short, these are the ingredients that say these are indeed public data meeting the requirements of a specification. But let's talk about that specification for a moment: an important part of the problem seems to have been forgotten. The dataset itself must be intrinsically of good quality, and that quality does not seem to be clearly defined. Today, the dataset is better and better defined externally. For example, a file produced with a word processor has little chance of being useful in an automated processing pipeline, unless an application has already been built just for that file.
Clustering Snippets With Carrot2 | Index Data

We've been investigating ways we might add result clustering to our metasearch tools. Here's a short introduction to the topic and to an open source platform for experimenting in this area.

Clustering
Using a search interface that just takes some keywords often leads to miscommunication. To aid the user in narrowing results to just those applicable to the context they're thinking about, a good deal of work has been done in the area of "clustering" searches. One common way to represent a document, both for searching and data mining, is the vector space model. This kind of bag-of-words model is very useful for separating documents into groups. Another differentiator among clustering algorithms is when the clustering happens: before or after search. Similarly, we can leverage another part of the search system: snippet generation.

Carrot2
Suffix Tree Clustering (STC) is one of the first feasible snippet-based document clustering algorithms, proposed in 1998 by Zamir and Etzioni.

Lingo
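As a toy illustration of the phrase-sharing intuition behind snippet clustering (this is neither STC nor Lingo, just the underlying idea that snippets sharing a phrase likely belong together), one can group snippets by the word bigrams they have in common; the snippets below are made up.

    from collections import defaultdict

    def bigrams(text):
        # All two-word phrases in a snippet, lowercased
        words = text.lower().split()
        return set(zip(words, words[1:]))

    def cluster_by_shared_phrase(snippets):
        """Group snippets that share at least one two-word phrase."""
        groups = defaultdict(list)
        for snippet in snippets:
            for phrase in bigrams(snippet):
                groups[phrase].append(snippet)
        # Keep only phrases shared by more than one snippet
        return {" ".join(p): s for p, s in groups.items() if len(s) > 1}

    snippets = [
        "open source search platform",
        "an open source platform for clustering",
        "commercial clustering software",
    ]
    for phrase, members in cluster_by_shared_phrase(snippets).items():
        print(phrase, "->", members)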
Open data 71: a quality project, but mixed results
Focus - Tuesday, 28 August 2012

In June 2011, the Department of Saône-et-Loire decided to open up and share the public data it holds, through the "Open data 71" project.

What is the origin of the opening of data in Saône-et-Loire?
The release of data in Saône-et-Loire stems first of all from a very strong political will, linked to the commission of the president of the Conseil général, Arnaud Montebourg, now Minister of Industrial Renewal, who made it an important democratic project. The particularity of the project was to release everything the law allows us to release, without setting any limit on the release of data. The originality of our solution also lies in the concept we put forward: doing open data for everyone, not only for developers or data specialists.

Isn't transparency merely formal when openness is not accompanied by training for citizens?
Free Search API

Are you looking for an alternative to the Google Web Search API (deprecated), Yahoo BOSS (commercial) or the Bing Web Search API (commercial)? Try our FREE Web Search API! Prohibitive search infrastructure costs and high-priced search APIs are market-entry barriers for innovative services and start-ups. The dramatic cost advantage of our unique p2p technology allows us to provide a free Search API. With 1 million free queries per month we provide three orders of magnitude more than the incumbents do. Build your own mobile news & search app, news clipping, trend monitoring, competitive intelligence, reputation management, brand monitoring, search engine optimization, plagiarism detection, alternative search engine, research project and more!

Web Search: more than 2 billion pages indexed.
News Search: news articles from newspapers, magazines and blogs.
Trending News: trending news, grouped by topic.
Trending Topics
API Key: the FAROO API requires an API key.

Parameters
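As a sketch of how such a key-based REST API is typically called from Python, something like the following should be close; note that the endpoint and the parameter names used here (q, start, length, l, src, f, key) are based on FAROO's published documentation as remembered and should be verified against the current API reference, and the key value is a placeholder.

    import requests

    # Parameter names are assumptions based on FAROO's documentation;
    # the key is a placeholder for your own API key
    params = {
        "q": "open data",      # query string
        "start": 1,            # index of the first result
        "length": 10,          # number of results to return
        "l": "en",             # language
        "src": "web",          # source: web, news, topics, trends or suggest
        "f": "json",           # response format
        "key": "YOUR_API_KEY",
    }
    r = requests.get("http://www.faroo.com/api", params=params)
    for result in r.json().get("results", []):
        print(result.get("title"), result.get("url"))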
simple web crawler / scraper tutorial using requests module in python

Let me show you how to use the Requests Python module to write a simple web crawler / scraper. So, let's define our problem first: on a page of my site I publish some programming problems, and I want a script that collects the links (URLs) of those problems. So, let's start. First make sure you can get the content of the page.

    import requests

    def get_page(url):
        r = requests.get(url)
        print r.status_code
        with open("test.html", "w") as fp:
            fp.write(r.text)

    if __name__ == "__main__":
        url = '...'  # the problems-page URL (elided in this excerpt)
        get_page(url)

Now run the program:

    $ python cpbook_crawler.py
    200
    Traceback (most recent call last):
      File "cpbook_crawler.py", line 15, in ...

Hmm... we got an error.

    import re
    import requests

Now run the script:

    $ python cpbook_crawler.py
    []

We got an empty list.

    content = content.replace("\n", '')

You should add this line and run the program again. Now we write the regular expression to get the list of the URLs.
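The post's actual regular expression for the problems page is not included in this excerpt. A generic sketch of that final step might look like the following; the href pattern and the get_problem_links name are placeholders, not the author's.

    import re
    import requests

    def get_problem_links(url):
        # Fetch the page and strip newlines, as the tutorial does
        content = requests.get(url).text.replace("\n", "")
        # Generic pattern for absolute links; the post's own pattern is
        # specific to its problems page and is not shown in the excerpt
        return re.findall(r'href="(http[^"]+)"', content)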
How to make a web crawler in under 50 lines of Python code | 'Net Instructions
September 24, 2011

Interested to learn how Google, Bing, or Yahoo work? Wondering what it takes to crawl the web, and what a simple web crawler looks like? In under 50 lines of Python (version 3) code, here's a simple web crawler! (The full source with comments is at the bottom of this article.) And let's see how it is run. Okay, but how does it work? Let's first talk about what a web crawler's purpose is. A crawler visits pages and collects two things:
- Web page content (the text and multimedia on a page)
- Links (to other web pages on the same website, or to other websites entirely)
Which is exactly what this little "robot" does. Is this how Google works? Sort of. (Your search terms actually visit a number of databases simultaneously, such as spell checkers, translation services, analytic and tracking servers, etc.) Let's look at the code in more detail! The following code should be fully functional for Python 3.x. Magic!
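The full listing from the bottom of the article is not included in this excerpt. Here is a minimal sketch of the same idea in Python 3 (a breadth-first crawl, link extraction with html.parser, and a simple keyword check); the class and function names are my own, not the article's.

    from html.parser import HTMLParser
    from urllib import request
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collects absolute URLs from the href attributes of anchor tags."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def crawl(start_url, word, max_pages=20):
        """Breadth-first crawl from start_url, stopping when `word` is found."""
        to_visit, visited = [start_url], set()
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)
            if url in visited:
                continue
            visited.add(url)
            try:
                html = request.urlopen(url).read().decode("utf-8", errors="ignore")
            except Exception:
                continue                     # skip pages that fail to download
            if word.lower() in html.lower():
                return url                   # found the search term on this page
            parser = LinkCollector(url)
            parser.feed(html)
            to_visit.extend(parser.links)    # queue outgoing links for later visits

    if __name__ == "__main__":
        print(crawl("https://example.com", "domain"))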
How to write a multi-threaded webcrawler in Java

Here you can...
- learn how to write a multithreaded Java application,
- learn how to write a webcrawler,
- along the way, learn how to write code that is object-oriented and reusable,
- or use the provided webcrawler more or less off-the-shelf.
More or less in this case means that you have to be able to make minor adjustments to the Java source code yourself and compile it. This web page discusses the Java classes that I originally wrote to implement a multithreaded webcrawler in Java. You can download the Java source code for the multithreaded webcrawler; this code is in the public domain.

1 Why another webcrawler?
Why would anyone want to program yet another webcrawler? Although wget is powerful, for my purposes (originally: obtaining .wsdl files from the web) I needed a webcrawler that allowed easy customization. Sun's tutorial webcrawler, on the other hand, lacks some important features.

2 Multithreading
Processing items in a queue
Implementation of the queue
Messages
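The Java classes themselves are not reproduced in this excerpt. As a compact illustration of the pattern the page describes, a shared work queue drained by a pool of worker threads, here is a sketch in Python rather than Java; all names (worker, crawl, MAX_PAGES, and so on) are my own, not the author's.

    import re
    import threading
    from queue import Queue
    from urllib import request
    from urllib.parse import urljoin

    NUM_WORKERS = 4        # size of the thread pool
    MAX_PAGES = 50         # stop enqueueing new links after this many URLs
    url_queue = Queue()    # thread-safe work queue shared by all workers
    seen = set()
    seen_lock = threading.Lock()

    def extract_links(base_url, html):
        # Very naive link extraction, enough for this sketch
        return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

    def worker():
        while True:
            url = url_queue.get()            # blocks until work is available
            try:
                html = request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
                print(url, len(html))        # "process" the downloaded page
                for link in extract_links(url, html):
                    with seen_lock:
                        if link in seen or len(seen) >= MAX_PAGES:
                            continue
                        seen.add(link)
                    url_queue.put(link)
            except Exception:
                pass                         # skip pages that fail to download or decode
            finally:
                url_queue.task_done()        # mark this queue item as finished

    def crawl(start_url):
        seen.add(start_url)
        url_queue.put(start_url)
        for _ in range(NUM_WORKERS):
            threading.Thread(target=worker, daemon=True).start()
        url_queue.join()                     # wait until every queued URL has been processed

    if __name__ == "__main__":
        crawl("https://example.com")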