4 free data tools for journalists (and snoops) - O'Reilly Radar

Note: The following is an excerpt from Pete Warden’s free ebook “Where are the bodies buried on the web? Big data for journalists.”

There’s been a revolution in data over the last few years, driven by an astonishing drop in the price of gathering and analyzing massive amounts of information. The technology is also getting easier to use. What does this mean for journalists?

Many of you will already be familiar with WHOIS, but it’s so useful for research it’s still worth pointing out. You can also enter numerical IP addresses here and get data on the organization or individual that owns that server.

Blekko

The newest search engine in town; one of Blekko’s selling points is the richness of the data it offers. The first tab shows other sites that are linking to the current domain, in popularity order. The other handy tab is “Crawl stats,” especially the “Cohosted with” section: this tells you which other websites are running from the same machine.

bit.ly

Then click on the ‘Info Page+’ link:
Maintenance Management

Management of Maintenance Complexity Across a Global Footprint

Verisae optimizes facility management and equipment maintenance departments by improving operational efficiency and cutting costs. Verisae offers a comprehensive software solution that helps organizations monitor, measure, track, and manage their facility and equipment maintenance processes. Verisae's Computerized Maintenance Management System (CMMS) lets organizations consolidate many facility and equipment maintenance management processes on a single software and services platform, delivering the greatest value in the shortest time with the least resource usage.

Top Global Retailers Use Verisae's CMMS Software

Verisae’s facility management and equipment maintenance software is used by some of the largest retailers in the world. The Verisae CMMS system actively tracks over three million individual assets across more than 28,000 sites worldwide.
Document Management System | Open Source DMS - OpenKM

tf–idf

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Motivation

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow".

Mathematical details

tf–idf is the product of two statistics, term frequency and inverse document frequency. The inverse document frequency is a measure of whether the term is common or rare across all documents; it is obtained by dividing the total number of documents N by the number of documents in the corpus D that contain the term t, and then taking the logarithm of that quotient:

    idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

Then tf–idf is calculated as

    tfidf(t, d, D) = tf(t, d) × idf(t, D)

Example of tf–idf

Idf is a bit more involved:
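The definitions above can be made concrete with a short, self-contained sketch; plain Python with no libraries, and a toy corpus invented for illustration:

```python
import math

def tf(term, doc):
    # Raw term frequency: how often the term occurs in the document
    # (doc is a list of words).
    return doc.count(term)

def idf(term, docs):
    # Inverse document frequency: log of total documents over the
    # number of documents containing the term (assumes the term
    # appears in at least one document).
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the brown cow jumped over the moon".split(),
    "the quick brown fox".split(),
    "the cat sat on the mat".split(),
]

# "the" appears in every document, so its idf (hence tf-idf) is zero;
# "cow" is rare, so it scores high in the document that contains it.
print(tf_idf("the", docs[0], docs))  # 0.0
print(round(tf_idf("cow", docs[0], docs), 3))
```

This is exactly the behavior the motivation section describes: the common word "the" is weighted down to nothing, while "brown" and "cow" keep their discriminating power.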
SCADA - Supervisory Control and Data Acquisition

Kea

Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author.

KEA is an algorithm for extracting keyphrases from text documents. KEA is implemented in Java and is platform independent.

In real life, the Kea is one of New Zealand's native parrots, famed for theft, destroying cars and cameras, forming street gangs, pecking sheep to death for their delicious kidney fat, and other cutesy antics.

Thanks to Gordon Paynter, who created the original version of this site.

Digital Libraries and Machine Learning Labs
Computer Science Department
The University of Waikato
Private Bag 3105
Hamilton, New Zealand
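KEA itself is a supervised, trained system, so the real thing is more involved; as a much-simplified illustration of what keyphrase extraction entails, the sketch below (Python, with an invented stopword list and a tf-idf-style score standing in for KEA's learned model) generates candidate phrases and ranks them:

```python
import math
from collections import Counter

# A tiny, invented stopword list; real systems use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def candidates(text, max_len=3):
    # Candidate keyphrases: 1- to 3-word sequences that neither begin
    # nor end with a stopword (a simplification of KEA's candidate rules).
    words = text.lower().split()
    out = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            phrase = words[i:j]
            if phrase[0] not in STOPWORDS and phrase[-1] not in STOPWORDS:
                out.append(" ".join(phrase))
    return out

def rank_keyphrases(doc, corpus, top=5):
    # Score each candidate by tf x idf over a reference corpus,
    # with +1 smoothing so unseen phrases do not divide by zero.
    cand = Counter(candidates(doc))
    n = len(corpus)
    scored = {}
    for phrase, tf in cand.items():
        df = sum(1 for d in corpus if phrase in d.lower())
        scored[phrase] = tf * math.log((n + 1) / (df + 1))
    return sorted(scored, key=scored.get, reverse=True)[:top]

doc = "keyphrase extraction assigns keyphrases to each document"
corpus = [doc, "the cat sat on the mat", "a dog ran in the park"]
print(rank_keyphrases(doc, corpus))
```

KEA additionally uses features such as the position of a phrase's first occurrence and a trained Naive Bayes model to decide which candidates are true keyphrases; this sketch only shows the candidate-generation and scoring skeleton.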
ActiveWarehouse: Extract-Transform-Load Tool

The ActiveWarehouse ETL component provides a means of getting data from multiple data sources into your data warehouse. The links in the side bar provide additional information on ETL. Here’s how to get rolling:

Install the Gem

Get to your command line and type sudo gem install activewarehouse-etl on Linux or OS X, or gem install activewarehouse-etl on Windows. ActiveWarehouse ETL depends on ActiveSupport, ActiveRecord, adapter_extensions, and FasterCSV. You can also download the packages in Zip, Gzip, or Gem format from the ActiveWarehouse files section on RubyForge.

Create Control Files

Create the ETL control files.

Execute the etl command

Execute the etl command, passing the control file name as the argument.

Right now the ETL component has the following functionality:

- Fixed-width and delimited file parsing
- File and database source
- File and database destination
- Virtual source fields, which can be populated via output from Ruby code
- Support for pre- and post-processing code
- Transform pipeline
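ActiveWarehouse control files are written in Ruby, and their exact DSL is documented on the project site; as a language-neutral illustration of the extract-transform-load pipeline the tool implements, here is a minimal Python sketch (the field names and sample data are invented):

```python
import csv
import io

def extract(source):
    # Extract: parse a delimited source into a list of row dicts.
    return list(csv.DictReader(source))

def transform(rows):
    # Transform: normalize a field and add a derived field, analogous
    # to ActiveWarehouse's virtual source fields populated by Ruby code.
    for row in rows:
        row["name"] = row["name"].strip().title()
        row["full_record"] = f'{row["name"]} <{row["email"]}>'
    return rows

def load(rows, destination):
    # Load: write the transformed rows to the destination.
    writer = csv.DictWriter(
        destination, fieldnames=["name", "email", "full_record"]
    )
    writer.writeheader()
    writer.writerows(rows)

# In-memory stand-ins for a source file and a warehouse destination.
source = io.StringIO("name,email\n ada lovelace ,ada@example.com\n")
dest = io.StringIO()
load(transform(extract(source)), dest)
print(dest.getvalue())
```

A real control file would additionally declare parsers (fixed-width or delimited), database sources and destinations, and pre- and post-processing hooks, matching the feature list above.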
How Energy Firms Benefit from Streaming Analytics | Vitria

Energy & Utilities Firms

Streaming Analytics for Energy Firms – Why it’s Important

Most energy and utilities firms have deployed or plan to deploy smart grids and smart meters to receive more data about their distribution networks as well as the consumption patterns of their customers. One of the biggest challenges these firms face, however, is collecting, correlating, and analyzing streaming Big Data in real time so they can proactively respond to situations that might pose a threat, present a revenue opportunity, or affect a customer’s experience. Traditional business intelligence (BI) and data warehousing approaches that rely on persistent data and batch-oriented analysis introduce far too much latency to deliver insights in a timely manner.

Introducing Vitria OI for Streaming Analytics

With Vitria OI, energy and utilities firms benefit from:
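Vitria OI's actual API is proprietary and not shown here; to make the contrast with batch-oriented analysis concrete, here is a minimal Python sketch of the streaming pattern: a sliding window over hypothetical smart-meter readings flags consumption spikes as each reading arrives, rather than waiting for a nightly batch job.

```python
from collections import deque

class SpikeDetector:
    # Keep a sliding window of recent meter readings and flag any new
    # reading that exceeds the window average by a threshold factor.
    def __init__(self, window=5, factor=2.0):
        self.readings = deque(maxlen=window)
        self.factor = factor

    def observe(self, kwh):
        spike = (
            len(self.readings) == self.readings.maxlen
            and kwh > self.factor * (sum(self.readings) / len(self.readings))
        )
        self.readings.append(kwh)
        return spike

detector = SpikeDetector()
stream = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 1.0]  # hypothetical kWh readings
spikes = [kwh for kwh in stream if detector.observe(kwh)]
print(spikes)  # the 5.0 reading is flagged the moment it arrives
```

The point is latency: the anomaly is detected while the event is still actionable, whereas a warehouse query over persisted data would surface it hours later.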
Automatic key extraction - OpenKM Documentation

OpenKM uses KEA to extract keyphrases from text documents. By default, KEA can be used either for free indexing or for indexing with a controlled vocabulary, but OpenKM requires a controlled vocabulary. OpenKM's automatic keyphrase extraction is based on KEA 5.0. For KEA to run in OpenKM, a properly configured vocabulary (Thesaurus) must be in place: KEA trains a model that uses the Thesaurus as its controlled vocabulary. For details on configuring the OpenKM Thesaurus, see the Thesaurus section of the installation guide. To create a KEA model, check out the openkm and thesaurus modules: select the svn type and enter the URL for openkm, then select the svn type and enter the URL for thesaurus. The KEA web page offers a download that includes examples of how to create a KEA model. Setting the SKOS file: kea.thesaurus.skos.file=file.rdf
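The SKOS file referenced by kea.thesaurus.skos.file is an RDF/XML vocabulary. As a rough illustration only (the concept names and URIs below are hypothetical, not part of OpenKM or KEA), a minimal entry in such a file might look like:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- One concept from a hypothetical document-management vocabulary -->
  <skos:Concept rdf:about="http://example.com/thesaurus#invoice">
    <skos:prefLabel>invoice</skos:prefLabel>
    <skos:altLabel>bill</skos:altLabel>
    <skos:related rdf:resource="http://example.com/thesaurus#payment"/>
  </skos:Concept>
</rdf:RDF>
```

KEA matches candidate phrases in a document against the prefLabel and altLabel entries of such a vocabulary, which is why OpenKM insists on a well-configured Thesaurus before extraction will work.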
MALLET homepage

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. Many of the algorithms in MALLET depend on numerical optimization.
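MALLET itself is a Java toolkit with its own API; as a rough illustration of the multinomial Naïve Bayes bag-of-words classification it offers, here is a tiny self-contained Python sketch with add-one smoothing (the training texts are invented):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    # Multinomial Naive Bayes over bag-of-words features,
    # with add-one (Laplace) smoothing.
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.doc_counts = Counter()              # label -> document count
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            n_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train("win cash prize now", "spam")
nb.train("meeting agenda attached", "ham")
print(nb.classify("cash prize inside"))  # spam
```

MALLET wraps this same "text to features, then probabilistic model" pipeline in efficient, production-quality Java, alongside the evaluation metrics mentioned above.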