Apache Tika - Apache Tika

State of Adversarial Stylometry: can you change your prose-style? Today at the Chaos Communication Congress in Berlin (28C3), Sadia Afroz and Michael Brennan presented a talk called "Deceiving Authorship Detection," about research from Drexel University on "Adversarial Stylometry," the practice of identifying the authors of texts who don't want to be identified, and the process of evading detection. Stylometry has made great and well-publicized advances in recent years (and it made the news with scandals like "Gay Girl in Damascus"), but typically this has been against authors who have not taken active, computer-assisted countermeasures to disguise their distinctive "voice" in prose. As part of the presentation, the Drexel team released Anonymouth, a free/open tool that partially automates the process of evading authorship detection. The tool is still a rough alpha, and it requires human intervention to oversee the texts it produces, but it is still an exciting move in adversarial stylometry tools. Privacy, Security and Automation Lab

Index Microsoft Office Files with Lucene | acidum.de Christoph Hartmann on January 7th, 2009 Within my current research project I faced the challenge of indexing a whole bunch of files. To be platform independent, the Java programming language was the first choice. Then I came across the Lucene project. Lucene is an open-source project that “provides Java-based indexing and search technology”. I looked at two projects, Tika and Aperture. While Tika is not available as a binary download, Aperture is. Just download the Tika source code via svn checkout tika and use Maven to install the binary into your local Maven repository. The following part does the core binding between Tika and Lucene.

    logger.debug("Indexing " + file);
    try {
        Document doc = null;
        // parse the document
        synchronized (contentParserAccess) {
            doc = contentParser.getDocument(file);
        }
        // put it into Lucene
        if (doc !

The ContentParser calls the TikaParser for each file and puts the metadata it returns into a Lucene document. A custom Tika parser may look like:

Apache Harmony - Open Source Java Platform

Adam Parrish · Getting data from the web Python: hidden details In the interest of brevity, we’ve skipped over some fairly important details of Python. Here’s our chance to play catch-up. Other kinds of loops; loop control The for loop is far and away the most common loop in Python. But there’s another kind of loop that you’ll encounter frequently: the while loop.

    >>> i = 0
    >>> while i < 10:
    ...     i += 1
    ...     print i
    ...
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10

Python also has two loop control statements.

    >>> i = 0
    >>> while i < 10:
    ...     i += 1
    ...     if i % 2 == 1:
    ...         continue
    ...     print i
    ...
    2
    4
    6
    8
    10

The continue statement causes Python to skip back to the top of the loop; the remaining statements aren't executed. Finally, we have break, which causes Python to drop out of the loop altogether.

    >>> i = 0
    >>> while i < 10:
    ...     i += 1
    ...     if i > 5:
    ...         break
    ...     print i
    ...
    1
    2
    3
    4
    5

Here, as soon as i reaches a value greater than 5, the break statement gets executed, and Python stops executing the loop.

Tuples · from module import stuff · File objects · URLs

Lucene - Index File Formats Index File Formats This document defines the index file formats used in Lucene version 3.0. If you are using a different version of Lucene, please consult the copy of docs/fileformats.html that was distributed with the version you are using. Apache Lucene is written in Java, but several efforts are underway to write versions of Lucene in other programming languages. As Lucene evolves, this document should evolve. Compatibility notes are provided in this document, describing how file formats have changed from prior versions. In version 2.1, the file format was changed to allow lock-less commits (i.e., no more commit lock). In version 2.3, the file format was changed to allow segments to share a single set of doc store (vectors & stored fields) files. Definitions The fundamental concepts in Lucene are index, document, field and term. An index contains a sequence of documents. A document is a sequence of fields. The same string in two different fields is considered a different term. Segments
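
The index, document, field and term concepts can be illustrated with a small sketch. This is not Lucene's API or file format: the function name build_inverted_index, the whitespace tokenizer and the sample documents are all assumptions made for illustration. It only shows why the same string in two different fields counts as two different terms, by keying each term on a (field name, text) pair.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each (field, text) term to the sorted document numbers containing it.

    Hypothetical helper for illustration; not part of Lucene.
    """
    postings = defaultdict(set)
    # An index contains a sequence of documents, numbered from 0 here.
    for doc_id, doc in enumerate(documents):
        # A document is a sequence of (field name, field text) pairs.
        for field_name, text in doc:
            # Naive lowercase whitespace tokenization (an assumption).
            for token in text.lower().split():
                postings[(field_name, token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = [
    [("title", "Lucene index"), ("body", "file formats")],
    [("title", "file formats"), ("body", "Lucene segments")],
]
index = build_inverted_index(docs)

# Same string, different fields: two distinct terms with separate postings.
print(index[("title", "lucene")])
print(index[("body", "lucene")])
```

Looking up ("title", "lucene") and ("body", "lucene") returns different document lists, because the field name is part of the term's identity.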

Mastering Google Analytics Custom Variables I’ve got a stack of posts that I want to write, and realized that they all deal with Custom Variables. So, to make sure that we’re all on the same page when it comes to custom vars, here’s my guide to Mastering Google Analytics Custom Variables. For those of you that have not used custom variables, CVs are a way for you to insert custom data into Google Analytics. There are 4 parts to a custom variable: 1. Name & Value Custom variables are name-value pairs of data. Google Analytics will show you a list of all the custom variable names and then let you drill down into the list and see all of the values. Here’s an example. Then I can click on “Year” to get a list of all the values: Custom variables can also be used in custom reports and advanced segments. Index or Slot The index is a way to organize your custom variables. You can technically have more than 5 custom variables, but we need to discuss the next concept, scope, and how it impacts the index. Scope The Code Super Nerd Stuff

IndexWriterConfig (Lucene 4.6.0 API) Expert: set the interval between indexed terms. Large values cause less memory to be used by IndexReader, but slow random-access to terms. Small values cause more memory to be used by an IndexReader, and speed random-access to terms. This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access. Takes effect immediately, but only applies to newly flushed/merged segments. NOTE: This parameter does not apply to all PostingsFormat implementations, including the default one in this release. Note that other implementations may have their own parameters, or no parameters at all.
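
The trade-off described above can be worked through with simple arithmetic. The sketch below is not a Lucene call: term_index_cost is a hypothetical helper, and the term count and interval values are made-up inputs. It only applies the two figures from the text, numUniqueTerms/interval terms held in memory and interval/2 terms scanned on average per random term access.

```python
def term_index_cost(num_unique_terms, interval):
    """Return (terms held in memory, average terms scanned per lookup).

    Hypothetical helper applying the two formulas from the description:
    numUniqueTerms / interval and interval / 2.
    """
    terms_in_memory = num_unique_terms // interval
    average_scan = interval / 2
    return terms_in_memory, average_scan


# Made-up example: an index with 1,280,000 unique terms.
for interval in (32, 128, 512):
    in_memory, scanned = term_index_cost(1_280_000, interval)
    print(f"interval={interval}: {in_memory} terms in memory, "
          f"about {scanned:.0f} terms scanned per lookup")
```

Raising the interval from 32 to 512 cuts the number of terms held in memory by a factor of 16 and raises the average scan length by the same factor, which is the memory-versus-speed trade-off the parameter controls.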

Build a Better Sub-$200 Linux PC No one who expected the languid economy to have fully revived by now can be cheered by the way things have gone this summer; the volatile stock market alone has been a constant dispenser of heartache. So if you’re in need of a computer, even just a small one to do basic, everyday things, you may have put it off because of the uncertainty currently surrounding, well, everything. But it’s possible to build a PC yourself for an obscenely low cash outlay, less than you'd spend on pretty much any full system on the market. In fact, you can even do it for as little as $200. And no, that’s not a typo. We first proved this last year, back when it looked like the economy’s most turbulent days were behind it. The answer to the first question was a no-brainer: absolutely. It was also obvious that our new desktop would be superior in terms of performance. As for whether we could spend a lot less this year than we could in 2010...