
Natural language processing
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

History

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules.

NLP using machine learning

Major tasks in NLP

Parsing

Cognitive model

A cognitive model is an approximation of animal cognitive processes (predominantly human) for the purposes of comprehension and prediction. Cognitive models can be developed within or without a cognitive architecture, though the two are not always easily distinguishable.

History

Cognitive modeling historically developed within cognitive psychology and cognitive science (including human factors), and has received contributions from the fields of machine learning and artificial intelligence, to name a few.

Box-and-arrow models

A number of key terms are used to describe the processes involved in the perception, storage, and production of speech.

Computational models

A computational model is a mathematical model in computational science that requires extensive computational resources to study the behavior of a complex system by computer simulation.

Symbolic

Expressed in characters, usually nonnumeric, that require translation before they can be used.

Subsymbolic

WordNet

WordNet is a lexical database for the English language.[1] It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and can be downloaded and used freely. The database can also be browsed online.

History and team members

WordNet was created at the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George Armitage Miller.

Database contents

[Figure: example entry "Hamburger" in WordNet]

A sample entry: good, right, ripe – (most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes")
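WordNet can also be queried programmatically. As a minimal sketch, assuming the NLTK Python package is installed and its WordNet corpus has been downloaded (via nltk.download("wordnet")):

```python
# Query WordNet through NLTK's interface (assumes the "wordnet"
# corpus has already been downloaded).
from nltk.corpus import wordnet as wn

# List the synsets (synonym sets) containing the word "good",
# each with its short, general definition.
for synset in wn.synsets("good"):
    print(synset.name(), "-", synset.definition())

# Semantic relations are recorded between synsets; for example,
# the hypernyms ("is-a" generalizations) of the first sense of
# "hamburger".
hamburger = wn.synsets("hamburger")[0]
print(hamburger.hypernyms())
```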

Stochastic process

[Figure: stock market fluctuations have been modeled by stochastic processes.]

In probability theory, a stochastic process /stoʊˈkæstɪk/, or sometimes random process (widely used), is a collection of random variables; this is often used to represent the evolution of some random value, or system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve.

Formal definition and basic properties

Definition

Given a probability space (Ω, F, P) and a measurable space (S, Σ), an S-valued stochastic process is a collection of S-valued random variables on Ω indexed by a totally ordered set T ("time"); that is, a family {X_t : t ∈ T} where each X_t is an (F, Σ)-measurable function from Ω to S.
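One of the simplest examples is a simple random walk, with T = {0, 1, 2, ...} and integer-valued X_t. The sketch below (illustrative, not from the article) samples two paths from the same starting point to show the indeterminacy described above:

```python
# Sample paths of a simple random walk, one of the simplest
# stochastic processes: X_0 = 0 and each step is +1 or -1.
import random

def random_walk(steps, seed=None):
    """Return one sample path of the walk (one outcome omega)."""
    rng = random.Random(seed)
    path = [0]
    for _ in range(steps):
        path.append(path[-1] + rng.choice([-1, +1]))
    return path

# Same initial condition, different evolutions.
print(random_walk(10, seed=1))
print(random_walk(10, seed=2))
```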

Probability matching

Probability matching is a suboptimal decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time and negative examples are observed 40% of the time, then an observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances and a class label of "negative" on 40% of instances. The optimal Bayesian decision strategy (to maximize the number of correct predictions; see Duda, Hart & Stork (2001)) in such a case is to always predict "positive" (i.e., predict the majority category in the absence of other information), which has a 60% chance of winning, rather than matching, which has only a 52% chance of winning: where p is the probability of a positive realization, the expected accuracy of matching is p² + (1 − p)², here 0.6² + 0.4² = 0.52.
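The arithmetic is easy to check numerically. A small sketch (illustrative, not from the article):

```python
# Expected accuracy of probability matching vs. always predicting
# the majority class, for base rate p of positive examples.
import random

def matching_accuracy(p):
    # Correct when an independent Bernoulli(p) prediction agrees
    # with the Bernoulli(p) label: p^2 + (1 - p)^2.
    return p * p + (1 - p) * (1 - p)

def maximizing_accuracy(p):
    # Always predict the majority class.
    return max(p, 1 - p)

p = 0.6
print(matching_accuracy(p))    # 0.52
print(maximizing_accuracy(p))  # 0.6

# Monte Carlo check of the matching strategy.
rng = random.Random(0)
trials = 100_000
hits = sum((rng.random() < p) == (rng.random() < p) for _ in range(trials))
print(hits / trials)           # ~0.52
```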

Automatic summarization

Methods

Methods of automatic summarization include extraction-based, abstraction-based, maximum-entropy-based, and aided summarization.

Extraction-based summarization

Two particular types of summarization often addressed in the literature are keyphrase extraction, where the goal is to select individual words or phrases to "tag" a document, and document summarization, where the goal is to select whole sentences to create a short paragraph summary.

Abstraction-based summarization

Extraction techniques merely copy the information deemed most important by the system to the summary (for example, key clauses, sentences, or paragraphs), while abstraction involves paraphrasing sections of the source document. While some work has been done in abstractive summarization (creating an abstract synopsis like that of a human), the majority of summarization systems are extractive (selecting a subset of sentences to place in a summary).

Maximum entropy-based summarization
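To make the extractive idea concrete, here is a toy sketch (not any particular published system): score each sentence by the average frequency of its words across the document, then keep the top-scoring sentences in their original order.

```python
# Toy extractive summarizer: frequency-scored sentence selection.
import re
from collections import Counter

def extractive_summary(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency; guard against empty sentences.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Keep selected sentences in their original document order.
    return " ".join(s for s in sentences if s in top)
```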

ACT-R

Most of ACT-R's basic assumptions are also inspired by the progress of cognitive neuroscience, and ACT-R can be seen and described as a way of specifying how the brain itself is organized in a way that enables individual processing modules to produce cognition.

Inspiration

What ACT-R looks like

This means that any researcher may download the ACT-R code from the ACT-R website, load it into a Lisp distribution, and gain full access to the theory in the form of the ACT-R interpreter. This also enables researchers to specify models of human cognition in the form of a script in the ACT-R language. The language primitives and data types are designed to reflect the theoretical assumptions about human cognition. Like a programming language, ACT-R is a framework: for different tasks (e.g., Tower of Hanoi, memory for text or for lists of words, language comprehension, communication, aircraft control), researchers create "models" (i.e., programs) in ACT-R.

Brief outline
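ACT-R models themselves are Lisp scripts written against the ACT-R interpreter; the following toy loop (Python, purely illustrative, not ACT-R syntax) only sketches the production-system idea such models build on: rules repeatedly match against the current goal state, and one matching rule fires per cycle.

```python
# Toy production-system cycle, loosely in the spirit of ACT-R.
# Real ACT-R models are Lisp scripts with modules, buffers, chunks.
goal = {"task": "count", "current": 1, "target": 4}

def increment(g):
    # Production: while counting below the target, advance the count.
    if g["task"] == "count" and g["current"] < g["target"]:
        print("count:", g["current"])
        g["current"] += 1
        return True
    return False

def stop(g):
    # Production: when the target is reached, end the task.
    if g["task"] == "count" and g["current"] == g["target"]:
        print("done at:", g["current"])
        g["task"] = "done"
        return True
    return False

productions = [increment, stop]
fired = True
while fired:
    # any() short-circuits, so at most one production fires per cycle.
    fired = any(p(goal) for p in productions)
```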

Cognitive architecture

Distinctions

Some well-known cognitive architectures

See also

MALLET

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics. In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text; algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. Topic models are useful for analyzing large collections of unlabeled text. Many of the algorithms in MALLET depend on numerical optimization. The toolkit is open-source software, released under the Apache 2.0 License.
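MALLET is driven from Java or its command-line tools; purely as an illustration of the same convert-to-features / train / evaluate pipeline, here is a sketch using scikit-learn in Python as a stand-in (this is not MALLET's API):

```python
# Document classification pipeline analogous to MALLET's:
# text -> bag-of-words features -> Naive Bayes -> evaluation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_docs = ["the market rallied today", "the team won the game",
              "stocks fell on weak earnings", "a thrilling overtime victory"]
train_labels = ["finance", "sports", "finance", "sports"]

vectorizer = CountVectorizer()                    # text -> features
clf = MultinomialNB().fit(vectorizer.fit_transform(train_docs),
                          train_labels)

test_docs = ["earnings lifted the market", "the game went to overtime"]
pred = clf.predict(vectorizer.transform(test_docs))
print(pred)                                       # predicted labels
print(accuracy_score(["finance", "sports"], pred))
```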
