
Natural Language Processing (NLP)

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

History
Natural Language Processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing Test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. The proposed test includes a task that involves the automated interpretation and generation of natural language.

NLP Use Cases
NLP represents the automatic handling of natural human language in forms like speech and text, and although the concept itself is fascinating, the real value behind this technology comes from its use cases. NLP can help with a wide range of tasks, and its fields of application seem to grow daily. Some of the use cases are:

Text and Speech Processing

  • Optical Character Recognition (OCR): Given an image representing printed text, determine the corresponding text.
  • Speech Recognition: Given a sound clip of a person speaking, determine the textual representation of the speech. This is the opposite of text-to-speech and is one of the extremely difficult problems colloquially termed "AI-Complete". In natural speech, there are hardly any pauses between successive words, so speech segmentation is a necessary subtask of speech recognition. In most spoken languages, the sounds representing successive letters blend into each other in a process termed co-articulation, so the conversion of the analog signal to discrete characters can be very difficult. Also, given that words in the same language are spoken by people with different accents, speech recognition software must be able to map this wide variety of inputs to the same textual equivalent.
  • Speech Segmentation: Given a sound clip of a person or people, separate it into words. A subtask of speech recognition and typically grouped with it.
  • Text to Speech: Given a text, transform those units and produce a spoken representation. Text-to-speech can be used to aid the visually impaired.
  • Word Segmentation (Tokenization): Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese, and Thai do not mark word boundaries in such a fashion, and in those languages, text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Sometimes this process is also used in cases like bag of words (BOW) creation in data mining.
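For space-delimited languages like English, the tokenization described above can be sketched with a simple rule-based approach. This is a minimal illustration using Python's standard `re` module; real tokenizers (e.g., in NLTK or spaCy) handle contractions, hyphenation, and many edge cases this sketch ignores.

```python
import re

def tokenize(text):
    """Split text into word tokens, treating punctuation as separate tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches a single
    # punctuation character. Whitespace itself is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note that this approach breaks down entirely for languages such as Chinese or Thai, where word boundaries are not marked by spaces and segmentation requires a lexicon or a trained model.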

Morphological Analysis

  • Lemmatization: The task of removing inflectional endings and returning the base dictionary form of a word, known as a lemma. Like stemming, lemmatization reduces words to a normalized form, but the transformation uses a dictionary to map each word to its actual base form.
  • Morphological Segmentation: Separate words into individual morphemes and identify the class of morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. 
  • Part of Speech Tagging: Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight").
  • Stemming: The process of reducing inflected (or sometimes derived) words to a base form (e.g., "close" will be the root for "closed", "closing", "close", "closer", etc). Stemming yields results similar to lemmatization, but does so on the basis of rules rather than a dictionary.
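The contrast between stemming and lemmatization drawn above can be made concrete with a toy sketch. The suffix list and the lemma table below are deliberately tiny, hypothetical stand-ins; a real system would use something like the Porter stemmer and a lexicon such as WordNet.

```python
# Rule-based stemming: strip the first matching suffix, no dictionary lookup.
SUFFIXES = ["ing", "ed", "er", "s"]  # checked in order, longest first

def stem(word):
    for suf in SUFFIXES:
        # Require a stem of at least 3 letters to avoid over-stripping.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# Dictionary-based lemmatization: a tiny lookup table stands in for a
# real lexicon; unknown words fall back to the rule-based stemmer.
LEMMAS = {"went": "go", "better": "good", "closing": "close"}

def lemmatize(word):
    return LEMMAS.get(word, stem(word))

print(stem("closing"))       # 'clos'  -- rules can over-strip
print(lemmatize("closing"))  # 'close' -- the dictionary gives the true lemma
print(lemmatize("went"))     # 'go'    -- irregular forms need a dictionary
```

The irregular form "went" illustrates why lemmatization needs a dictionary: no suffix rule can recover "go" from it.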

Syntactic Analysis

  • Grammar Induction: Generate a formal grammar that describes a language's syntax.
  • Sentence Breaking: Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g., marking abbreviations).
  • Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural language is ambiguous and typical sentences have multiple possible analyses; perhaps surprisingly, for a typical sentence, there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: dependency parsing and constituency parsing.
    Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar (PCFG).
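The sentence-breaking problem above, where a period may end a sentence or mark an abbreviation, can be sketched with a naive splitter. The abbreviation list here is a small hypothetical sample; production systems use much larger lists or trained models (e.g., the Punkt algorithm in NLTK).

```python
# Naive sentence breaker: split after . ! ? unless the token is a
# known abbreviation. A tiny, illustrative abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        # End the sentence only if the token's final punctuation is not
        # explained by a listed abbreviation.
        if tok[-1] in ".!?" and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

Cases this sketch cannot handle, such as "He works at Apple Inc. He likes it.", show why real sentence breakers rely on statistics rather than fixed lists.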

Lexical Semantics (of individual words in context)

  • Lexical semantics: What is the computational meaning of individual words in context?
  • Distributional semantics: How can we learn semantic representations from data?
  • Named Entity Recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g., person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient. For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g., Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives.
  • Sentiment Analysis: Extract subjective information, usually from a set of documents such as online reviews, to determine the "polarity" expressed about specific objects. It is especially useful for identifying trends of public opinion in social media, for example in marketing.
  • Terminology Extraction: The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
  • Word Sense Disambiguation (WSD): Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g., from a dictionary or an online resource such as WordNet.
  • Entity Linking: Many words - typically proper names - refer to named entities; here we have to select the entity (a famous individual, a location, a company, etc) which is referred to in context.
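The word sense disambiguation task described above is often introduced via the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the sentence's context. The two-sense inventory for "bank" below is a hypothetical stand-in for a real resource such as WordNet.

```python
# Simplified Lesk: score each candidate sense by the overlap between its
# gloss and the context sentence, and return the highest-scoring sense.
SENSES = {
    "bank": {
        "financial": "an institution that accepts deposits and lends money",
        "river": "the sloping land beside a body of water",
    }
}

def lesk(word, context_sentence):
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(lesk("bank", "He sat on the bank of the river watching the water"))
# 'river'
```

Overlap counting on raw words is crude (stopwords like "the" inflate scores), which is why practical WSD systems weight or filter the gloss words first.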

Relational Semantics (semantics of individual sentences)

  • Relationship Extraction: Given a chunk of text, identify the relationships among named entities (e.g., who is married to whom).
  • Semantic Parsing: Given a piece of text (typically a sentence), produce a formal representation of its semantics, either as a graph (e.g., in AMR parsing) or in accordance with a logical formalism (e.g., in DRT parsing). This challenge typically includes aspects of several more elementary NLP tasks from semantics (e.g., semantic role labeling, word sense disambiguation) and can be extended to include full-fledged discourse analysis.
  • Semantic Role Labeling: Given a sentence, identify and disambiguate semantic predicates (e.g., verbal frames), then identify and classify the frame elements (semantic roles).
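A classic entry point to the relationship extraction task above is pattern matching over text. The single "married-to" pattern below is a hypothetical illustration; real systems combine many such patterns with learned models and rely on named entity recognition rather than a bare regular expression.

```python
import re

# One hand-written surface pattern for the "married-to" relation.
# (\w+(?: \w+)?) captures one- or two-word names on either side.
MARRIED = re.compile(r"(\w+(?: \w+)?) is married to (\w+(?: \w+)?)")

def extract_marriages(text):
    """Return (spouse_a, spouse_b) pairs matched by the pattern."""
    return MARRIED.findall(text)

print(extract_marriages("Marie Curie is married to Pierre Curie."))
# [('Marie Curie', 'Pierre Curie')]
```

The brittleness of this sketch (it misses "Pierre, who married Marie, ...") is precisely why relationship extraction is usually combined with parsing and statistical learning.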
