The Basics of Natural Language Processing

How do we get non-humans to talk to us, translate text from one language to another, read and understand our documents, summarize large volumes of text rapidly, and give us answers, all in real time? Because that's exactly what machines called Alexa or Siri do, or the conversational AI at Capital One that tells me the answer to my question (and often gets it wrong), or Google and other search engines that not only use autocorrect to assist me with my queries, but also serve up responses that answer them.

In the same category, there are AI translators like Google Translate that instantly translate text from one language to another (I just hover my phone over the word and Google does the rest!), and plagiarism checkers like Grammarly's that let the editors of C2C check whether this article's plagiarized. (No fear!)

It's not much different from teaching children or ESL students to read and speak English, or any language for that matter.

We do it through natural language processing, called NLP.

C2C Event Alert:

Interested in NLP? Keep up with the FullFilld story: Journey to Deployment and hear how they're using NLP to build their product. You'll connect directly with the CTO, @mpytel, @YoshEisbart and development teams, and can share your own expertise, provide feedback and learn how they're overcoming similar challenges.

What's Natural Language Processing?

Natural language processing (NLP) traces its roots to the 1950s, when Alan Turing proposed a test of whether a computer could mimic human responses.

NLP is a two-step process. First, scientists strip the training data down to its rudiments for machines to work with. This is called "data preprocessing". Scientists then use one or more machine learning techniques to train the algorithm to understand and respond as required.

Here's how it works.

Phase 1: Data Preprocessing.

Computer scientists break the text down to its basics through the following steps:

1. Segmentation

The text is broken down into smaller constituent units, such as sentences or clauses.

Example:

The sentence "Digital assistants are mostly female because studies show you're more attracted to a woman's voice" gets broken into:

"Digital assistants are mostly female"
"Studies show you're more attracted to a woman's voice"

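As a rough illustration of segmentation, here is a minimal Python sketch. It assumes that units end at a period, question mark or exclamation point, which is a simplification: the example above splits at the word "because", and real segmenters also handle abbreviations and clauses. The function name `segment_sentences` is made up for this sketch.

```python
import re

def segment_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # Assumption: sentence-ending punctuation is a good enough boundary.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = ("Digital assistants are mostly female. "
        "Studies show you're more attracted to a woman's voice.")
sentences = segment_sentences(text)
print(sentences)
```
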
2. Tokenizing

We need the algorithm to understand the constituent words, so we "tokenize" those words.

Example:

"Digital assistants are mostly female"

We isolate each word: "Digital". "Assistants". "Are". "Mostly". "Female".
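
A minimal Python sketch of tokenizing might look like the following. The regular expression is an assumption made for illustration (it treats a word as a run of letters with an optional apostrophe part, so "you're" stays whole); production tokenizers use far richer rules.

```python
import re

def tokenize(sentence):
    """Return the words in a sentence, dropping spaces and punctuation."""
    # Letters, optionally followed by an apostrophe and more letters.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)

tokens = tokenize("Digital assistants are mostly female")
print(tokens)
```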

3. Stop Words

We eliminate inessential words that are only there to make a sentence more cohesive. Common examples are "and", "the", "are".

Example:

In "Digital assistants are mostly female", the stop words would be "Are" and "Mostly".

Leaving us with:

"Digital". "Assistants". "Female".

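Stop-word removal can be sketched in a few lines of Python. The `STOP_WORDS` set below is a tiny hypothetical list chosen to match the example above; real stop-word lists contain a hundred or more entries and differ from tool to tool.

```python
# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"and", "the", "are", "mostly"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["Digital", "assistants", "are", "mostly", "female"]
print(remove_stop_words(tokens))
```
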
4. Stemming

Now that we've broken down the document and hacked it to its essentials, we need to explain its meaning to our machine. We do that by pointing out that words such as "skipping", "skips", and "skipped" are the same word ("skip") with different suffixes attached.
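
To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the Porter stemmer or any other published algorithm; the suffix list and the doubled-consonant rule are assumptions chosen so that the "skip" example works, and they will mis-stem plenty of other words.

```python
SUFFIXES = ("ing", "ed", "s")  # assumption: only these three suffixes

def naive_stem(word):
    """Strip one common suffix and undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            doubled = stem[-1] == stem[-2] and stem[-1] not in "aeiou"
            if suffix != "s" and doubled:
                stem = stem[:-1]  # "skipp" -> "skip"
            return stem
    return word

print([naive_stem(w) for w in ["skipping", "skips", "skipped"]])
```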

5. Lemmatization

We also consider the context and convert each word to its dictionary base form, regardless of tense, mood, number or gender. (This base form is called the "lemma".) Common examples are "am", "are" and "is", which all share the lemma "be".

Example:

In "Digital assistants are mostly female", we tag the word "are" as present plural, with the lemma "be".
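
One simple way to picture lemmatization is as a dictionary lookup. The sketch below assumes a tiny hand-written table named `LEMMAS` (hypothetical, for illustration only); real lemmatizers rely on large lexicons and on the word's part of speech.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMAS = {
    "am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
    "assistants": "assistant", "studies": "study",
}

def lemmatize(word):
    """Return the lemma if known, otherwise the lowercased word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["Am", "Are", "Is"]])
```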

6. Part-of-Speech Tagging

Here's where we explain the concept of nouns, verbs, adjectives, adverbs and the like to the machine by adding those tags to our words.

Example:

"Studies (noun) show (verb) you (pronoun) 're (verb) more (adverb) attracted (adjective) to (preposition) a (determiner) woman's (noun) voice (noun)"

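A toy version of part-of-speech tagging can be written as a lexicon lookup. The `LEXICON` below is a hypothetical hand-made table for a few words; real taggers are trained on annotated text and use the surrounding words to decide, since many words (such as "show") can be a noun or a verb.

```python
# Hypothetical mini-lexicon: one fixed tag per word, ignoring context.
LEXICON = {
    "studies": "noun", "show": "verb", "you": "pronoun",
    "more": "adverb", "attracted": "adjective", "to": "preposition",
    "a": "determiner", "voice": "noun",
}

def tag_parts_of_speech(tokens):
    """Pair each token with its tag, or 'unknown' if it is not listed."""
    return [(token, LEXICON.get(token.lower(), "unknown")) for token in tokens]

print(tag_parts_of_speech(["Studies", "show", "voice"]))
```
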
7. Named Entity Tagging

We introduce our machine to pop culture references and everyday names by flagging names of movies, important personalities or locations, and so forth that may appear in the document.
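
The simplest form of named entity tagging is matching text against a list of known names, sometimes called a gazetteer. The sketch below assumes a small hypothetical `GAZETTEER` with made-up label names; it only finds the first occurrence of each name, and modern systems learn to spot entities from context instead.

```python
# Hypothetical gazetteer: known names and the label we assign to each.
GAZETTEER = {
    "Alexa": "PRODUCT",
    "Siri": "PRODUCT",
    "Google Translate": "PRODUCT",
    "Wells Fargo": "ORGANIZATION",
    "Saudi Arabia": "LOCATION",
}

def tag_entities(text):
    """Return (name, label) pairs for known names, in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)  # first occurrence only
        if start != -1:
            found.append((start, name, label))
    return [(name, label) for _, name, label in sorted(found)]

print(tag_entities("Siri could not find a Wells Fargo branch in Saudi Arabia"))
```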

Phase 2: Algorithm Development

Computer scientists use different natural language processing methods to train the model to understand and respond accordingly. The two most common methods are:

Machine learning algorithms like Naive Bayes to teach our models human sentiment and speech.
Rules-based systems, namely human-made rules that scientists use to program algorithms. Example: Robots in Saudi Arabia get passports. IF AI Sophia lives in Saudi Arabia, THEN she gets guaranteed nationality.
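
To give a feel for the first method, here is a minimal Naive Bayes sentiment classifier written from scratch in Python. The class name, the four training sentences and the labels are all invented for illustration, and the add-one smoothing is one common choice among several; a real model would be trained on thousands of labeled examples, typically with an established library.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Word-count Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus the log likelihood of each word given the label.
            score = math.log(count / total_examples)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                seen = self.word_counts[label][word]
                score += math.log((seen + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data, far too small for real use.
model = NaiveBayesSentiment()
model.train([
    ("great helpful answer", "positive"),
    ("love this assistant", "positive"),
    ("wrong answer again", "negative"),
    ("terrible useless assistant", "negative"),
])
print(model.predict("helpful assistant"))
print(model.predict("useless answer"))
```

A rules-based system, by contrast, would hard-code the decision (IF the text contains "wrong", THEN label it negative) instead of learning it from examples.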

What is NLP Used For?

Natural language processing (NLP) is used for a variety of functions that include:

Text classification, where you teach the algorithm to recognize and categorize text. Example: Gmail's spam classifier, which filters spam email.
Text extraction, where an algorithm is fed a quantity of material and asked to rapidly summarize it. Example: Google Scholar, which summarizes quantities of academic research material.
Machine translation, where the algorithm is trained to translate spoken or written words from one language to another.
Natural language generation, where an AI assembles coherent text from scattered inputs. Example: automated journalism, where an engine scrapes the web for news and returns a summary in seconds.
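
As one illustration of the summarizing idea, the sketch below scores each sentence by how often its words appear in the whole text and keeps the top scorers. This is a simple frequency heuristic chosen for illustration, not how any of the products named above work; the function name and stop-word list are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "in", "it"}

def content_words(text):
    """Lowercased words with the stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    frequency = Counter(content_words(text))
    scores = [sum(frequency[w] for w in content_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)

text = ("NLP helps machines read text. "
        "NLP helps machines translate text and summarize text. "
        "Cats sleep.")
print(summarize(text))
```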

Four Open Problems in NLP

As evolved as the field's become, robots are still challenged in certain areas. These include:

Context. Even the most sophisticated machines are challenged by ambiguous words. Example: You could tell an AI to meet you at the "bank", and it could go to the stream or to Wells Fargo. Likewise, you may tell the machine "You're great!" and the robot exclaims "Thank you!" when really you're frustrated: "You're (grunt) great."
The evolving use of language. The model needs to be retrained to acquire updated language and trending expressions.
Named Entity Recognition (NER). Recognizing names of "big shots" or famous companies is insufficient. Algorithms need to recognize items such as person names, organizations, locations, medical codes, quantities, monetary values, and so forth.
Sophisticated vocabulary. To be super-helpful, NLP needs to acquire a broad and nuanced vocabulary. For most NLP software applications, that is (at the moment) beyond their reach.
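
The "bank" problem can be made concrete with a small sketch. The approach below picks the sense whose clue words overlap most with the rest of the sentence, loosely in the spirit of dictionary-overlap methods such as the Lesk algorithm. The two senses and their clue words are hypothetical, and a sentence with no clues at all simply falls back to the first sense, which shows why context remains hard.

```python
import re

# Hypothetical clue words for two senses of "bank".
SENSES = {
    "river bank": {"river", "stream", "water", "shore", "fishing"},
    "financial bank": {"money", "account", "loan", "deposit", "cash"},
}

def disambiguate_bank(sentence):
    """Pick the sense sharing the most clue words with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # On a tie (including zero overlap), max() returns the first sense listed.
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate_bank("Meet me at the bank to deposit cash"))
print(disambiguate_bank("Meet me at the bank by the stream"))
```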

Bottom Line

The wonder of natural language processing (NLP) is that these non-human machines are more intelligent and articulate than a regular random sampling of our human population. Their knowledge is immense, their linguistic skills incredible (the most sophisticated have mastered more than 100 languages) and their responses are mostly spot-on. Yet they struggle with context, emotions, slang, and the like. That's our instructional challenge, and one where the Google Natural Language API is said to excel. On the other hand, some AI researchers believe machines may never acquire this human-level cognition. They're machines, after all.

Let's Connect!

Leah Zitter, Ph.D., has a Masters in Philosophy, Epistemology and Logic and a Ph.D. in Research Psychology.
