Teaching a machine learning model to think is one of the most challenging - and rewarding - tasks technology can accomplish. When you want your model to recognize images, you simply convert them into numbers, or vectorize them, in a process called "feature extraction" or "feature encoding." For example, you might encode the image of a cat as a long vector of numeric pixel values.
But how do you train the model to recognize text? After all, text data is abstract; it's composed of words with various conceptual referents. That's where the bag-of-words (BoW) model comes in. Using this model, you place your words into one or more "bags" - multisets that ignore grammar and word order - and vectorize them on a spreadsheet. This helps you classify documents, calculate probabilities, detect spam, and more.
Read on to learn how the BoW model solves a series of common but critical problems.
What if I am working on an application with document-scanning capabilities and I want it to do more than just recognize text? I want to teach my ML model to understand one or more sentences. I can teach my algorithm how to convert images into binary form, but how am I going to train it on abstract text?
I convert the text data into binary metrics on a spreadsheet, just as I would with vectorized images.
Sentence: "I like to go to the movies."
Now that I've trained my model to identify the theme in the sentence, it can proceed to do the heavy lifting, which is what it's best at. In other words, my model, now trained through BoW, can predict, analyze, categorize, and so forth.
I want my application to facilitate better sorting and organization of scanned documents. I need to train my model to tell me how many times certain keywords appear in certain sentences. How can the BoW model help?
Sentence 1: "I like to go to the movies."
Sentence 2: "I do not like movies like this."
Each of these sentences is itself a BoW, since each can be reduced to an unordered collection of the words it contains. To determine how many times each word in the first sentence appears, I first tabulate the frequency of the words in each BoW:
| Word | BoW (1) | BoW (2) |
| --- | --- | --- |
| I | 1 | 1 |
| Like | 1 | 2 |
| To | 2 | 0 |
| Go | 1 | 0 |
| The | 1 | 0 |
| Movies | 1 | 1 |
Then, I can count the total number of occurrences of each word by adding across both columns. For instance, the word "movies" appears twice in our combined bag of words.
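A short sketch of this counting in Python, using the standard library's `collections.Counter` (the `tokenize` helper is an illustrative assumption, not a library function):

```python
from collections import Counter

# Deliberately simple tokenization: lowercase and strip periods.
def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

bow_1 = Counter(tokenize("I like to go to the movies."))
bow_2 = Counter(tokenize("I do not like movies like this."))

print(bow_1["to"])    # 2
print(bow_2["like"])  # 2

# Adding the two counters gives the combined bag of words.
combined = bow_1 + bow_2
print(combined["movies"])  # 2
```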
I want to be able to search scanned documents for particular text data. To do so, I need to know whether certain words appear in more than one sentence. Here's where I use the "both" feature.
BoW 1: "I like to go to the movies."
BoW 2: "I do not like movies like this."
I can still use the same table, but this time I'll add a column to keep track of which words appear in both sentences:
| Word | BoW (1) | BoW (2) | Both |
| --- | --- | --- | --- |
| I | 1 | 1 | 1 |
| Like | 1 | 2 | 1 |
| To | 2 | 0 | 0 |
| Go | 1 | 0 | 0 |
| The | 1 | 0 | 0 |
| Movies | 1 | 1 | 1 |
Unlike the words I, like, and movies - which appear in both sentences - the words to, go, and the appear only in BoW (1). Thus, I tag the first set of words with a 1 in the Both column and the second set with a 0.
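The Both column is straightforward to compute in code. A sketch, reusing the same simple tokenizer as above:

```python
from collections import Counter

def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

bow_1 = Counter(tokenize("I like to go to the movies."))
bow_2 = Counter(tokenize("I do not like movies like this."))

# A word scores 1 in the Both column only if it appears in both bags.
both = {word: int(word in bow_2) for word in bow_1}
print(both)
# {'i': 1, 'like': 1, 'to': 0, 'go': 0, 'the': 0, 'movies': 1}
```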
When I refer back to my scanned documents, I want to be able to keep track of which information is most critical. Therefore, I want my model to score the frequency of certain key terms in the document as a whole.
Example
BoW 1: "I like to go to the movies."
BoW 2: "I do not like movies like this."
The BoW model pairs with a weighting scheme that enables this kind of scoring: term frequency-inverse document frequency (TFIDF):
| Word | BoW (1) | BoW (2) | TFIDF (1) | TFIDF (2) |
| --- | --- | --- | --- | --- |
| I | 1 | 1 | 1/7 | 1/7 |
| Like | 1 | 2 | 1/7 | 2/7 |
| To | 2 | 0 | 2/7 | 0 |
| Go | 1 | 0 | 1/7 | 0 |
| The | 1 | 0 | 1/7 | 0 |
| Movies | 1 | 1 | 1/7 | 1/7 |
By dividing the number of times a word appears in a sentence by the total number of words in that sentence, this scheme scores each word by frequency per sentence - the term-frequency part of TFIDF. The word to appears 2 times out of the 7 total words in the first sentence, for a score of 2/7; it appears 0 times in the second sentence. The inverse-document-frequency part then downweights words that show up in many documents, so common filler words don't dominate the scores.
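A sketch of that computation in Python. The `tf` and `idf` helper names are my own illustrative choices; real implementations (for example, scikit-learn's TfidfVectorizer) add smoothing and normalization on top of this basic formula:

```python
from collections import Counter
import math

def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

sentences = ["I like to go to the movies.",
             "I do not like movies like this."]
bags = [Counter(tokenize(s)) for s in sentences]

# Term frequency: occurrences of a word divided by the number of
# words in that sentence (the "2 out of 7" computation above).
def tf(word, bag):
    return bag[word] / sum(bag.values())

print(tf("to", bags[0]))  # 2/7, about 0.286
print(tf("to", bags[1]))  # 0.0

# Inverse document frequency downweights words that appear in many
# documents; a word appearing in every document scores log(1) = 0.
def idf(word, bags):
    doc_count = sum(1 for bag in bags if word in bag)
    return math.log(len(bags) / doc_count) if doc_count else 0.0

# The full TFIDF score is the product of the two parts.
print(tf("to", bags[0]) * idf("to", bags))  # about 0.198
```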
Finally, I want to make sure my scanned documents are coming from a trustworthy source. The bag-of-words method is frequently used for spam detection.
Take these two phrases, which could easily pass as email subject lines:
BoW 1: "Send money to me through PayPal"
BoW 2: "Get rich today"
One is legitimate while the other is spam. How can I train my model to know which to delete? First, I use Bayes' theorem of probability:
P(L | W) vs. P(S | W)
L = legitimate; S = spam; W = the words the message contains
This theorem determines how likely a message is to be legitimate or spam, given the words it uses.
Then, I break each phrase into keywords, assigning each keyword a spam probability. For example: PayPal is weighted 0%, a strong signal that the message is legitimate, while the words money, get, rich, and today are each weighted 10%. Finally, I combine the spam weights of the keywords in each phrase (here, by adding them) to get my results, as sketched in the code after the table:
| Label | Phrase | Spam score |
| --- | --- | --- |
| Legit | "Send money to me through PayPal" | 10% |
| Spam | "Get rich today" | 30% |
As a result, I train my ML model to conclude that sentences like BoW(2) are highly likely to be spam.
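A minimal sketch of this word-weight scoring. The 0% and 10% weights are the illustrative values from the example above, not values a real filter would use; in practice, per-word probabilities are learned from labeled mail via Bayes' theorem rather than hard-coded:

```python
# Illustrative spam weights taken from the example above.
SPAM_WEIGHTS = {
    "paypal": 0.0,   # strong signal of a legitimate message
    "money": 0.10,
    "get": 0.10,
    "rich": 0.10,
    "today": 0.10,
}

def spam_score(phrase):
    words = phrase.lower().split()
    # Combine the spam weights of the keywords found in the phrase.
    return sum(SPAM_WEIGHTS.get(word, 0.0) for word in words)

print(spam_score("Send money to me through PayPal"))  # 0.10 -> likely legit
print(spam_score("Get rich today"))                   # 0.30 -> likely spam
```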
The examples above describe a series of use cases for the BoW model, but there are others, too, including document classification, sentiment analysis, information retrieval, and generating input features for downstream ML models.
In some contexts, using the bag-of-words model can introduce unintended problems. Watch out for these potential issues when using this model: because word order is discarded, negation and phrasing are lost ("I do not like movies like this" shares most of its bag with far more positive sentences); the vocabulary - and therefore the vector size - grows with every new document, producing large, sparse vectors; and every word is treated as independent, so synonyms and related terms share no signal.
Many of these issues could also be remedied with Google Natural Language API, which applies natural-language understanding (NLU), or natural-language interpretation (NLI), to help computers understand and respond to humans in our own language.
Have you ever worked with the BoW model? Would the BoW model be useful for any projects in your ML workflow? Reach out and let us know what you're thinking.