Teaching a machine learning model to think is one of the most challenging - and rewarding - tasks technology can accomplish. When you want your model to recognize images, you simply convert them into numbers, or vectorize them, in a process called "feature extraction" or "feature encoding." For example, you might encode the image of a cat as a long vector of numeric pixel values.
But how do you train the model to recognize text? After all, text data is abstract; it's composed of words with various conceptual referents. That's where the bag-of-words (BoW) model comes in. Using this model, you place your words into one or more "bags" - multisets that ignore grammar and word order - and vectorize them on a spreadsheet. This helps you classify documents, calculate probabilities, detect spam, and more.
Read on to learn how the BoW model solves a series of common but critical problems.
What if I am working on an application with document-scanning capabilities and I want it to do more than just recognize text? I want to teach my ML model to understand one or more sentences. I can teach my algorithm how to convert images into binary form, but how am I going to train it on abstract text?
I convert the text data into binary metrics on a spreadsheet, just as I would with vectorized images.
Sentence: "I like to go to the movies."
Now that I've trained my model to identify the theme in the sentence, it can proceed to do the heavy lifting, which is what it's best at. In other words, my model, now trained through BoW, can predict, analyze, categorize, and so forth.
I want my application to facilitate better sorting and organization of scanned documents. I need to train my model to tell me how many times certain keywords appear in certain sentences. How can the BoW model help?
Sentence 1: "I like to go to the movies."
Sentence 2: "I do not like movies like this."
Each of these sentences is itself a BoW, since each can be reduced to an unordered collection of the words it contains. To determine how many times each word in the first sentence appears, I first tabulate the frequency of the words in each BoW:
| Word | BoW (1) | BoW (2) |
| --- | --- | --- |
| I | 1 | 1 |
| Like | 1 | 2 |
| To | 2 | 0 |
| Go | 1 | 0 |
| The | 1 | 0 |
| Movies | 1 | 1 |
Then, I can count the total number of occurrences of each word by adding across both columns. For instance, the word "movies" appears twice in our combined bag of words.
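A short sketch of this counting in Python, using the standard library's `collections.Counter` (the `tokenize` helper is an illustrative assumption, not a library function):

```python
from collections import Counter

# Deliberately simple tokenization: lowercase and strip periods.
def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

bow_1 = Counter(tokenize("I like to go to the movies."))
bow_2 = Counter(tokenize("I do not like movies like this."))

print(bow_1["to"])    # 2
print(bow_2["like"])  # 2

# Adding the two counters gives the combined bag of words.
combined = bow_1 + bow_2
print(combined["movies"])  # 2
```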
I want to be able to search scanned documents for particular text data. To do so, I need to know whether certain words appear in more than one sentence. Here's where I use the "both" feature.
BoW 1: "I like to go to the movies."
BoW 2: "I do not like movies like this."
I can still use the same table, but this time I'll add a column to keep track of which words appear in both sentences:
| Word | BoW (1) | BoW (2) | Both |
| --- | --- | --- | --- |
| I | 1 | 1 | 1 |
| Like | 1 | 2 | 1 |
| To | 2 | 0 | 0 |
| Go | 1 | 0 | 0 |
| The | 1 | 0 | 0 |
| Movies | 1 | 1 | 1 |
Unlike the words I, like, and movies - which appear in both sentences - the words to, go, and the appear only in BoW (1). Thus, I tag the first set of words with a 1 in the Both column and the second set with a 0.
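The Both column is straightforward to compute in code. A sketch, reusing the same simple tokenizer as above:

```python
from collections import Counter

def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

bow_1 = Counter(tokenize("I like to go to the movies."))
bow_2 = Counter(tokenize("I do not like movies like this."))

# A word scores 1 in the Both column only if it appears in both bags.
both = {word: int(word in bow_2) for word in bow_1}
print(both)
# {'i': 1, 'like': 1, 'to': 0, 'go': 0, 'the': 0, 'movies': 1}
```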
When I refer back to my scanned documents, I want to be able to keep track of which information is most critical. Therefore, I want my model to score the frequency of certain key terms in the document as a whole.
Example
BoW 1: "I like to go to the movies."
BoW 2: "I do not like movies like this."
The BoW model pairs with a weighting scheme that enables this kind of scoring: term frequency-inverse document frequency (TFIDF):
| Word | BoW (1) | BoW (2) | TFIDF (1) | TFIDF (2) |
| --- | --- | --- | --- | --- |
| I | 1 | 1 | 1/7 | 1/7 |
| Like | 1 | 2 | 1/7 | 2/7 |
| To | 2 | 0 | 2/7 | 0 |
| Go | 1 | 0 | 1/7 | 0 |
| The | 1 | 0 | 1/7 | 0 |
| Movies | 1 | 1 | 1/7 | 1/7 |
By dividing the number of times a word appears in a sentence by the total number of words in that sentence, this scheme scores each word by frequency per sentence - the term-frequency part of TFIDF. The word to appears 2 times out of the 7 total words in the first sentence, for a score of 2/7; it appears 0 times in the second sentence. The inverse-document-frequency part then downweights words that show up in many documents, so common filler words don't dominate the scores.
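A sketch of that computation in Python. The `tf` and `idf` helper names are my own illustrative choices; real implementations (for example, scikit-learn's TfidfVectorizer) add smoothing and normalization on top of this basic formula:

```python
from collections import Counter
import math

def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

sentences = ["I like to go to the movies.",
             "I do not like movies like this."]
bags = [Counter(tokenize(s)) for s in sentences]

# Term frequency: occurrences of a word divided by the number of
# words in that sentence (the "2 out of 7" computation above).
def tf(word, bag):
    return bag[word] / sum(bag.values())

print(tf("to", bags[0]))  # 2/7, about 0.286
print(tf("to", bags[1]))  # 0.0

# Inverse document frequency downweights words that appear in many
# documents; a word appearing in every document scores log(1) = 0.
def idf(word, bags):
    doc_count = sum(1 for bag in bags if word in bag)
    return math.log(len(bags) / doc_count) if doc_count else 0.0

# The full TFIDF score is the product of the two parts.
print(tf("to", bags[0]) * idf("to", bags))  # about 0.198
```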
Finally, I want to make sure my scanned documents are coming from a trustworthy source. The bag-of-words method is frequently used for spam detection.
Take these two phrases, which could easily pass as email subject lines:
BoW 1: "Send money to me through PayPal"
BoW 2: "Get rich today"
One is legitimate while the other is spam. How can I train my model to know which to delete? First, I use Bayes' theorem of probability:
P(L | W) vs. P(S | W)
L = legitimate; S = spam; W = the words the message contains
This theorem determines how likely a message is to be legitimate or spam, given the words it uses.
Then, I break each phrase into keywords, assigning each keyword a spam probability. For example: PayPal is weighted 0%, a strong signal that the message is legitimate, while the words money, get, rich, and today are each weighted 10%. Finally, I combine the spam weights of the keywords in each phrase (here, by adding them) to get my results, as sketched in the code after the table:
| Label | Phrase | Spam score |
| --- | --- | --- |
| Legit | "Send money to me through PayPal" | 10% |
| Spam | "Get rich today" | 30% |
As a result, I train my ML model to conclude that sentences like BoW(2) are highly likely to be spam.
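A minimal sketch of this word-weight scoring. The 0% and 10% weights are the illustrative values from the example above, not values a real filter would use; in practice, per-word probabilities are learned from labeled mail via Bayes' theorem rather than hard-coded:

```python
# Illustrative spam weights taken from the example above.
SPAM_WEIGHTS = {
    "paypal": 0.0,   # strong signal of a legitimate message
    "money": 0.10,
    "get": 0.10,
    "rich": 0.10,
    "today": 0.10,
}

def spam_score(phrase):
    words = phrase.lower().split()
    # Combine the spam weights of the keywords found in the phrase.
    return sum(SPAM_WEIGHTS.get(word, 0.0) for word in words)

print(spam_score("Send money to me through PayPal"))  # 0.10 -> likely legit
print(spam_score("Get rich today"))                   # 0.30 -> likely spam
```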
The examples above describe a series of use cases for the BoW model, but there are others, too, including document classification, sentiment analysis, information retrieval, and generating input features for downstream ML models.
In some contexts, using the bag-of-words model can introduce unintended problems. Watch out for these potential issues when using this model: because word order is discarded, negation and phrasing are lost ("I do not like movies like this" shares most of its bag with far more positive sentences); the vocabulary - and therefore the vector size - grows with every new document, producing large, sparse vectors; and every word is treated as independent, so synonyms and related terms share no signal.
Many of these issues could also be remedied with Google Natural Language API, which applies natural-language understanding (NLU), or natural-language interpretation (NLI), to help computers understand and respond to humans in our own language.
Have you ever worked with the BoW model? Would the BoW model be useful for any projects in your ML workflow? Reach out and let us know what you're thinking.