Hello-World to Text Vectorization for ML problems

Aung Sett Paing
7 min read · Mar 15, 2022

How to represent text as numeric vectors using Bag-of-Words and TF-IDF

Photo by Sven Brandsma on Unsplash

This article covers only the fundamental approaches to vectorizing text for machine learning problems, although there are SOTA approaches such as word embeddings. For both the BoW and TF-IDF models, step-by-step implementations are walked through, and a link to the accompanying notebook is attached.

Contents

  1. Bag-of-words
  2. Step-by-step BoW Implementation
  3. TF-IDF
  4. TF-IDF Implementation
  5. Recap
  6. References

Bag-of-words

Machine learning algorithms cannot handle raw text as input directly, so text needs to be transformed into numeric tensors. This process is known as vectorization or feature extraction. In this section, Bag-of-Words, one of the simplest text representations, is explained.

The following figure shows the result we are going to achieve: the left table is the input text, and the right one is its encoded vectors.

Fig 1: Bag-of-words representation

Bag-of-Words can be used to represent texts for machine learning problems such as document classification and sentiment analysis.

The model keeps a bag of unique words, each assigned a number, but discards word order. The following visualization makes this more intuitive.

The BoW model does not carry word order or meaning, but it does carry the counts or frequencies of words, as shown in the figure above. Once fitted, the BoW model can be used to encode new texts.

Source: Google Images

For example, when a new text needs to be vectorized, each word of that text is looked up in the BoW dictionary; if it is found, its frequency is assigned to the respective position.

Therefore, the dictionary should contain as many words as possible so that new texts can be encoded. But there is a trade-off: with tens of thousands of words in the dictionary, the vectors become very long and sparse, which leads to higher computing cost when training the model.

Fig 2: Dictionary
Fig 3: BoW Representation for the new text

Let’s check the figure above. When the new text is vectorized, the BoW model simply drops the words Venus and hottest because they are not contained in the dictionary. The model only pays attention to the remaining words and assigns their frequencies to the respective columns.

Step-by-step BoW Implementation

Before we dive into this section, please keep in mind that there are libraries that already implement BoW. The aim of the step-by-step implementation is to explain each step of the BoW model. Okay, let’s start!

We are going to vectorize a sample corpus into numeric vectors using Python, step by step. There are other ways to vectorize texts, such as scikit-learn’s CountVectorizer. For your information, a link to the complete notebook is added at the end of this article.

  1. Tokenize corpus
  2. Standardize corpus
  3. Build vocabulary
  4. Encode text

Tokenize corpus

Take the following data as the sample corpus; it contains two sentences. First, the text needs to be split into tokens: character-level, word-level, or n-gram tokens. This process is called tokenization. In this case, the corpus will be tokenized at the word level.

# Make a corpus
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]

Standardize Corpus

Then the text needs normalizing: the data should be standardized by converting to lower or upper case, removing punctuation, and applying stemming or lemmatization. In this case, we will convert the texts to lowercase and discard the punctuation, as shown below.

# Standardize text data
corpus = [
    "jupiter is the gas giant planet.",
    "neptune is the ice giant planet."
]
# Tokenize text data, remove punctuations
corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
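
A minimal sketch of this step, assuming a simple lowercase-and-split approach (standardize_and_tokenize is an illustrative helper name, not necessarily the one used in the notebook):

import string

# Lowercase each sentence, strip punctuation, and split on whitespace.
def standardize_and_tokenize(corpus):
    tokenized = []
    for sentence in corpus:
        sentence = sentence.lower()
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        tokenized.append(sentence.split())
    return tokenized

corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]
print(standardize_and_tokenize(corpus))
# [['jupiter', 'is', 'the', 'gas', 'giant', 'planet'],
#  ['neptune', 'is', 'the', 'ice', 'giant', 'planet']]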

Build Vocabulary

We need to create our own dictionary so that we can encode our texts into vectors of numeric values. Here is the code.
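A minimal sketch, assuming the dictionary is a plain Python dict that maps each unique token to a column index (build_vocabulary is an illustrative name):

# Build a vocabulary mapping each unique token to a column index.
def build_vocabulary(tokenized_corpus):
    vocabulary = {}
    for tokens in tokenized_corpus:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

tokenized_corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
vocabulary = build_vocabulary(tokenized_corpus)
print(vocabulary)
# {'jupiter': 0, 'is': 1, 'the': 2, 'gas': 3, 'giant': 4, 'planet': 5, 'neptune': 6, 'ice': 7}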

Encode Text

Now, we can encode our texts into numeric vectors with the help of the dictionary. Let’s look at the following code.
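A minimal sketch of the encoding step (encode_bow is an illustrative name); note how tokens missing from the vocabulary are simply skipped, as described earlier:

# Encode each tokenized text as a vector of term counts;
# tokens that are not in the vocabulary are ignored.
def encode_bow(tokenized_corpus, vocabulary):
    vectors = []
    for tokens in tokenized_corpus:
        vector = [0] * len(vocabulary)
        for token in tokens:
            if token in vocabulary:
                vector[vocabulary[token]] += 1
        vectors.append(vector)
    return vectors

vocabulary = {'jupiter': 0, 'is': 1, 'the': 2, 'gas': 3,
              'giant': 4, 'planet': 5, 'neptune': 6, 'ice': 7}
print(encode_bow([["jupiter", "is", "the", "gas", "giant", "planet"],
                  ["neptune", "is", "the", "ice", "giant", "planet"]], vocabulary))
# [[1, 1, 1, 1, 1, 1, 0, 0], [0, 1, 1, 0, 1, 1, 1, 1]]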

That’s it! Now we can encode a text corpus with the BoW model. scikit-learn’s CountVectorizer does the same thing with fewer lines of code, and you can check it in the attached notebook.
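
For comparison, a quick sketch of that approach (assuming a recent scikit-learn version; older releases expose get_feature_names instead of get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]
vectorizer = CountVectorizer()             # lowercases and strips punctuation by default
bow = vectorizer.fit_transform(corpus)     # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # columns are sorted alphabetically
print(bow.toarray())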

TF-IDF

The BoW model only counts the frequencies of terms in each document: the more often a term appears, the more important it is assumed to be. But this can be biased, because words such as the, a, and is appear many times within a single document without making the sentence more meaningful or distinctive.

TF-IDF has two parts: term frequency and inverse document frequency. The term frequency (TF) part is similar to the BoW model; it computes the frequency, and thus the importance, of a term within a specific document. It is then weighted by the IDF value, which measures how common the term is across all documents in the corpus.

Fig 4: TFIDF Representation for the new text

Let’s check the corpus below and its TF-IDF vectors. The words is, the, giant, and planet appear in every document in the corpus; they are common and can no longer make any document stand out. However, the words jupiter and gas are rare and make the first document unique. Likewise, the words neptune and ice distinguish the second document, since they do not appear in the first one.

# Recall the corpus
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]
# TFIDF values
        gas  giant       ice   is   jupiter   neptune  planet  the
0  0.067578    0.0  0.000000  0.0  0.067578  0.000000     0.0  0.0
1  0.000000    0.0  0.067578  0.0  0.000000  0.067578     0.0  0.0

Therefore, TF-IDF calculates the importance of a term within each document and weights it by how common the term is across documents. The rarer the term, the more interesting it is. Here are the equations.

TFIDF equations
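Written out, and consistent with the definitions used in the implementation below (the exact variant differs slightly between libraries, as noted at the end of the TF-IDF implementation), they are roughly:

TF(t, d)    = (number of times term t appears in document d) / (total number of terms in d)
IDF(t)      = log( N / df(t) ),  where N is the total number of documents
              and df(t) is the number of documents containing t
TFIDF(t, d) = TF(t, d) * IDF(t)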

As an example, suppose we have 5 documents in the corpus and each document contains 5 vocabulary terms. According to the tables below, the value of term a gets larger the more often it appears in a single document, D1. That is how BoW works.

But the IDF value of term a gets smaller the more documents it is found in. This is how TF-IDF scales down the value of term a: by multiplying the BoW or TF value with the IDF value.

Fig 5: BoW and TFIDF examples
Fig 6: Comparison of BoW and TFIDF

TF-IDF Implementation

Once again, there are efficient, ready-made ways to vectorize texts, such as scikit-learn's TfidfVectorizer. But we will still explore every step of the TF-IDF model in this section.

So, we are about to implement TFIDF using python with the following steps.

  1. Prepare, Standardize and Tokenize Texts
  2. Term Frequency
  3. Inverse Document Frequency
  4. TF * IDF
  5. Encode Texts

Prepare, Standardize and Tokenize Texts

This step is the same as in the BoW model. After collecting a corpus, we standardize it, tokenize it, and build the dictionary.

Term Frequency

Term frequency indicates how important a term is within a specific document. It is calculated as the count of the term in that document divided by the total number of words in the document.
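
A minimal sketch of this definition (term_frequency is an illustrative name):

# Term frequency: count of the term in the document divided by the document length.
def term_frequency(term, tokenized_document):
    return tokenized_document.count(term) / len(tokenized_document)

print(term_frequency("jupiter", ["jupiter", "is", "the", "gas", "giant", "planet"]))
# 0.16666... (1 occurrence out of 6 terms)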

Inverse Document Frequency

IDF indicates the importance of a term across documents. If the term appears in every document, it can be considered unimportant compared to the rare words that make documents stand out. IDF is the log of the total number of documents in the corpus divided by the number of documents containing the specific term.
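
A minimal sketch of this definition, without any smoothing (inverse_document_frequency is an illustrative name):

import math

# Inverse document frequency: log of (total documents / documents containing the term).
# Note: no smoothing here, so a term that appears in no document would raise an error.
def inverse_document_frequency(term, tokenized_corpus):
    documents_with_term = sum(1 for doc in tokenized_corpus if term in doc)
    return math.log(len(tokenized_corpus) / documents_with_term)

tokenized_corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
print(inverse_document_frequency("the", tokenized_corpus))      # 0.0, appears in every document
print(inverse_document_frequency("jupiter", tokenized_corpus))  # 0.693..., appears in only one document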

TF * IDF

Now, the TF-IDF value for each term can be computed thanks to the TF and IDF functions implemented above. TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF). It can be implemented as follows.
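
A sketch building on the two helpers above (it assumes term_frequency and inverse_document_frequency from the previous sketches are in scope):

# TF-IDF is simply the product of the term frequency and the inverse document frequency.
def tf_idf(term, tokenized_document, tokenized_corpus):
    tf = term_frequency(term, tokenized_document)
    idf = inverse_document_frequency(term, tokenized_corpus)
    return tf * idf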

Encode Text

Now, we can encode our texts into numeric vectors with the help of the above function. Let’s look at the following code.
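A minimal sketch that puts the pieces together (encode_tfidf is an illustrative name; it assumes the tf_idf helper sketched above is in scope):

# Encode every document as a vector of TF-IDF values, one column per vocabulary term.
def encode_tfidf(tokenized_corpus, vocabulary):
    vectors = []
    for document in tokenized_corpus:
        vectors.append([tf_idf(term, document, tokenized_corpus) for term in vocabulary])
    return vectors

tokenized_corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
vocabulary = sorted({term for document in tokenized_corpus for term in document})
for row in encode_tfidf(tokenized_corpus, vocabulary):
    print([round(value, 6) for value in row])
# Terms that appear in every document (is, the, giant, planet) get 0.0;
# the exact values of the rare terms depend on which IDF variant is used.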

Note that our TF-IDF vectors will differ from others, especially scikit-learn's TfidfVectorizer, because it uses a different smoothing approach and modifies the IDF equation so that common words are not entirely ignored. This can be checked here.
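
For comparison, this is roughly how the scikit-learn version is used; its defaults (smooth_idf=True and L2 normalization) are why its numbers will not match the step-by-step values above:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]
vectorizer = TfidfVectorizer()          # smoothed IDF and L2 normalization by default
tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(3))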

Recap

There are several ways to vectorize texts into numeric tensors, including BoW, TF-IDF, and SOTA approaches such as word embeddings, which were not covered in this article. BoW and TF-IDF are still worth knowing as the hello-world approaches to feature extraction for text problems.

This is the end of the article. I hope you can now vectorize your texts for your machine learning problems. You can also access the accompanying notebook. Thanks for your time, and please feel free to leave any suggestions. Good luck!

References
