Hello-World to Text Vectorization for ML problems
How to represent text as numeric vectors using Bag-of-Words and TF-IDF
This article covers only the fundamental approaches to vectorizing text for machine learning problems, although there are SOTA approaches such as word embeddings. For both the BoW and TF-IDF models, step-by-step implementations are given, and a link to the accompanying notebook is attached.
Contents
- Bag-of-words
- Step-by-step BoW Implementation
- TF-IDF
- TF-IDF Implementation
- Recap
- References
Bag-of-words
Machine learning algorithms cannot handle raw text as input directly, so text needs to be transformed into numeric tensors. This process is known as vectorization or feature extraction. In this section, Bag-of-Words, one of the simplest text representations, is explained.
The following figure shows the result we are going to achieve. The left table is the input text, and the right one is its encoded vectors.
Bag-of-words can be used to represent texts for machine learning problems such as document classification, sentiment analysis, etc.
It keeps a bag of unique words, each assigned a number, but without any word order. The following visualization makes this more intuitive.
The BoW model does not preserve word order or meaning, but it does carry the counts or frequencies of words, as shown in the figure above. After fitting and transforming, the BoW model can be used to encode new texts.
For example, when a new text needs to be vectorized, each word of that text is looked up in the BoW dictionary; if it is found, its frequency is assigned to the respective position.
Therefore, the dictionary should contain as many words as possible so that new texts can be encoded. But there is a trade-off: with tens of thousands of words in the dictionary, the vectors become very sparse, which leads to high computing cost when training the model.
Let's check the figure above. When the new text is vectorized, the BoW model simply drops the words venus and hottest because they are not in the dictionary. The model only pays attention to the remaining words, assigning their frequencies to the respective columns.
Step-by-step BoW Implementation
Before we dive into this section, keep in mind that there are libraries that implement BoW for you. The aim of the step-by-step implementation is to explain each step of the BoW model. Okay, let's start!
We are going to vectorize a sample corpus into numeric vectors using Python, step by step. There are other ways to vectorize text, such as scikit-learn's CountVectorizer. For your information, a link to the complete notebook is added at the end of this article.
- Tokenize corpus
- Standardize corpus
- Build vocabulary
- Encode text
Tokenize corpus
Take the following data as the sample corpus; it contains two sentences. First, the text needs splitting into tokens at the character, word, or n-gram level. This process is called tokenization. In this case, the corpus will be tokenized at the word level.
# Make a corpus
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]
Standardize Corpus
Then the text needs normalizing: the data should be standardized by converting to lower or upper case, removing punctuation, and applying stemming or lemmatization. In this case, we will convert the texts to lowercase and discard the punctuation.
# Standardize text data
corpus = [
    "jupiter is the gas giant planet.",
    "neptune is the ice giant planet."
]

# Tokenize text data, remove punctuations
corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
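The two steps above can be sketched in plain Python. This is a minimal version, assuming whitespace splitting and `string.punctuation` for stripping, rather than the notebook's exact code:

```python
import string

corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]

def standardize(text):
    # Lowercase and strip punctuation characters
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Word-level tokenization by splitting on whitespace
    return text.split()

tokenized_corpus = [tokenize(standardize(document)) for document in corpus]
print(tokenized_corpus[0])
# ['jupiter', 'is', 'the', 'gas', 'giant', 'planet']
```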
Build Vocabulary
We need to create our own dictionary so that we can encode our texts into vectors of numeric values. Here is the code.
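A minimal sketch of the idea (the notebook has the full version), assuming the tokenized corpus from the previous step:

```python
# Tokenized corpus from the previous step
tokenized_corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]

# Map each unique word to a fixed column index, in order of first appearance
vocabulary = {}
for document in tokenized_corpus:
    for token in document:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

print(vocabulary)
# {'jupiter': 0, 'is': 1, 'the': 2, 'gas': 3, 'giant': 4, 'planet': 5, 'neptune': 6, 'ice': 7}
```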
Encode Text
Now, we can encode our texts into numeric vectors with the help of the dictionary. Let’s look at the following code.
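One possible implementation (the exact code is in the notebook). Note how words missing from the vocabulary, like venus and hottest from the earlier figure, are silently dropped:

```python
vocabulary = {'jupiter': 0, 'is': 1, 'the': 2, 'gas': 3,
              'giant': 4, 'planet': 5, 'neptune': 6, 'ice': 7}

def encode(tokens, vocabulary):
    # One column per vocabulary word, holding that word's count
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:          # unknown words are dropped
            vector[vocabulary[token]] += 1
    return vector

print(encode(["venus", "is", "the", "hottest", "planet"], vocabulary))
# [0, 1, 1, 0, 0, 1, 0, 0]
```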
That's it! Now we can encode a text corpus with the BoW model. scikit-learn's CountVectorizer achieves the same with fewer lines of code, and you can check it in the attached notebook.
TF-IDF
The BoW model only counts the frequencies of terms in each document: the more often a term appears, the more important it is considered. But this can be misleading, because words such as the, a, and is appear many times within a single document without making it more meaningful or important.
TF-IDF has two parts: term frequency and inverse document frequency. The term frequency (TF) part is the same as the BoW model: it computes the frequency, and hence importance, of terms in a specific document. It is then weighted by the IDF value, which measures the importance of a term across all documents in the corpus.
Let's check the corpus below and its TF-IDF vectors. The words is, the, giant, and planet appear in every document in the corpus. They are so common that they can no longer make any document stand out. However, the words jupiter and gas are rare and make the first document unique. Likewise, the words neptune and ice distinguish the second document, since they do not appear in the first.
# Recall the corpus
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet."
]

# TFIDF values
        gas  giant       ice   is   jupiter   neptune  planet  the
0  0.067578    0.0  0.000000  0.0  0.067578  0.000000     0.0  0.0
1  0.000000    0.0  0.067578  0.0  0.000000  0.067578     0.0  0.0
Therefore, TF-IDF computes the importance of terms within each document and weights it by the importance of those terms across all documents. The rarer the term, the more interesting it is. Here are the equations.
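In standard notation, with $N$ the number of documents in the corpus and $n_t$ the number of documents containing term $t$:

$$\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of words in } d}$$

$$\mathrm{idf}(t) = \log\frac{N}{n_t}$$

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

The table above appears to use the smoothed variant $\mathrm{idf}(t) = \log\frac{1+N}{1+n_t}$: for example, $(1/6)\cdot\log(3/2) \approx 0.067578$, which matches the value for jupiter in the first document.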
As an example, take a corpus of 5 documents, each with a 5-word vocabulary. According to the tables below, the value of the term a gets larger the more times it appears in a single document, D1. That is how BoW works. But the IDF value of the term a gets smaller the more documents it is found in. This is how TF-IDF normalizes the value of the term a: by multiplying the BoW count (TF) by the IDF value.
TF-IDF Implementation
Once again, there are efficient implementations such as scikit-learn's TfidfVectorizer that we can use directly. But we can still explore every step of the TF-IDF model in this section.
So, we are about to implement TF-IDF in Python with the following steps.
- Prepare, Standardize and Tokenize Texts
- Term Frequency
- Inverse Document Frequency
- TF * IDF
- Encode Texts
Prepare, Standardize and Tokenize Texts
This step is the same as in the BoW model: after collecting a corpus, we standardize it, tokenize it, and build the dictionary.
Term Frequency
Term frequency indicates how important a term is within a specific document. It is calculated as the frequency of the term in the document divided by the total number of words in that document.
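A minimal sketch of this definition (the notebook's code may differ):

```python
def term_frequency(term, document):
    # Count of the term divided by the document's total word count
    return document.count(term) / len(document)

document = ["jupiter", "is", "the", "gas", "giant", "planet"]
print(term_frequency("jupiter", document))
# 0.16666666666666666
```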
Inverse Document Frequency
IDF indicates the importance of terms across documents. If a term appears in every document, it can be considered unimportant compared to the rare words that make documents stand out. IDF is the logarithm of the total number of documents in the corpus divided by the number of documents containing the specific term.
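A sketch using the smoothed variant log((1 + N) / (1 + n_t)), which reproduces the values in the TFIDF table shown earlier; the plain log(N / n_t) form is also common:

```python
import math

def inverse_document_frequency(term, corpus):
    # Number of documents in which the term appears
    n_t = sum(1 for document in corpus if term in document)
    # Smoothed IDF: log((1 + N) / (1 + n_t))
    return math.log((1 + len(corpus)) / (1 + n_t))

corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
print(inverse_document_frequency("jupiter", corpus))  # rare word -> positive weight
print(inverse_document_frequency("giant", corpus))    # appears everywhere -> 0.0
```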
TF * IDF
Now, the TF-IDF value for each term can be computed thanks to the TF and IDF functions implemented above. TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF). It can be implemented as follows.
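Putting the two together, a self-contained sketch that repeats the helpers, again using the smoothed IDF so that the result matches the table above:

```python
import math

def term_frequency(term, document):
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    n_t = sum(1 for document in corpus if term in document)
    return math.log((1 + len(corpus)) / (1 + n_t))  # smoothed IDF

def tf_idf(term, document, corpus):
    # TF-IDF is simply the product of the two statistics
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
print(round(tf_idf("jupiter", corpus[0], corpus), 6))
# 0.067578
```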
Encode Text
Now, we can encode our texts into numeric vectors with the help of the above function. Let’s look at the following code.
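One possible way to encode the whole corpus, repeated here as a self-contained sketch, with an alphabetically sorted vocabulary to match the table's column order:

```python
import math

def term_frequency(term, document):
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    n_t = sum(1 for document in corpus if term in document)
    return math.log((1 + len(corpus)) / (1 + n_t))  # smoothed IDF

def encode_tfidf(corpus, vocabulary):
    # One row per document, one column per vocabulary word
    return [
        [round(term_frequency(term, document)
               * inverse_document_frequency(term, corpus), 6)
         for term in vocabulary]
        for document in corpus
    ]

corpus = [
    ["jupiter", "is", "the", "gas", "giant", "planet"],
    ["neptune", "is", "the", "ice", "giant", "planet"]
]
vocabulary = sorted({token for document in corpus for token in document})

print(vocabulary)
# ['gas', 'giant', 'ice', 'is', 'jupiter', 'neptune', 'planet', 'the']
print(encode_tfidf(corpus, vocabulary)[0])
# [0.067578, 0.0, 0.0, 0.0, 0.067578, 0.0, 0.0, 0.0]
```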
Note that our TF-IDF vectors will differ from others', especially scikit-learn's TfidfVectorizer, because it uses a different smoothing approach and also modifies the IDF equation so that common words are not entirely ignored. This can be checked here.
Recap
There are several ways to vectorize text into numeric tensors, including BoW, TF-IDF, and SOTA approaches such as word embeddings that were not covered in this article. BoW and TF-IDF are still worth knowing as the hello-world approaches to feature extraction for text problems.
Yes, this is the end of the article. I hope you can now vectorize your texts for your machine learning problems. You can also access the following notebook. Thanks for your time, and please feel free to leave any suggestions. Good luck!
References
- Deep Learning with Python, François Chollet
- https://machinelearningmastery.com/gentle-introduction-bag-words-model/
- https://en.wikipedia.org/wiki/Bag-of-words_model
- https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf