Hello-World to Text Vectorization for ML problems


Contents

  1. Bag-of-words
  2. Step-by-step BoW Implementation
  3. TF-IDF
  4. TF-IDF Implementation
  5. Recap

Bag-of-words

Machine learning algorithms cannot take raw text as input directly, so text must be transformed into numeric tensors. This process is known as vectorization or feature extraction. This section explains Bag-of-words (BoW), one of the simplest text representations.

Fig 1: Bag-of-words representation
Fig 2: Dictionary
Fig 3: BoW representation for the new text
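
The figures above illustrate the core idea: build a dictionary that maps every known word to an index, then represent any new text as a vector of word counts over that dictionary. As a rough sketch of what the figures depict (the dictionary and new_text below are assumed for illustration):

# A toy dictionary mapping each known word to an index (assumed example)
dictionary = {"jupiter": 0, "is": 1, "the": 2, "gas": 3, "giant": 4, "planet": 5}

# Encode a new text as a vector of word counts over the dictionary
new_text = "the giant planet is the gas planet"
vector = [0] * len(dictionary)
for word in new_text.split():
    if word in dictionary:  # out-of-vocabulary words are ignored
        vector[dictionary[word]] += 1
print(vector)  # [0, 1, 2, 1, 1, 2]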

Step-by-step BoW Implementation

Before we dive in, keep in mind that libraries such as scikit-learn already provide efficient BoW implementations (e.g. CountVectorizer). The aim of this step-by-step implementation is to explain each stage of the BoW model; the first two steps are shown in the code below, and the remaining two are sketched after it. Okay, let's start!

  1. Standardize corpus
  2. Tokenize corpus
  3. Build vocabulary
  4. Encode text
# Make a corpus
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet.",
]

# Standardize text data: lowercase every document
corpus = [doc.lower() for doc in corpus]
# -> ["jupiter is the gas giant planet.", "neptune is the ice giant planet."]

# Tokenize text data and strip punctuation
corpus = [
    [token.strip(".,!?") for token in doc.split()]
    for doc in corpus
]
# -> [["jupiter", "is", "the", "gas", "giant", "planet"],
#     ["neptune", "is", "the", "ice", "giant", "planet"]]
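
To finish steps 3 and 4, here is a minimal sketch that continues from the tokenized corpus above (the variable names are mine):

# Build vocabulary: collect every unique token and map it to an index,
# mirroring the dictionary idea from Fig 2
vocab = sorted({token for doc in corpus for token in doc})
word_index = {word: i for i, word in enumerate(vocab)}
# -> {'gas': 0, 'giant': 1, 'ice': 2, 'is': 3, 'jupiter': 4,
#     'neptune': 5, 'planet': 6, 'the': 7}

# Encode text: one count vector per document, ordered by the vocabulary
bow_vectors = [[doc.count(word) for word in vocab] for doc in corpus]
# -> [[1, 1, 0, 1, 1, 0, 1, 1],
#     [0, 1, 1, 1, 0, 1, 1, 1]]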

TF-IDF

The BoW model simply counts how often each term appears in a document: the more frequently a term occurs, the more important it is assumed to be. This can be misleading, because words such as the, a, and is appear many times in a document without making it more meaningful. TF-IDF (term frequency, inverse document frequency) corrects for this by down-weighting terms that appear in many documents of the corpus.

Fig 4: TFIDF Representation for the new text
# Recall the corpus
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet.",
]
# TFIDF values (one row per document)
        gas  giant       ice   is   jupiter   neptune  planet  the
0  0.067578    0.0  0.000000  0.0  0.067578  0.000000     0.0  0.0
1  0.000000    0.0  0.067578  0.0  0.000000  0.067578     0.0  0.0
The TFIDF equations used here, where N is the number of documents in the corpus and DF(t) is the number of documents containing term t:

TF(t, d) = (count of t in d) / (number of terms in d)
IDF(t) = ln((1 + N) / (1 + DF(t)))
TFIDF(t, d) = TF(t, d) × IDF(t)

The smoothed IDF reproduces the table above: for gas in document 0, TF = 1/6 and IDF = ln(3/2), so TFIDF ≈ 0.067578.
Fig 5: BoW and TFIDF examples
Fig 6: Comparison of BoW and TFIDF

TF-IDF Implementation

Once again, there are efficient, ready-made ways to vectorize text, such as scikit-learn's TfidfVectorizer, shown briefly below. Even so, it is worth walking through every step of the TFIDF model; a full sketch follows the step list.
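
A quick look at the library route. Note that TfidfVectorizer's defaults differ from the hand-rolled equations above (it uses raw counts for TF, adds 1 to the IDF, and L2-normalizes each row), so its numbers will not match the table:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet.",
]

vectorizer = TfidfVectorizer()         # lowercasing and tokenization built in
X = vectorizer.fit_transform(corpus)   # sparse matrix, one row per document
print(vectorizer.get_feature_names_out())
print(X.toarray())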

  1. Prepare, Standardize and Tokenize Texts
  2. Term Frequency
  3. Inverse Document Frequency
  4. TF * IDF
  5. Encode Texts
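
A minimal sketch of these five steps, reusing the two-planet corpus and the smoothed-IDF equations above (the function and variable names are mine):

import math

# 1. Prepare, standardize, and tokenize texts
corpus = [
    "Jupiter is the gas giant planet.",
    "Neptune is the ice giant planet.",
]
docs = [[t.strip(".,!?") for t in doc.lower().split()] for doc in corpus]
vocab = sorted({token for doc in docs for token in doc})

# 2. Term frequency: how often the term occurs in d, relative to d's length
def tf(term, doc):
    return doc.count(term) / len(doc)

# 3. Smoothed inverse document frequency
def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df))

# 4 & 5. Multiply TF by IDF and encode each document as a vector
tfidf_vectors = [[tf(t, doc) * idf(t, docs) for t in vocab] for doc in docs]
# 'gas' in document 0: (1/6) * ln(3/2) ≈ 0.067578, matching the table above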

Recap

There are several ways to vectorize text into numeric tensors, including BoW, TFIDF, and state-of-the-art approaches such as word embeddings, which this article did not cover. BoW and TFIDF are still worth knowing as the hello-world approaches to feature extraction for text problems.
