In this article, let us look at:

  1. CountVectorizer
  2. Example of how CountVectorizer works
  3. Why the sparse matrix format?
  4. Dataset and imports
  5. CountVectorizer plain and simple
  6. CountVectorizer and stop words
  7. Custom tokenization
  8. Custom preprocessing
  9. CountVectorizer n-grams
  10. Limiting vocabulary size
  11. Ignore counts and use binary values
  12. Using CountVectorizer to extract n-gram or term counts

1. CountVectorizer

CountVectorizer performs feature extraction: it converts text data into a numeric vector representation. Each document in the text data becomes one row of a matrix of token counts.

2. Example of how CountVectorizer works

CountVectorizer works by converting a book's title into a sparse word-count representation. It helps to compare how you might visually imagine that matrix with how it is represented in practice.

The example document is the title of a book about money that kids love:

doc = ["one cent, Two cents, old cent, New cent: All about Money"]

This title contains 9 distinct words, which are represented in 9 columns; each column stands for one word in the vocabulary. Each row represents one document in the dataset, so the single book title occupies one row. The value in each cell is the word count, and a cell is 0 when the word does not occur in that document. Although it is convenient to picture the result as the dense word matrix described above, internally the counts are stored by positional index in a sparse matrix.
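This example can be sketched directly with scikit-learn (assuming scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The single example document: one book title
doc = ["one cent, Two cents, old cent, New cent: All about Money"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(doc)

# 9 distinct words -> 9 columns; "cent" occurs 3 times in the single row
print(sorted(vectorizer.vocabulary_))
# ['about', 'all', 'cent', 'cents', 'money', 'new', 'old', 'one', 'two']
print(matrix.toarray())
# [[1 1 3 1 1 1 1 1 1]]
```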

3. Why the sparse matrix format?

CountVectorizer turns raw text into numeric vectors of word and n-gram counts, and most of those counts are zero: each document contains only a small fraction of the full vocabulary. The sparse matrix format stores only the non-zero entries, which keeps memory usage manageable. This numeric representation is what machine learning algorithms need for tasks such as classification and clustering, since these algorithms understand numerical data, whether the underlying inputs are texts, image pixels, numbers, or categories.
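A small sketch of the sparse storage (SciPy is a scikit-learn dependency):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["one cent two cents", "old cent new cent"]
matrix = CountVectorizer().fit_transform(docs)

# fit_transform returns a SciPy sparse matrix: only non-zero cells are stored
print(issparse(matrix))   # True
print(matrix.nnz)         # 7 non-zero cells out of 2 x 6 = 12
print(matrix.toarray())   # dense view, only practical for small corpora
```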

4. Dataset and imports

Short texts are used throughout the examples in this article: each document is made from a book title. Short texts like these are enough to demonstrate how CountVectorizer behaves in your applications.
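As a sketch, the corpus can be a handful of short children's book titles (illustrative placeholder data), and only one import is needed:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five short book titles serve as the documents (illustrative data)
titles = [
    "One Cent, Two Cents, Old Cent, New Cent: All About Money",
    "Inside Your Outside: All About the Human Body",
    "Oh, The Things You Can Do That Are Good for You",
    "On Beyond Bugs: All About Insects",
    "There's No Place Like Space: All About Our Solar System",
]
print(len(titles))  # 5 documents, one per title
```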

5. Countvectorizer plain and simple

The 5 book titles are preprocessed, tokenized, and represented as a sparse matrix, as illustrated in the introduction. The imported CountVectorizer has the following default behavior:

  • lowercases all text (set lowercase=False if you do not want lower casing)
  • applies utf-8 decoding
  • performs tokenization, converting raw text into smaller units of text
  • tokenizes at the word level, so each word becomes a separate token
  • ignores single characters during tokenization

6. Countvectorizer and stop words

You may want to remove stop words, as they have little predictive power and are rarely helpful in text classification. CountVectorizer offers two ways to do this:

  • apply a custom stop word list
  • generate corpus-specific stop words using max_df and min_df

The shape of the document-term matrix shrinks when the words on the stop list are removed.

min_df eliminates terms of low prominence. For example, a person's name that appears in only 1 or 2 documents may be considered irrelevant or unnecessary noise and can be excluded from further analysis.

max_df sets a threshold on how frequently a term may appear across the documents. With a max_df of 0.85, a term is removed from further consideration if it appears in more than 85% of the documents.
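A sketch of the options (the built-in English list, a custom list, and corpus-specific min_df/max_df), again on the illustrative titles:

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "One Cent, Two Cents, Old Cent, New Cent: All About Money",
    "Inside Your Outside: All About the Human Body",
    "Oh, The Things You Can Do That Are Good for You",
    "On Beyond Bugs: All About Insects",
    "There's No Place Like Space: All About Our Solar System",
]

# Built-in English stop word list
cv_en = CountVectorizer(stop_words="english")
cv_en.fit(titles)
print("the" in cv_en.vocabulary_)   # False: "the" is on the stop list

# Custom stop word list
cv_custom = CountVectorizer(stop_words=["all", "about", "the"])

# Corpus-specific: drop terms in fewer than 2 docs or in over 85% of docs
cv_df = CountVectorizer(min_df=2, max_df=0.85)
cv_df.fit(titles)
print("cent" in cv_df.vocabulary_)   # False: appears in only 1 document
print("about" in cv_df.vocabulary_)  # True: in 4 of 5 documents (0.8 <= 0.85)
```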

7. Custom tokenization

Default tokenization removes special characters such as punctuation and discards single characters. To retain these characters, CountVectorizer accepts a custom tokenizer.
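A minimal sketch of a custom tokenizer; keep_apostrophes is a hypothetical function that preserves in-word apostrophes, and token_pattern=None silences the unused-pattern warning:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tokenizer: keep in-word apostrophes, e.g. "there's"
# (the text is already lowercased by the default preprocessor)
def keep_apostrophes(text):
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)

cv = CountVectorizer(tokenizer=keep_apostrophes, token_pattern=None)
cv.fit(["There's No Place Like Space: All About Our Solar System"])
print("there's" in cv.vocabulary_)  # True: the apostrophe survives
```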

8. Custom preprocessing

The text is preprocessed before the sparse matrix of terms is generated. Preprocessing helps eliminate noise and unwanted variation, reduces sparsity, and leads to more accurate analysis.
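A sketch of a custom preprocessor; clean_text is a hypothetical function that strips digits and punctuation before tokenization (note that supplying a preprocessor replaces the default one, so it must lowercase the text itself):

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessor: lowercase, then drop digits and punctuation
def clean_text(text):
    text = text.lower()
    return re.sub(r"[^a-z\s]", " ", text)

cv = CountVectorizer(preprocessor=clean_text)
cv.fit(["One Cent!!! 123 Money..."])
print(sorted(cv.vocabulary_))  # ['cent', 'money', 'one']
```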

9. Count vectorizer n-grams

A common way to improve features for text classification is to use n-grams with n > 1, since unigrams do not capture contextual information as well as bigrams or trigrams. This also matters for keyword extraction, where the extracted text should be meaningful: a phrase like "nutritious food" conveys something that the separate words "food" and "nutrition", observed in isolation, leave ambiguous.
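A sketch with ngram_range, showing that the bigram "nutritious food" becomes its own feature alongside the unigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts unigrams and bigrams together
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(["nutritious food for kids"])
print(sorted(cv.vocabulary_))
# ['food', 'food for', 'for', 'for kids', 'kids', 'nutritious', 'nutritious food']
```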

10. Limiting vocabulary size

Vocabulary size can be limited when the feature space of words or n-grams becomes too large. If you set the maximum to 10,000 terms, CountVectorizer keeps the 10,000 most frequent terms and eliminates the rest.

11. Ignore counts and use binary values

By default, CountVectorizer uses raw counts, but you can choose term presence/absence instead. This is beneficial for certain text classification tasks where the magnitude of occurrence is not significant. The presence of a token in a document is represented as 1 and its absence as 0, irrespective of how many times it occurs.
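A sketch with binary=True, reusing the money title where "cent" occurs 3 times:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["one cent, Two cents, old cent, New cent: All about Money"]

cv = CountVectorizer(binary=True)
matrix = cv.fit_transform(doc)

# "cent" occurs 3 times, but with binary=True the cell is capped at 1
print(matrix.toarray())  # [[1 1 1 1 1 1 1 1 1]]
```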

12. Using count vectorizer to extract N-Gram or term counts

The counts are summed over all documents, and each feature name is extracted and paired with its corresponding total count. Using bigrams or trigrams rather than unigrams here helps the extracted terms carry contextual information.

Conclusion

CountVectorizer is an efficient way to extract and represent features from text data. It gives you control over n-gram size, custom preprocessing, and custom tokenization, along with stop word removal and the use of a specific or limited vocabulary. It produces plain word counts as features; alternative schemes like TF-IDF can be used to weight the features instead, and for certain applications a binary representation is more effective than counts. Embedding models such as word2vec, BERT, and ELMo likewise encode a unit of text as a fixed-length vector. In every case the goal is the same: algorithms understand numerical data, so turning text into numbers is what makes it usable in machine learning.

There are no right or wrong ways of learning AI and ML technologies – the more, the better! These valuable resources can be the starting point for your journey on how to learn Artificial Intelligence and Machine Learning. Does pursuing AI and ML interest you? If you want to step into the world of emerging tech, you can accelerate your career with these Machine Learning and AI courses by Jigsaw Academy.
