feature_extraction.CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

Usage

import { CountVectorizer } from 'machinelearn/feature_extraction';

const cv = new CountVectorizer();
const corpus = ['deep learning ian good fellow learning jason shin shin', 'yoshua bengio'];
const vocabCounts = cv.fit_transform(corpus);
console.log(vocabCounts); // [ [ 0, 1, 1, 1, 1, 1, 2, 2, 0 ], [ 1, 0, 0, 0, 0, 0, 0, 0, 1 ] ]
console.log(cv.vocabulary); // { bengio: 0, deep: 1, fellow: 2, good: 3, ian: 4, jason: 5, learning: 6, shin: 7, yoshua: 8 }
console.log(cv.getFeatureNames()); // [ 'bengio', 'deep', 'fellow', 'good', 'ian', 'jason', 'learning', 'shin', 'yoshua' ]

const newVocabCounts = cv.transform(['ian good fellow jason duuog']);
console.log(newVocabCounts); // [ [ 0, 0, 1, 1, 1, 1, 0, 0, 0 ] ]

Properties


▸ vocabulary

Defined in feature_extraction/text.ts:26

Methods


λ fit

Learn a vocabulary dictionary of all tokens in the raw documents.

Defined in feature_extraction/text.ts:35

Parameters:

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | string[] | null | An array of strings |

Returns:

this
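
To illustrate what learning a vocabulary dictionary amounts to, here is a minimal standalone sketch. It is not the library's implementation: `buildVocabulary` is a hypothetical helper, and it assumes tokens are whitespace-separated, lowercased, and indexed in alphabetical order, which is consistent with the vocabulary shown in the usage example.

```typescript
// Hypothetical helper (not part of machinelearn): maps each distinct token to an index.
function buildVocabulary(docs: string[]): Record<string, number> {
  const seen: Record<string, boolean> = {};
  for (const doc of docs) {
    // Assumption: tokens are lowercased and separated by whitespace.
    for (const token of doc.toLowerCase().split(/\s+/)) {
      if (token.length > 0) {
        seen[token] = true;
      }
    }
  }
  const vocabulary: Record<string, number> = {};
  // Assumption: indices follow alphabetical order of the distinct tokens.
  Object.keys(seen)
    .sort()
    .forEach((token, index) => {
      vocabulary[token] = index;
    });
  return vocabulary;
}
```

Applied to the corpus from the usage example, this sketch assigns `bengio` index 0 and `yoshua` index 8.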

λ fit_transform

Fits the vocabulary to the given documents and returns their document-term count matrix, as shown in the usage example.

Defined in feature_extraction/text.ts:46

Parameters:

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | string[] | null | An array of strings |

Returns:

number[][]

λ getFeatureNames

Returns an array mapping feature integer indices to feature names.

Defined in feature_extraction/text.ts:70

Returns:

object
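
The index-to-name mapping can be pictured as the inverse of the vocabulary dictionary. The sketch below is an illustration under that assumption, not the library's code; `featureNamesFromVocabulary` is a hypothetical helper name.

```typescript
// Hypothetical helper (not part of machinelearn): inverts a token -> index map
// into an array where position i holds the token with index i.
function featureNamesFromVocabulary(vocabulary: Record<string, number>): string[] {
  const names: string[] = [];
  for (const token of Object.keys(vocabulary)) {
    names[vocabulary[token]] = token;
  }
  return names;
}
```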

λ transform

Transforms documents into a document-term matrix. Token counts are extracted from the raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Defined in feature_extraction/text.ts:61

Parameters:

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | string[] | null | An array of strings |

Returns:

number[][]
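
As a rough picture of how a document-term matrix is produced from a fixed vocabulary, consider the standalone sketch below. It is not the library's implementation: `countWithVocabulary` is a hypothetical helper, and it assumes whitespace tokenization with lowercasing. Tokens missing from the vocabulary (such as `duuog` in the usage example) are simply skipped.

```typescript
// Hypothetical helper (not part of machinelearn): counts vocabulary tokens per document.
function countWithVocabulary(
  docs: string[],
  vocabulary: Record<string, number>
): number[][] {
  const size = Object.keys(vocabulary).length;
  return docs.map((doc) => {
    // One row per document, one column per vocabulary entry, initialised to zero.
    const row: number[] = [];
    for (let i = 0; i < size; i++) {
      row.push(0);
    }
    // Assumption: tokens are lowercased and separated by whitespace.
    for (const token of doc.toLowerCase().split(/\s+/)) {
      // Tokens that were never seen during fitting are ignored.
      if (Object.prototype.hasOwnProperty.call(vocabulary, token)) {
        row[vocabulary[token]] += 1;
      }
    }
    return row;
  });
}
```

Each row has one column per vocabulary entry, so every document is encoded with the same width regardless of its length.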