
As a second classifier for this task, Multinomial Naive Bayes (MNB) was chosen
for the comparison. Interestingly, even though MNB is considerably simpler than
many other available classifiers, it has been used in many projects and documented
in many research papers as a classifier that gives relatively good results while
remaining efficient and computationally inexpensive [11, 28].
MNB is based on the Bayes assumption, which considers all the features to
be independent of each other. Even though this is incorrect and the features
are in fact dependent, the assumption has been shown to work well in practice
without any good explanation [11]; that is why it is called "naive". In order to
understand how Naive Bayes classifiers work (MNB is only one type of them), it is
necessary to understand Bayes' theorem:

$$P(w_j \mid x_i) = \frac{P(x_i \mid w_j)\,P(w_j)}{P(x_i)}$$

What we are trying to calculate is the probability of having $w_j$ when $x_i$
happened (the posterior probability), where $w_j$ is a class ($j = 1, 2, \ldots, m$)
and $x$ is a vector of features ($i = 1, 2, \ldots, n$) [28]. $P(x_i \mid w_j)$ is
the probability of having $x_i$ given that it belongs to the class $w_j$ (the
conditional probability), $P(w_j)$ is the probability of the class (the prior
probability), while $P(x_i)$ is the probability of the features independent of the
class (the evidence) [28]. With this formula, what we are trying to do is maximize
the posterior probability given the training data, so as to formulate the decision
rule [28].
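As a purely illustrative sketch (the numbers below are invented and do not come from the project data), the posterior for one class can be computed directly from a likelihood, a prior, and the evidence:

```python
# Minimal sketch of Bayes' theorem for one feature vector x and one class w_j.
# All numbers are invented for illustration only.

def posterior(likelihood, prior, evidence):
    """P(w_j | x) = P(x | w_j) * P(w_j) / P(x)."""
    return likelihood * prior / evidence

# Assumed toy values: P(x | w_1) = 0.20, P(w_1) = 0.6, P(x) = 0.17
print(posterior(0.20, 0.6, 0.17))  # -> 0.705..., the posterior P(w_1 | x)
```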
In order to calculate the conditional probability, we go back to the Bayes
assumption that the features are independent, that is, that a change in the
probability of one feature has no effect on the probability of any other
feature [28]. Under this assumption, we can calculate the conditional probabilities
of the samples from the training data. Given the vector $x$, we calculate the
conditional probability with the following formula:

$$P(x \mid w_j) = \prod_{i=1}^{n} P(x_i \mid w_j)$$

that is, by multiplying the probabilities of each feature given the class $w_j$.
Each of these probabilities is simply the frequency of that feature: the number of
times the feature appears in the class $w_j$ divided by the count of all the
features of that class [28].
The prior probability $P(w_j)$ refers to the "prior knowledge" we have about our
data: in general, how probable it is to encounter the class $w_j$ in our training
data. It is calculated by dividing the number of times $w_j$ appeared in our data
by the total count of class labels, i.e. the number of training samples [28].
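The following sketch illustrates how these prior and conditional probabilities could be estimated from raw counts; the tiny corpus and class labels are hypothetical and serve only to make the counting concrete:

```python
from collections import Counter

# Hypothetical toy corpus of (tokens, class label) pairs; the words and labels
# are invented for illustration only.
train = [
    (["good", "great", "film"], "pos"),
    (["great", "plot"], "pos"),
    (["bad", "boring", "film"], "neg"),
]

# Prior P(w_j): how often class w_j occurs among all training samples.
labels = [label for _, label in train]
priors = {c: labels.count(c) / len(labels) for c in set(labels)}

# Conditional P(x_i | w_j): count of feature x_i in class w_j divided by the
# total count of all features in that class.
feature_counts = {c: Counter() for c in set(labels)}
for tokens, label in train:
    feature_counts[label].update(tokens)

def conditional(token, cls):
    return feature_counts[cls][token] / sum(feature_counts[cls].values())

print(priors)                      # {'pos': 0.666..., 'neg': 0.333...}
print(conditional("film", "pos"))  # 1 / 5 = 0.2
```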
The probability of the evidence, $P(x)$, is the probability of a particular vector
of features appearing, independent of the class. In order to calculate the
evidence, we use the following formula:

$$P(x) = P(x \mid w_j)\,P(w_j) + P(x \mid \neg w_j)\,P(\neg w_j)$$

where $\neg w_j$ refers to the occurrences in which a particular set of features is
not encountered in the class $w_j$ [28].
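A minimal sketch of this computation, treating the data as split into $w_j$ and $\neg w_j$ and reusing the same invented toy values as in the earlier Bayes' theorem sketch:

```python
# Evidence term via the two-way split into w_j and "not w_j".
# Toy numbers are illustrative assumptions only.

def evidence(cond_in_class, prior, cond_not_in_class):
    """P(x) = P(x | w_j) P(w_j) + P(x | not w_j) P(not w_j)."""
    return cond_in_class * prior + cond_not_in_class * (1.0 - prior)

print(evidence(0.20, 0.6, 0.125))  # -> 0.17, the P(x) used in the earlier sketch
```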
When classifying text, we will usually encounter words in the test set that did
not appear in the training set. This would result in the class-conditional
probability being zero, which would in turn make the whole result zero, since the
class prior is multiplied by the conditional probability. This problem is resolved
by using a smoothing technique, in our case the Laplace correction: we add 1 to
the numerator and the size of our vocabulary $|V|$ to the denominator, as follows:

$$\hat{P}(x_i \mid w_j) = \frac{N_{x_i, w_j} + 1}{N_{w_j} + |V|}$$

where $N_{x_i, w_j}$ is the number of times feature $x_i$ appears in samples from
class $w_j$, while $N_{w_j}$ is the total count of all features in class $w_j$.
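The sketch below shows the Laplace-corrected estimate on the same hypothetical counts as before (the corpus and numbers are illustrative assumptions, not the project data):

```python
from collections import Counter

# Hypothetical feature counts per class (same toy corpus as in the earlier
# sketch); the vocabulary is the set of all distinct features seen in training.
feature_counts = {
    "pos": Counter({"great": 2, "good": 1, "film": 1, "plot": 1}),
    "neg": Counter({"bad": 1, "boring": 1, "film": 1}),
}
vocabulary = {"good", "great", "film", "plot", "bad", "boring"}

def smoothed_conditional(token, cls):
    """Laplace correction: (N_{x_i,w_j} + 1) / (N_{w_j} + |V|)."""
    n_token = feature_counts[cls][token]         # N_{x_i, w_j}
    n_class = sum(feature_counts[cls].values())  # N_{w_j}
    return (n_token + 1) / (n_class + len(vocabulary))

print(smoothed_conditional("awful", "pos"))  # unseen word: 1 / (5 + 6) ≈ 0.091
```

Off-the-shelf implementations expose this as a smoothing parameter; for example, scikit-learn's MultinomialNB has an alpha parameter, where alpha=1.0 corresponds to the Laplace correction described here.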