Naive Bayes Classification
1 Summary
Naive Bayes is a modeling assumption used in classification: the observed features are assumed to be conditionally independent of one another given the class assignment. Despite its name, the standard naive Bayes model does not use Bayesian inference; its parameters are typically fitted by maximum likelihood estimation.
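To make the summary concrete, here is a minimal sketch (not from the source) of that counting-based maximum likelihood estimation, assuming categorical features; the function name `fit_naive_bayes` and the data layout are illustrative only.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Maximum likelihood estimation of naive Bayes parameters by counting.

    X: list of feature tuples (categorical values), y: list of class labels.
    Returns class priors p(C_k) and per-feature likelihoods p(x_i = v | C_k).
    """
    n_samples = len(y)
    class_counts = Counter(y)
    # MLE of the prior p(C_k): relative frequency of class k in the training set.
    priors = {c: count / n_samples for c, count in class_counts.items()}

    # MLE of p(x_i = v | C_k): among class-k examples, the fraction whose
    # i-th feature takes the value v.
    value_counts = defaultdict(Counter)   # (class, feature index) -> Counter of values
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    likelihoods = {
        (c, i): {v: n / class_counts[c] for v, n in counts.items()}
        for (c, i), counts in value_counts.items()
    }
    return priors, likelihoods
```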
2 Probabilistic model
Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector \(\mathbf{x} = (x_1, \dots, x_n)\) encoding some n features (independent variables), it assigns to this instance the probabilities
\[ p(C_k \mid x_1, \dots, x_n) \]
for each of K possible outcomes or classes \(C_k\).
The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible: with n binary features, for example, a full joint table would require on the order of \(2^n\) entries per class. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as
\[ p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})} \]
In plain English, using Bayesian probability terminology, the above equation can be written as
\[ \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \]
In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on \(C\) and the values of the features \(x_i\) are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model
\[ p(C_k, x_1, \dots, x_n), \]
which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:
\[ \begin{aligned} p(C_k, x_1, \dots, x_n) &= p(x_1, \dots, x_n, C_k) \\ &= p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2, \dots, x_n, C_k) \\ &= p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k)\, p(x_3, \dots, x_n, C_k) \\ &= \dots \\ &= p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k) \end{aligned} \]
Now the "naive" conditional independence assumptions come into play: assume that each feature \(x_i\) is conditionally independent of every other feature \(x_j\) for \(j \neq i\), given the category \(C_k\). This means that
\[ p(x_i \mid x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k). \]
Thus, the joint model can be expressed as
\[ \begin{aligned} p(C_k \mid x_1, \dots, x_n) &\varpropto p(C_k, x_1, \dots, x_n) \\ &\varpropto p(C_k)\, p(x_1 \mid C_k)\, p(x_2 \mid C_k)\, p(x_3 \mid C_k) \cdots \\ &\varpropto p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k), \end{aligned} \]
where \(\varpropto\) denotes proportionality.
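A classifier built from this proportionality only needs the unnormalized product \(p(C_k) \prod_{i} p(x_i \mid C_k)\). Continuing the hypothetical sketch from the Summary, the function below evaluates that product in log space; the log-space trick is a standard numerical precaution against underflow and is assumed here, not taken from the derivation above.

```python
import math

def unnormalized_log_posterior(x, priors, likelihoods):
    """log( p(C_k) * prod_i p(x_i | C_k) ) for every class k.

    Uses the priors/likelihoods dictionaries returned by fit_naive_bayes.
    A feature value never seen with class k gets probability 0 (log = -inf);
    smoothing is deliberately omitted to keep the sketch minimal.
    """
    scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for i, value in enumerate(x):
            p = likelihoods.get((c, i), {}).get(value, 0.0)
            log_score += math.log(p) if p > 0 else float("-inf")
        scores[c] = log_score
    return scores
```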
This means that under the above independence assumptions, the conditional distribution over the class variable \(C\) is:
\[ p(C_k \mid x_1, \dots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) \]
where the evidence
\[ Z = p(\mathbf{x}) = \sum_k p(C_k)\, p(\mathbf{x} \mid C_k) \]
is a scaling factor dependent only on \(x_1, \dots, x_n\), that is, a constant if the values of the feature variables are known.
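To recover the normalized distribution, the evidence Z is obtained by summing the unnormalized scores over all classes. The sketch below exponentiates the log scores from the previous snippet (shifting by the maximum is an assumed numerical convenience; the shift cancels in the ratio) and divides by Z. The toy data at the end are purely hypothetical.

```python
def posterior(x, priors, likelihoods):
    """Normalize by Z = sum_k p(C_k) p(x | C_k) to obtain p(C_k | x)."""
    log_scores = unnormalized_log_posterior(x, priors, likelihoods)
    m = max(log_scores.values())                      # shift before exponentiating
    unnormalized = {c: math.exp(s - m) for c, s in log_scores.items()}
    Z = sum(unnormalized.values())                    # proportional to the evidence; the shift cancels below
    return {c: v / Z for c, v in unnormalized.items()}

# Hypothetical toy usage: two binary features, two classes.
X = [(1, 0), (1, 1), (0, 0), (0, 1)]
y = ["spam", "spam", "ham", "ham"]
priors, likelihoods = fit_naive_bayes(X, y)
print(posterior((1, 0), priors, likelihoods))         # -> {'spam': 1.0, 'ham': 0.0}
```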