Logistic Regression and Naive Bayes

In this section we introduce the Logistic Regression and Naive Bayes classifiers. We discuss general guidelines for when to use each. Finally, we show how to implement Logistic Regression and Naive Bayes using SciKit-Learn. By the end of this section, you should be able to:

Describe how the logistic regression and Naive Bayes classifier models work at a high level
Describe when each model is most appropriate to use (and why)
Implement each using the SciKit-Learn package

Logistic Regression

In this section, we introduce the Logistic Regression model. As with the other methods in this unit, we will not cover all details but instead will give just a basic sense of the ideas involved.

The basic idea with Logistic Regression is to build upon the Linear Regression model with the goal of learning a probability distribution function that can be used for classification problems. Despite “regression” appearing in the name, logistic regression models are used for classification problems.

In Logistic Regression, we build a linear regression model and then pass the result through a “logistic” function. The logistic function has the form:

\[p(x) = \frac{1}{1 + e^{-k(x-x_0)}}\]

where \(k, x_0\) are constants/parameters with \(k>0\).

Note the following attributes of this function:

As \(x\to -\infty\), \(e^{-k(x-x_0)} \to \infty\) and thus \(p(x) \to 0\)
As \(x\to\infty\), \(e^{-k(x-x_0)} \to 0\) and thus \(p(x) \to 1\)
For \(x:= x_0\), \(e^{-k(x-x_0)} = 1\) and thus \(p(x_0) = 0.5\)

As a result, the logistic function can be thought of as mapping an arbitrary real number to a probability, i.e., a value between 0 and 1.

Example: Diabetes vs Glucose

For simplicity, let’s assume we have a binary classification problem with just one independent variable. For example, suppose we are trying to predict whether a person has diabetes based only on their glucose level. We know the higher a person’s glucose, the more likely they are to have diabetes.

The idea is that, in this case, we can model the probability that the individual has diabetes as a logistic function of their glucose level. It might look similar to the following:

Example logistic regression plot for diabetes vs glucose

All logistic functions have an “S shaped curve”, similar to the shape to the curve above. In logistic regression, the model learns a set of linear coefficients corresponding to each of the independent variables, just as in the case of linear regression.

As in the case of linear regression, we can define a loss function (or error function) and use it to define a cost function which we can then minimize using an algorithm such as gradient descent.

Logistic Regression in SciKit-Learn

The SciKit-Learn package provides the LogisticRegression class from the sklearn.linear_model module.

Let’s use this session to develop a logistic regression model for the cancer dataset we looked at in the hands-on lab.

We’ll begin by importing the required libraries, as usual:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_breast_cancer

And then load the data and create our train/test split:

>>> data = load_breast_cancer()
>>> X = data.data
>>> y = data.target

>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

We can now use the LogisticRegression model. We proceed in a similar way as with other models. We pass the following arguments to the LogisticRegression constructor:

max_iter=1000 – This is the maximum number of iterations the solver will use for converging. The default is 100, but here we pass 1000 to give it more time to converge.
random_state=1 – This is used to shuffle the data. (Same as with the SGD Linear Classifier)

We also introduce the convenience function classification_report from the sklearn.metrics module. This function produces a nice report of several measures we have looked at, including accuracy, recall, precision, and F1-score.

Keep in mind when reading the output of classification_report that the values for precision, recall, F1-score, and support are provided for all target class labels. This could cause confusion. We have defined these metrics essentially for the target class equal to 1. For simplicity, you can just ignore the class 0 values. In this context, “support” refers to how many samples are in each class.

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import classification_report

>>> # fit the LG model -- random_state is used to shuffle the data; max_iter is max # of iterations for solver to converge (default is 100)
>>> model = LogisticRegression(random_state=1, max_iter=1000).fit(X_train, y_train)

>>> # print the report
>>> print(f"Performance on TEST\n*******************\n{classification_report(y_test, model.predict(X_test))}")
>>> print(f"Performance on TRAIN\n********************\n{classification_report(y_train, model.predict(X_train))}")

Performance on TEST
*******************
              precision    recall  f1-score   support

          0       0.95      0.92      0.94        64
          1       0.95      0.97      0.96       107

    accuracy                           0.95       171
  macro avg       0.95      0.95      0.95       171
weighted avg       0.95      0.95      0.95       171

Performance on TRAIN
********************
              precision    recall  f1-score   support

          0       0.96      0.94      0.95       148
          1       0.96      0.98      0.97       250

    accuracy                           0.96       398
  macro avg       0.96      0.96      0.96       398
weighted avg       0.96      0.96      0.96       398

The performance we see on the cancer dataset is quite good, with:

Precision: 95% on test; 96% on train.
Recall: 97% on test; 98% on train.
F1-score: 96% on test; 97% on train.
Accuracy: 95% on test; 96% on train.

Additional Attributes of the `LogisticRegression` Model

The LogisticRegression model has properties that correspond to those in the LinearRegression model. For example:

coef_ – These are the coefficients of the linear model, one for each independent variable.
intercept_ – This is the y-intercept of the linear model.
decision_function() – This function computes the linear combination of the coefficients and intercept on the input value.

Examples:

>>> model.coef_
array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]], shape=(569, 30))

>>> model.intercept_
array([18.82533262])

>>> model.decision_function(X[0:30])
array([-37.50155217, -14.75203776, -17.05788567,   0.57214522,
       -11.22699773,  -1.45778081, -13.02507044,  -4.68089914,
        -2.62129355,  -6.41207649,  -6.23024417, -11.46582777,
       -12.27306369,   1.03957452,  -2.37747547,  -7.81454901,
        -7.76769613, -12.48215485, -34.0382094 ,   4.16659662,
         5.32451617,  11.01281779,  -4.67487841, -32.61813604,
       -34.72687955, -19.24160889,  -4.98111518, -11.77709786,
       -15.73240047,  -4.85708105])

>>> # Compute the dot product and add the intercept "by hand"
>>> # Note: output agrees with first output from decision function above.
>>> np.sum(model.coef_*X[0:1]) + model.intercept_
array([-37.50155217])

>>> # Predict the first 30 samples; note that the prediction agrees with the sign
>>> # of the decision function.
>>> model.predict(X[0:30])
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0])

Logistic Regression: Strengths and Weaknesses

Here we try to summarize the strengths and weaknesses of the Logistic Regression model. Keep in mind, these are general statement that tend to apply to most datasets.

Logistic Regression Strengths

Easy to understand and interpret: Logistic Regression models tend to be relatively easy to understand and interpret, as they produce probabilities that are foundational in statistics.
Overfitting is usually avoidable: A number of techniques, such as regularization, enable logistic regression models to avoid overfitting.

Logistic Regression Weaknesses

Cannot learn non-linear decision boundaries: Like all linear models, the big disadvantage here is that logistic regression models cannot learn non-linear decision boundaries. For many real-world datasets, this is a show stopper.
Required data preparation: Logistic regression requires significant data preparation to perform well, even in the best cases. Examples include normalization and scaling. These can be time consuming and computationally expensive on large datasets. We will look at some of these techniques in the next module.

Naive Bayes

The next type of ML model we will discuss is the Naive Bayes model.

This model is based on a simple (i.e., “naive”) assumption that that feature variables in the dataset are pair-wise conditionally independent, meaning that, given two variables, knowing the value of one variable does not provide any information about the value of the other. For example, the following pairs of variables could be considered independent:

student height and course grade

car color and car fuel efficiency

petal length and petal color

On the other hand, the following variables are unlikely to be independent:

petal length and stem length

student height and weight

car model and car fuel efficiency

Note

The above notion of conditional independence can be made into a mathematically precise definition, but we will not go into those details here.

Note that Naive Bayes may still be of some use even in cases where the assumption of independence may not hold.

The assumption of Naive Bayes allows us to write down a simple equation:

\[P(y| x_1, ..., x_n) \sim P(y) \prod_{i=1}^n P(x_i | y)\]

where the notation \(P(y| x)\) can be read as “the probability of y given x”. For a supervised learning classification problem, the \(y\) here represents some possible target class label.

Note that the left hand side of the equation is the thing we are trying to model in any machine learning problem. We usually don’t have an easy formula for it.

But this equation says that the probability of the thing we care about — i.e., the conditional probability of our dependent variable, \(y\), given the independent variables \(x_1, ..., x_n\) — is proportional to the the product of the individual conditional probabilities, \(P(x_i| y)\), and the probability of y itself. Those are much simpler objects to work with.

For example, thinking of y as some target class label, \(P(y)\) is then just the frequency of occurrences of that label in the training set, which is trivial to compute (just count up the number of occurrences and divide by the total size of the dataset).

Similarly, \(P(x_i|y)\) is just the frequency of occurrences of \(x_i\) when restricting to the subset of records with target label \(y\). When \(x_i\) is a categorical feature, this is straight-forward: it is literally just the fraction of occurrences in the subset of the rows of the dataset that have target class \(y\).

When \(x\) is a continuous variable, something more is needed — essentially we require a way of computing likelihoods for a continuous feature. That in turn requires some additional assumptions, for instance, that the continuous feature variables are sampled from a Gaussian (i.e., “normal”) distribution. With an assumption like that in place (and a little bit of Calculus), we can compute the probabilities.

Deriving all the equations is actually fairly involved and would take much more time than we want to spend on it, but hopefully this gives you a general sense of the ideas involved.

Types of Naive Bayes Models

There are several types of Naive Bayes Models. Here we mention just a few:

Gaussian Naive Bayes: Can be used for classification problems involving datasets with continuous variables. In addition to the “naive” assumption of conditional independence, the model makes the assumption that the continuous features are sampled from a Gaussian (i.e., normal) distribution.
Multinomial Naive Bayes: This model is good for discrete feature variables. It has found good use in text classification problems, where the goal is to classify an article by type (e.g., “Biology”, “Computer Science”, “Mathematics”) or sentiment analysis (e.g., classifying social media responses to advertisement campaigns as either “liking” or “not liking” the ad). In this case, the independent variables consist of word count vectors, i.e., the number of times a specific word occurs in the text.
Bernoulli Naive Bayes: This model assumes each feature is binary-valued (i.e., 0 or 1). Like Multinomial Naive Bayes, this model can be used on text classification problems. Instead of using word count vectors, word occurrence vectors are used,

All of these types and others are supported by the SciKit-Learn Naive Bayes classifier.

Naive Bayes in SciKit-Learn

Let’s return to our Iris dataset and use Naive Bayes. As with the previous models, the pattern will be similar. For expediency, we do not discuss in detail the data analysis and pre-processing. For details, see our original discussion of the Iris dataset in the linear classification section.

To begin, we import libraries, load and split the dataset:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split

>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We’ll use the GaussianNB class from the sklearn.naive_bayes module. This class implements a Gaussian Naive Bayes algorithm, as described above. We can instantiate the constructor without passing any arguments:

>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
>>> y_pred = gnb.fit(X_train, y_train).predict(X_test)

As before, we’ll use classification_report to report the performance:

>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))

            precision    recall  f1-score   support

        0       1.00      1.00      1.00        11
        1       0.93      1.00      0.96        13
        2       1.00      0.83      0.91         6

accuracy                            0.97        30
macro avg       0.98      0.94      0.96        30
weighted avg    0.97      0.97      0.97        30

Naive Bayes: Strengths and Weaknesses

Here we try to summarize the strengths and weaknesses of the Naive Bayes model. Keep in mind, these are general statement that tend to apply to most datasets.

Naive Bayes Strengths

Conceptually easy: Like, Logistic Regression, the Naive Bayes model is conceptually relatively easy to understand and implement.
Good scaling: Naive Bayes tends to be faster and more efficient to implement than Logistic Regression, and requires less storage.
Good in high dimensions: Naive Bayes can work better with high dimensional data (e.g., text classification) than other classifiers.

Naive Bayes Weaknesses

Poor accuracy when assumptions fail: When the pair-wise conditional independence assumption fails, the performance of Naive Bayes classifiers can suffer.
Zero frequency issue: Given that the probabilities are multiplied together in the equation above, Naive Bayes suffers from the “zero frequency issue” where, if some class value does not appear in the training set, its probability formally is 0, which causes the entire expression to be 0. In practice, there do exist techniques to handle this issue, but they add complexity.

Logistic Regression and Naive Bayes