reading-notes

Class 13: Linear Regressions

How to Run Linear Regression in Python

Scikit-learn is a powerful Python module for machine learning. It contains function for regression, classification, clustering, model selection and dimensionality reduction. The sklearn.linear_model module contains “methods intended for regression in which the target value is expected to be a linear combination of the input variables”.

All examples are using the Boston Housing data set, the data set contains information about the housing values in suburbs of Boston.

This information is also available at the UCI Machine Learning Repository.

Exploring Boston Housing Data Set

The first step is to import the required Python libraries into Jupyter Notebook.

%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn

This data set is available in sklearn Python module, so you access it using scikitlearn.

Important functions to keep in mind while fitting a linear regression model are:

.coef_ gives the coefficients and .intercept_ gives the estimated intercepts.

Training and validation data sets

In practice you wont implement linear regression on the entire data set, you will have to split the data sets into training and test data sets. So that you train your model on training data and see how well it performed on test data.

How to do train-test split:

You have to divide your data sets randomly. Scikit learn provides a function called train_test_split to do this.

Residual Plots

Residual plots are a good way to visualize the errors in your data. If you have done a good job then your data should be randomly scattered around line zero. If you see structure in your data, that means your model is not capturing some thing. Maye be there is a interaction between 2 variables that you are not considering, or may be you are measuring time dependent data. If you get some structure in your data, you should go back to your model and check whether you are doing a good job with your parameters.


Residual Plots Example

Recap what was done in this article:

Additional Resources