Scikit-learn is a powerful Python module for machine learning. It contains functions for regression, classification, clustering, model selection and dimensionality reduction. The sklearn.linear_model module contains “methods intended for regression in which the target value is expected to be a linear combination of the input variables”.
All examples use the Boston Housing data set, which contains information about housing values in the suburbs of Boston.
This information is also available at the UCI Machine Learning Repository.
The first step is to import the required Python libraries into Jupyter Notebook.
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
This data set is available in the sklearn Python module, so you can access it using scikit-learn (in releases before 1.2, which removed the bundled copy).
Important functions to keep in mind while fitting a linear regression model (here called lm) are:
lm.fit() -> Fits a linear model
lm.predict() -> Predicts Y using the linear model with estimated coefficients
lm.score() -> Returns the coefficient of determination (R^2), a measure of how well observed outcomes are replicated by the model, expressed as the proportion of total variation of outcomes explained by the model
After fitting, lm.coef_ gives the estimated coefficients and lm.intercept_ gives the estimated intercept.
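The calls listed above can be sketched end to end as follows. This is a minimal illustration, not the article's original notebook cell: it assumes a small synthetic data set (the arrays X and y and the values in true_coef are made up for the example) instead of the Boston Housing data, so that it runs on any scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (an assumption, not the Boston Housing set):
# 200 samples, 3 features, a known linear relationship plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 4.0 + rng.normal(scale=0.1, size=200)

lm = LinearRegression()
lm.fit(X, y)                          # estimate coefficients and intercept
y_pred = lm.predict(X)                # predictions from the fitted model
r2 = lm.score(X, y)                   # coefficient of determination (R^2)
mse = mean_squared_error(y, y_pred)   # mean squared error on the full data set

print("coefficients:", lm.coef_)
print("intercept:", lm.intercept_)
print("R^2:", r2)
print("MSE:", mse)
```

Because the data were generated from a known linear rule, the estimated coefficients and intercept should land close to the values used to build y, which makes this a handy sanity check before moving to real data.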
In practice you won't fit a linear regression on the entire data set. Instead, you split the data into training and test sets, so that you train your model on the training data and see how well it performs on the test data.
You have to divide your data set randomly. Scikit-learn provides a function called train_test_split to do this.
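A random split with train_test_split might look like the sketch below. The synthetic X and y, the 25% test fraction and the random_state value are all assumptions chosen for illustration; substitute your own feature matrix and target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Boston Housing set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Hold out 25% of the rows at random; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5
)

# Fit on the training rows only.
lm = LinearRegression()
lm.fit(X_train, y_train)

# Compare the error on data the model has seen with data it has not.
mse_train = mean_squared_error(y_train, lm.predict(X_train))
mse_test = mean_squared_error(y_test, lm.predict(X_test))
print("train MSE:", mse_train)
print("test MSE:", mse_test)
```

A test error much larger than the training error would suggest that the model does not generalize well to unseen data.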
Residual plots are a good way to visualize the errors in your model's predictions. If you have done a good job, the residuals should be randomly scattered around the zero line. If you see structure in the residuals, your model is not capturing something. Maybe there is an interaction between two variables that you are not considering, or maybe you are measuring time-dependent data. If you do see structure, go back to your model and check whether you are doing a good job with your parameters.
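One way to draw such a residual plot is sketched below. It again assumes synthetic data and a 25% test split, and it takes the residual to be the observed value minus the predicted value; the colors and labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Boston Housing set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5
)
lm = LinearRegression().fit(X_train, y_train)

# Residual = observed value minus predicted value.
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
resid_train = y_train - pred_train
resid_test = y_test - pred_test

# Plot residuals against predictions, with a reference line at zero.
fig, ax = plt.subplots()
ax.scatter(pred_train, resid_train, c="b", s=40, alpha=0.5, label="training data")
ax.scatter(pred_test, resid_test, c="g", s=40, label="test data")
ax.axhline(y=0, color="k", linewidth=1)
ax.set_xlabel("predicted value")
ax.set_ylabel("residual")
ax.set_title("Residual plot")
ax.legend()
```

In a notebook with %matplotlib inline the figure renders automatically; in a plain script, call plt.show() afterwards. With data that really is linear, as here, the points should form a shapeless band around zero.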
To recap what was done in this article: we used scikit-learn to fit a linear regression to the entire data set and calculated the mean squared error.