Learn Scikit-learn in One Article

What is Scikit-learn?

Scikit-learn, also known as "sklearn", is a free, open-source machine learning library for Python. It is a simple but very efficient tool for machine learning, data analysis, and data mining, offering a wide range of algorithms for tasks such as classification, regression, and clustering. It is built on top of Python's numerical and scientific libraries, NumPy and SciPy.

Importing Scikit-learn

Like any Python library, scikit-learn needs to be imported before it is used:

import sklearn  # in practice, individual tools are imported from submodules, e.g. sklearn.preprocessing

Preprocessing

Data preprocessing is the process of converting a raw dataset into a clean, meaningful one. This step must be carried out before the dataset is used in a machine learning algorithm. There are three main steps in data preprocessing:
  • Data loading
  • Data splitting
  • Data preparation

1. Data Loading

Data needs to be in numeric form, stored in numeric arrays. Here are the two most popular ways to load it.

Using NumPy

import numpy as np
data = np.loadtxt('file_name.csv', delimiter=',')  # expects a purely numeric file with no header row

Using Pandas 

import pandas as pd
df = pd.read_csv('file_name.csv', header=0)  # header=0 takes column names from the first row

2. Data Splitting

The second step is to split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
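
Note that train_test_split assumes a feature matrix X and a label vector y already exist. As a minimal sketch, assuming the dataframe from step 1 contains a label column named 'target' (a placeholder name, substitute your own), they could be built like this:

# 'target' is a hypothetical label column; replace it with your own
X = df.drop(columns=['target'])
y = df['target']
# test_size controls the split ratio; here 25% of the rows form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)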

3. Data Preparation

Standardization

Standardization rescales each feature to zero mean and unit variance, so that features with very different ranges end up on a common scale.

from sklearn.preprocessing import StandardScaler
get_names = df.columns
scaler = StandardScaler()                    # learns each column's mean and standard deviation
scaled_df = scaler.fit_transform(df)         # returns a NumPy array
scaled_df = pd.DataFrame(scaled_df, columns=get_names)
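
One detail worth keeping in mind: in a real train/test workflow the scaler should be fitted on the training data only and then applied to the test data, so that no information leaks from the test set. A minimal sketch, reusing the split from step 2:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test data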

Normalization

Normalization rescales data into a specific range (typically scaling each sample to unit norm), which makes training less sensitive to the scale of the features and leaves the data better conditioned for convergence.

from sklearn.preprocessing import normalize
import numpy as np
df = pd.read_csv('file_name.csv')
x_array = np.array(df['Column1'])    # normalizing Column1
normalized_X = normalize([x_array])  # scales the row vector to unit (L2) norm
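
The same operation is also available as an estimator, Normalizer, which scales each sample (row) to unit norm and can therefore be used as a step in a pipeline. A small sketch:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4.0, 3.0], [1.0, 2.0]])
normalizer = Normalizer(norm='l2')          # each row is rescaled to unit L2 norm
X_normalized = normalizer.fit_transform(X)  # [[0.8, 0.6], [0.447..., 0.894...]]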

Working on a model

After finishing the necessary preprocessing, we need to work on the model. Choosing a model that can represent the dataset well helps produce the kind of predictions we want; once a model is chosen, we perform model fitting.

Model Choosing

Supervised Learning Estimators:

Supervised learning is a type of machine learning in which we supervise the results by training the model on a labeled dataset. The most common estimators are listed below; an end-to-end sketch follows the list.

  • Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()  # note: the normalize argument was removed from recent scikit-learn; scale the data beforehand instead

  • Support Vector Machine (SVM)
from sklearn.svm import SVC
svc = SVC(kernel='linear')

  • Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB() 

  • K-Nearest Neighbors (KNN)
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
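
To see how these estimators fit together in practice, here is a minimal end-to-end sketch on scikit-learn's built-in iris dataset; any of the classifiers above could be swapped in for KNN.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)           # learn from the training set
print(knn.score(X_test, y_test))    # mean accuracy on the test set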

Unsupervised Learning Estimators

Unsupervised learning is a type of machine learning in which we train the model on unlabeled, unclassified data and let the algorithm find structure in the dataset without any supervision. A short sketch follows the list.
  • Principal Component Analysis (PCA):
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # a float keeps as many components as needed to explain 95% of the variance

  • K Means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=5, random_state=0)
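
As a quick illustration of both estimators, here is a minimal sketch on random data (the data and shapes are made up for the example):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 10)                 # 100 samples, 10 features

pca = PCA(n_components=0.95)          # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance explained per component

k_means = KMeans(n_clusters=5, random_state=0, n_init=10)
k_means.fit(X_reduced)
print(k_means.labels_[:10])           # cluster assignment of the first 10 samples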

Model Fitting

Fitting a model means learning its parameters from the training data. A well-fitted model should then generalize, i.e. perform well on new data similar to the data it was trained on.

Supervised

lr.fit(X_train, y_train)
svc.fit(X_train, y_train)
gnb.fit(X_train, y_train)
knn.fit(X_train, y_train)

Unsupervised

k_means.fit(X_train)
X_train_reduced = pca.fit_transform(X_train)  # fit_transform both fits PCA and returns the projected data

Post-processing

After preparing the data and training the model, the next step is the main goal of any machine learning algorithm: making predictions and evaluating the results.

Prediction

Once the model has been chosen and fitted, we can finally make predictions on the test data.

Supervised

y_pred = lr.predict(X_test)          # continuous predictions (regression)
y_pred = svc.predict(X_test)         # predicted class labels
y_pred = gnb.predict(X_test)         # predicted class labels
y_proba = knn.predict_proba(X_test)  # class probabilities rather than labels

Unsupervised

y_pred = k_means.predict(X_test)

Evaluation

Evaluating model performance is essential. Machine learning offers many techniques for evaluating models and visualizing their performance.

Classification

  • Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

  • Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
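
Precision, recall, and F1-score per class can also be printed in one go with classification_report, which is often a convenient companion to the two metrics above:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1-score and support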

Regression

  • Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

  • Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

  • R² Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
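
A commonly reported variant is the root mean squared error (RMSE): simply the square root of the MSE, expressed in the same units as the target. A minimal sketch:

import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # same units as the target variable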

Clustering

  • Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

  • V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)

Cross-Validation

Cross-validation is not a metric itself but a more reliable way of estimating any of the metrics above: the training data is split into several folds and the model is scored on each fold in turn.

from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X_train, y_train, cv=2))
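
By default cross_val_score uses the estimator's default scorer; a specific metric can be requested through the scoring parameter. A small sketch:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())  # average score across folds and its spread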

Model Tuning

Model tuning lets you customize a model so that it generates the most accurate results, by searching for the right set of hyperparameters. There are two main ways to do this:

Grid Search

In grid search, parameter tuning is done methodically: every candidate combination from a grid of parameter values, specified with the param_grid argument, is evaluated.

import numpy as np
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
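
After fitting, GridSearchCV exposes the winning parameter combination, and by default it refits the best model on the full training set, so it can be used for prediction directly:

print(grid.best_params_)       # the winning parameter combination
y_pred = grid.predict(X_test)  # delegates to the refit best estimator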

Randomized Search

In randomized search, random combinations of hyperparameters are sampled and used to train the model, which is often much cheaper than an exhaustive grid.

from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rs = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rs.fit(X_train, y_train)
print(rs.best_score_)
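
Randomized search accepts not only lists of values but also distributions to sample from. A minimal sketch using scipy.stats (this assumes SciPy is installed, which scikit-learn already requires):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

params = {"n_neighbors": randint(1, 15)}  # sample integers uniformly from [1, 15)
rs = RandomizedSearchCV(estimator=knn, param_distributions=params,
                        n_iter=10, cv=4, random_state=5)
rs.fit(X_train, y_train)
print(rs.best_params_)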