Learn Scikit-learn in One Article

What is Scikit-learn?

Scikit-learn, also known as "sklearn", is a free, open-source machine learning library for Python. It is a simple but very efficient tool for machine learning, data analysis, and data mining, offering a wide range of algorithms for tasks such as classification, regression, and clustering. It is built on top of Python's numerical and scientific libraries, NumPy and SciPy.

Importing Scikit-learn

Like any Python library, scikit-learn needs to be imported before it is used:

import sklearn  # in practice, individual tools are imported from submodules, e.g. sklearn.preprocessing

Preprocessing

Data preprocessing is the process of converting a raw dataset into a clean, meaningful one. This step must be carried out before the dataset is used in a machine learning algorithm. There are three main steps in data preprocessing:
  • Data loading
  • Data splitting
  • Data preparation

1. Data Loading

Data needs to be in numeric form, stored in numeric arrays. Here are the two most popular ways to load it.

Using NumPy

import numpy as np
data = np.loadtxt('file_name.csv', delimiter=',')  # expects a purely numeric file with no header row

Using Pandas 

import pandas as pd
df = pd.read_csv('file_name.csv', header=0)  # header=0 takes column names from the first row

2. Data Splitting

The second step is to split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
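
Note that train_test_split assumes a feature matrix X and a label vector y already exist. As a minimal sketch, assuming the dataframe from step 1 contains a label column named 'target' (a placeholder name, substitute your own), they could be built like this:

# 'target' is a hypothetical label column; replace it with your own
X = df.drop(columns=['target'])
y = df['target']
# test_size controls the split ratio; here 25% of the rows form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)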

3. Data Preparation

Standardization

Standardization rescales each feature to zero mean and unit variance, so that features with very different ranges end up on a common scale.

from sklearn.preprocessing import StandardScaler
get_names = df.columns
scaler = StandardScaler()                    # learns each column's mean and standard deviation
scaled_df = scaler.fit_transform(df)         # returns a NumPy array
scaled_df = pd.DataFrame(scaled_df, columns=get_names)
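
One detail worth keeping in mind: in a real train/test workflow the scaler should be fitted on the training data only and then applied to the test data, so that no information leaks from the test set. A minimal sketch, reusing the split from step 2:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test data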

Normalization

Normalization rescales data into a specific range (typically scaling each sample to unit norm), which makes training less sensitive to the scale of the features and leaves the data better conditioned for convergence.

from sklearn.preprocessing import normalize
import numpy as np
df = pd.read_csv('file_name.csv')
x_array = np.array(df['Column1'])    # normalizing Column1
normalized_X = normalize([x_array])  # scales the row vector to unit (L2) norm
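
The same operation is also available as an estimator, Normalizer, which scales each sample (row) to unit norm and can therefore be used as a step in a pipeline. A small sketch:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4.0, 3.0], [1.0, 2.0]])
normalizer = Normalizer(norm='l2')          # each row is rescaled to unit L2 norm
X_normalized = normalizer.fit_transform(X)  # [[0.8, 0.6], [0.447..., 0.894...]]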

Working on a model

After finishing the necessary preprocessing, we need to work on the model. Choosing a model that can represent the dataset well helps produce the kind of predictions we want; once a model is chosen, we perform model fitting.

Model Choosing

Supervised Learning Estimators:

Supervised learning is a type of machine learning in which we supervise the results by training the model on a labeled dataset. The most common estimators are listed below; an end-to-end sketch follows the list.

  • Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()  # note: the normalize argument was removed from recent scikit-learn; scale the data beforehand instead

  • Support Vector Machine (SVM)
from sklearn.svm import SVC
svc = SVC(kernel='linear')

  • Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB() 

  • K-Nearest Neighbors (KNN)
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
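
To see how these estimators fit together in practice, here is a minimal end-to-end sketch on scikit-learn's built-in iris dataset; any of the classifiers above could be swapped in for KNN.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)           # learn from the training set
print(knn.score(X_test, y_test))    # mean accuracy on the test set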

Unsupervised Learning Estimators

Unsupervised learning is a type of machine learning in which we train the model on unlabeled, unclassified data and let the algorithm find structure in the dataset without any supervision. A short sketch follows the list.
  • Principal Component Analysis (PCA):
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # a float keeps as many components as needed to explain 95% of the variance

  • K Means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=5, random_state=0)
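
As a quick illustration of both estimators, here is a minimal sketch on random data (the data and shapes are made up for the example):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 10)                 # 100 samples, 10 features

pca = PCA(n_components=0.95)          # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance explained per component

k_means = KMeans(n_clusters=5, random_state=0, n_init=10)
k_means.fit(X_reduced)
print(k_means.labels_[:10])           # cluster assignment of the first 10 samples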

Model Fitting

Fitting a model means learning its parameters from the training data. A well-fitted model should then generalize, i.e. perform well on new data similar to the data it was trained on.

Supervised

lr.fit(X_train, y_train)
svc.fit(X_train, y_train)
gnb.fit(X_train, y_train)
knn.fit(X_train, y_train)

Unsupervised

k_means.fit(X_train)
X_train_reduced = pca.fit_transform(X_train)  # fit_transform both fits PCA and returns the projected data

Post-processing

After preparing the data and training the model, the next step is the main goal of any machine learning algorithm: making predictions and evaluating the results.

Prediction

Once the model has been chosen and fitted, we can finally make predictions on the test data.

Supervised

y_pred = lr.predict(X_test)          # continuous predictions (regression)
y_pred = svc.predict(X_test)         # predicted class labels
y_pred = gnb.predict(X_test)         # predicted class labels
y_proba = knn.predict_proba(X_test)  # class probabilities rather than labels

Unsupervised

y_pred = k_means.predict(X_test)

Evaluation

Evaluating model performance is essential. Machine learning offers many techniques for evaluating models and visualizing their performance.

Classification

  • Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

  • Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
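
Precision, recall, and F1-score per class can also be printed in one go with classification_report, which is often a convenient companion to the two metrics above:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1-score and support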

Regression

  • Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

  • Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

  • R² Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
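
A commonly reported variant is the root mean squared error (RMSE): simply the square root of the MSE, expressed in the same units as the target. A minimal sketch:

import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # same units as the target variable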

Clustering

  • Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

  • V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)

Cross-Validation

Cross-validation is not a metric itself but a more reliable way of estimating any of the metrics above: the training data is split into several folds and the model is scored on each fold in turn.

from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X_train, y_train, cv=2))
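
By default cross_val_score uses the estimator's default scorer; a specific metric can be requested through the scoring parameter. A small sketch:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())  # average score across folds and its spread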

Model Tuning

Model tuning lets you customize a model so that it generates the most accurate results, by searching for the right set of hyperparameters. There are two main ways to do this:

Grid Search

In grid search, parameter tuning is done methodically: every candidate combination from a grid of parameter values, specified with the param_grid argument, is evaluated.

import numpy as np
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
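
After fitting, GridSearchCV exposes the winning parameter combination, and by default it refits the best model on the full training set, so it can be used for prediction directly:

print(grid.best_params_)       # the winning parameter combination
y_pred = grid.predict(X_test)  # delegates to the refit best estimator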

Randomized Search

In randomized search, random combinations of hyperparameters are sampled and used to train the model, which is often much cheaper than an exhaustive grid.

from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rs = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rs.fit(X_train, y_train)
print(rs.best_score_)
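
Randomized search accepts not only lists of values but also distributions to sample from. A minimal sketch using scipy.stats (this assumes SciPy is installed, which scikit-learn already requires):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

params = {"n_neighbors": randint(1, 15)}  # sample integers uniformly from [1, 15)
rs = RandomizedSearchCV(estimator=knn, param_distributions=params,
                        n_iter=10, cv=4, random_state=5)
rs.fit(X_train, y_train)
print(rs.best_params_)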