Intro Codethrough to Machine Learning Classification Algorithms
- Classifier: An algorithm that maps input data to a specific category
- Classification Model: Tries to draw some conclusion from the input values to predict class labels/categories for new data
- Feature: Individual, measurable property or characteristic of whatever is being observed
- Binary Classification: Classification task with only two outcomes (Fraud, No Fraud)
Process
- Split your data
- Initialize the classifier
- Train the classifier
- Predict the target
- Evaluate the model
Train-Test-Split
When training a machine learning algorithm, it’s vital to keep a portion of the data separate so that the algorithm never sees it. That held-out subset is the testing data you will use once you’ve trained a model on the training data. Evaluating on the testing data last tells you how well the model is performing and whether it may be underfitting or overfitting the data.
To prepare your data:
# Split the data
# Target variable
y = pre_df["target"]
# Features
X = pre_df.drop(["target"], axis=1)
# Train-test-split function from scikit-learn
from sklearn.model_selection import train_test_split
# Create your training subset (X_train, y_train) and your testing subset (X_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=47, test_size=0.2)
Classifiers
Decision Trees
Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. In other words, a decision tree is just a series of IF-ELSE statements (rules). Each path from the root of a decision tree to one of its leaves can be transformed into a rule by combining the decisions along the path to form the antecedent and taking the leaf’s class prediction as the consequent (IF-ELSE statements follow the form: IF antecedent THEN consequent); the sketch after the training code below prints these rules directly.
The tree is grown until some stopping criterion is met, such as a maximum tree depth or a minimum number of samples per leaf.
from sklearn.tree import DecisionTreeClassifier
# Initialize decision tree classifier
dt = DecisionTreeClassifier()
# Fit the training data (training the classifier here - only fit the training data!)
dt.fit(X_train, y_train)
# Predict the target on the training and testing data
training_preds = dt.predict(X_train)
test_preds = dt.predict(X_test)
# Evaluate the model
# Evaluation metrics may differ depending on the classification problem
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score(y_test, test_preds)
precision_score(y_test, test_preds)
recall_score(y_test, test_preds)
f1_score(y_test, test_preds)
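Because a fitted tree is just a set of IF-ELSE rules, you can print the learned rules directly. Here is a minimal sketch using the dt classifier fitted above (export_text comes from sklearn.tree; feature_names is optional and simply reuses the column names of X here):
from sklearn.tree import export_text
# Print the learned decision rules as nested IF-ELSE statements
rules = export_text(dt, feature_names=list(X.columns))
print(rules)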
Logistic Regression
Logistic regression is a linear model used for classification. It models the probability that one of two classes (events) occurs, so the target variable is categorical, e.g., a patient has heart disease (1) or does not have heart disease (0).
from sklearn.linear_model import LogisticRegression
# Initialize the logistic regression classifier and fit the data
lr = LogisticRegression()
# Fit the training data (training the classifier here - only fit the training data!)
lr.fit(X_train, y_train)
# Predict the target on the training and testing data
training_preds = lr.predict(X_train)
test_preds = lr.predict(X_test)
# Evaluate the model
# Evaluation metrics may differ depending on the classification problem
accuracy_score(y_test, test_preds)
precision_score(y_test, test_preds)
recall_score(y_test, test_preds)
f1_score(y_test, test_preds)
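Because logistic regression models a probability, you can inspect the predicted probabilities directly and see how the default 0.5 threshold turns them into class labels. A minimal sketch using the lr classifier fitted above (this assumes the target is encoded as 0/1):
import numpy as np
# Probability of the positive class (column 1) for each test row
test_probs = lr.predict_proba(X_test)[:, 1]
# Applying the default 0.5 threshold reproduces lr.predict(X_test)
manual_preds = (test_probs >= 0.5).astype(int)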
K-Nearest Neighbors
Neighbors-based classification is a type of instance-based learning that computes the classification from a simple majority vote of the nearest neighbors of each point: a query point is assigned the class that has the most representatives among that point's nearest neighbors. \(k\)-Nearest Neighbors implements learning based on the \(k\) nearest neighbors of each query point, where \(k\) is an integer value specified by the user.
from sklearn.neighbors import KNeighborsClassifier
# Initialize the KNN classifier and fit the data
knn = KNeighborsClassifier()
# Fit the training data (training the classifier here - only fit the training data!)
knn.fit(X_train, y_train)
# Predict the target on the training and testing data
training_preds = knn.predict(X_train)
test_preds = knn.predict(X_test)
# Evaluate the model
# Evaluation metrics may differ depending on the classification problem
accuracy_score(y_test, test_preds)
precision_score(y_test, test_preds)
recall_score(y_test, test_preds)
f1_score(y_test, test_preds)
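The number of neighbors \(k\) is the main hyperparameter (it defaults to 5 in scikit-learn). A minimal sketch that compares test accuracy for a few values of \(k\), reusing the train/test split and the accuracy_score import from above:
# Compare test accuracy for a few values of k
for k in [3, 5, 7, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn_k.predict(X_test)))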
Random Forest
Random Forest Classifiers are a type of ensemble method: a diverse set of classifiers is created by introducing randomness into the classifier construction, and the prediction of the ensemble is given as the averaged prediction of the individual decision tree classifiers. In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
from sklearn.ensemble import RandomForestClassifier
# Initialize the Random Forest classifier and fit the data
rf = RandomForestClassifier()
# Fit the training data (training the classifier here - only fit the training data!)
rf.fit(X_train, y_train)
# Predict the target on the training and testing data
training_preds = rf.predict(X_train)
test_preds = rf.predict(X_test)
# Evaluate the model
# Evaluation metrics may differ depending on the classification problem
accuracy_score(y_test, test_preds)
precision_score(y_test, test_preds)
recall_score(y_test, test_preds)
f1_score(y_test, test_preds)
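To see the ensemble idea in action, note that the fitted forest is literally a list of bootstrapped decision trees, and the ensemble prediction averages their class probabilities. A minimal sketch using the rf classifier fitted above:
# rf.estimators_ is the list of decision trees behind the forest (100 by default)
print(len(rf.estimators_))
# The ensemble prediction averages the class probabilities of the individual trees
print(rf.predict_proba(X_test)[:3])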
XGBoost
From its documentation, XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework.
Gradient Boosting algorithms are a more advanced family of boosting algorithms that make use of Gradient Descent. Boosting starts with a weak learner that makes predictions on the dataset. The algorithm then checks this learner’s performance, identifying which examples it got right and wrong, and calculates the residuals for each data point to determine how far off the mark each prediction was. It then combines these residuals with a loss function to calculate the overall loss, and each subsequent learner is fit to reduce that loss.
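To make the residual idea concrete, here is a hand-rolled boosting sketch in which each new small tree is fit to the residuals of the current ensemble. This is illustrative only: it uses squared error on the 0/1 target for simplicity, whereas real gradient boosting for classification works on the gradient of a log loss, and XGBoost adds many further optimizations on top.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Start from a constant prediction, then repeatedly fit a shallow tree
# to the residuals (errors) of the current ensemble and add it in
pred = np.full(len(y_train), y_train.mean())
learning_rate = 0.1
trees = []
for _ in range(50):
    residuals = y_train - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X_train, residuals)
    pred += learning_rate * tree.predict(X_train)
    trees.append(tree)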
XGBoost, or eXtreme Gradient Boosting, provides parallel tree boosting that solves many data science problems quickly and accurately. It is a stand-alone library that implements popular gradient boosting algorithms in a highly optimized way and frequently achieves state-of-the-art results on tabular classification problems.
from xgboost import XGBClassifier
# Initialize the XGBoost classifier and fit the data
xgb = XGBClassifier()
# Fit the training data (training the classifier here - only fit the training data!)
xgb.fit(X_train, y_train)
# Predict the target on the training and testing data
training_preds = xgb.predict(X_train)
test_preds = xgb.predict(X_test)
# Evaluate the model
# Evaluation metrics may differ depending on the classification problem
accuracy_score(y_test, test_preds)
precision_score(y_test, test_preds)
recall_score(y_test, test_preds)
f1_score(y_test, test_preds)
Classification Metrics
How do we evaluate how well our classification models are predicting correct labels/classes? We can use evaluation metrics provided within the scikit-learn package.
- accuracy_score(y_true, y_pred): calculates the accuracy classification score
- f1_score(y_true, y_pred): calculates the F1-score, which is the harmonic mean of the precision and recall scores
- precision_score(y_true, y_pred): calculates what percent of the cases predicted positive were actually positive
- recall_score(y_true, y_pred): calculates what percent of the actual positive cases were caught
- roc_auc_score(y_true, y_score): calculates the Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores
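Of these, roc_auc_score is the one metric that expects probability scores rather than hard 0/1 labels. A minimal sketch using the predicted probabilities from the lr classifier fitted above:
from sklearn.metrics import roc_auc_score
# ROC AUC should be computed from probability scores, not predicted labels
roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])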
The Classification Report is a printed report showing the main classification metrics for each class: precision, recall, and the F1 score, along with the number of samples of each class (indicated as ‘support’).
from sklearn.metrics import classification_report
# test_preds are the model's predictions on the test set
print(classification_report(y_test, test_preds))
The Confusion Matrix evaluates classification performance by computing a matrix in which each row corresponds to the true class and each column to the predicted class. When creating a confusion matrix, you can normalize it so that each row shows the percentage of that true label's samples the model assigned to each predicted class.
The higher the diagonal values of the confusion matrix, the better, indicating many correct predictions (True Positives and True Negatives).
from sklearn.metrics import ConfusionMatrixDisplay
# clf is any fitted classifier from above (e.g., rf); plot_confusion_matrix is the older
# scikit-learn API and has since been removed in favor of ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, normalize='true', cmap='Blues');