Performance metrics for binary classification models

A deep dive into performance metrics for machine learning classification models and how to select the best metric for a given business scenario

  • Confusion matrix
  • Accuracy score
  • Precision score
  • Recall score
  • F1 score
  • F-beta score
  • Area under curve (AUC) & Receiver Operating Characteristic (ROC)

Confusion matrix

The confusion matrix, as the name suggests, measures how confused the model is when predicting the classes. Let us consider the following layout to understand how the confusion matrix is actually used: rows correspond to the actual classes and columns to the predicted classes, so for a binary problem the four cells hold the counts of true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP).

Confusion matrix
from sklearn.metrics import confusion_matrix
# if you want frequency
confusion_matrix(y_true, y_predicted)
# if you want the normalized frequency distribution
confusion_matrix(y_true, y_predicted, normalize='true')
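
To make this concrete, here is a minimal sketch with a small set of made-up labels (the values are purely illustrative, not from any real model); for binary labels, ravel() unpacks the matrix into the four counts:

from sklearn.metrics import confusion_matrix

# hypothetical labels, for illustration only
y_true      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

print(confusion_matrix(y_true, y_predicted))
# [[2 2]    <- actual 0: 2 true negatives, 2 false positives
#  [1 5]]   <- actual 1: 1 false negative, 5 true positives

# for binary labels, ravel() unpacks the four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_predicted).ravel()
print(tn, fp, fn, tp)   # 2 2 1 5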

Accuracy score

This is the most commonly used score in any classification problem. The accuracy score measures how accurately the model predicts the classes: out of the total population, how many of the 0’s are correctly predicted as 0 and how many of the 1’s are correctly predicted as 1. Using the confusion matrix, it can be represented as:

Computation of Accuracy score from Confusion matrix
Accuracy = (TP + TN) / (TP + TN + FP + FN)
from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_predicted)
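
As a quick sanity check, the same made-up labels from the sketch above (re-declared here so the snippet runs on its own) give the same value from the count-based formula and from accuracy_score:

from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical labels, for illustration only
y_true      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_predicted).ravel()
print((tp + tn) / (tp + tn + fp + fn))       # 0.7, from the formula
print(accuracy_score(y_true, y_predicted))   # 0.7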

Precision score

Precision helps us estimate the percentage of cases that are actually 1’s (true class) out of the total cases predicted as 1’s. Using the confusion matrix, precision can be computed as:

Computation of Precision score from Confusion matrix
Precision = TP / (TP + FP)
from sklearn.metrics import precision_score

precision_score(y_true, y_predicted)
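
Continuing the same illustrative example (made-up labels, re-declared so the snippet is self-contained), the toy data has TP = 5 and FP = 2, so precision should come out as 5/7:

from sklearn.metrics import precision_score

# hypothetical labels, for illustration only
y_true      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

# TP = 5 and FP = 2 in this toy data, so precision = 5 / (5 + 2)
print(precision_score(y_true, y_predicted))   # ~0.714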

Recall score

Recall helps us estimate the percentage of cases that are predicted as 1’s out of the total cases where the actual class is 1. Using the confusion matrix, recall can be computed as:

Computation of Recall score from Confusion matrix
Recall = TP / (TP + FN)
from sklearn.metrics import recall_score

recall_score(y_true, y_predicted)
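
With the same made-up labels as before (purely illustrative), TP = 5 and FN = 1, so recall should come out as 5/6:

from sklearn.metrics import recall_score

# hypothetical labels, for illustration only
y_true      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

# TP = 5 and FN = 1 in this toy data, so recall = 5 / (5 + 1)
print(recall_score(y_true, y_predicted))   # ~0.833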

F1 score

There are cases where maintaining a balance between precision and recall is more important than looking at either metric on its own. This is where the F1 score plays an important role. The F1 score is the harmonic mean of precision and recall. Mathematically, the F1 score is computed as:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

from sklearn.metrics import f1_score

f1_score(y_true, y_predicted)
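
As a sketch with the same illustrative labels, computing the harmonic mean of precision and recall by hand matches f1_score:

from sklearn.metrics import f1_score, precision_score, recall_score

# hypothetical labels, for illustration only
y_true      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

p = precision_score(y_true, y_predicted)   # 5/7
r = recall_score(y_true, y_predicted)      # 5/6
print(2 * p * r / (p + r))                 # harmonic mean, ~0.769
print(f1_score(y_true, y_predicted))       # ~0.769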

F-beta score

The F-beta score is a weighted version of the F1 score. It introduces a parameter β, which is chosen so that recall is considered β times as important as precision. Mathematically, the F-beta score is computed as:

F-beta = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)

from sklearn.metrics import fbeta_score

fbeta_score(y_true, y_predicted, beta=0.5)
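
A small sketch with the same made-up labels shows how β shifts the weight: β < 1 leans towards precision, β > 1 leans towards recall, and β = 1 recovers the F1 score:

from sklearn.metrics import fbeta_score

# hypothetical labels, for illustration only
y_true      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

print(fbeta_score(y_true, y_predicted, beta=0.5))   # ~0.735, leans towards precision
print(fbeta_score(y_true, y_predicted, beta=1))     # ~0.769, identical to the F1 score
print(fbeta_score(y_true, y_predicted, beta=2))     # ~0.806, leans towards recall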

AUC-ROC score

AUC, or area under the ROC curve, is simply the probability that an example from the positive class (1) will receive a higher score from the model than an example from the negative class (0). The ROC, or Receiver Operating Characteristic, curve is built by plotting Sensitivity (True Positive Rate, TPR) on the y-axis against 1 - Specificity (False Positive Rate, FPR) on the x-axis.

Computation of TPR and FPR
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
from sklearn.metrics import roc_curve, auc

# first, the false positive rate (fpr) and true positive rate (tpr) need
# to be extracted using the roc_curve function
fpr, tpr, thresholds = roc_curve(y_true, pred_probability)
# pass the computed fpr and tpr into auc function to get score
auc(fpr, tpr)
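
As an illustrative sketch with made-up labels and probabilities: the positive examples outrank the negative ones in 21 of the 24 positive-negative pairs, so the AUC is 21/24 = 0.875. Note that roc_auc_score gives the same value in a single call:

from sklearn.metrics import roc_curve, auc, roc_auc_score

# hypothetical labels and predicted probabilities, for illustration only
y_true           = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
pred_probability = [0.1, 0.3, 0.6, 0.4, 0.8, 0.9, 0.2, 0.7, 0.95, 0.65]

fpr, tpr, thresholds = roc_curve(y_true, pred_probability)
print(auc(fpr, tpr))                             # 0.875
print(roc_auc_score(y_true, pred_probability))   # 0.875, same value in one call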

Conclusion

In the above sections we have seen a few key metrics that are commonly used to measure the performance of a classification model. However, the choice of metric depends on the business use case and the problem that is being solved. While the confusion matrix is common across models, the other metrics depend on the cost impact of misclassifying labels and on how the data (dependent variable) is distributed.
