A comprehensive guide to classification metrics

Arun S V
7 min read · Jul 7, 2023


While it’s common for beginners in data science to rely on ‘Accuracy’ as the primary metric when evaluating classification models, it’s important to understand that there are several other metrics that offer a clearer picture of performance.

Precision and Recall are two of the most widely used metrics for measuring the performance of classification models in industry. Many detailed interpretations can be made from these metrics, about both the data and the model. This article goes in depth on how they can help evaluate and improve the performance of a classification model.

What are Precision & Recall?

To understand precision and recall, let’s first take a sample multi-class classification problem and the results generated by some model.

Consider a multi-class classification dataset with target classes A, B, C, and D.

💡 Note: The target class is the dependent variable of the dataset that the model tries to predict.

Let’s say the dataset contains 100 records in total. The distribution of records for each class is as follows:

  1. A → 28 records
  2. B → 23 records
  3. C → 24 records
  4. D → 25 records

This is known as the ground truth. From the above information, we can conclude that the data is balanced and the model has a roughly equal chance to learn the patterns of each class.

Now, assume that we have trained our model on the dataset. We use the model to predict the labels of the records in the test dataset.

💡 Note: The model is usually trained on a ‘train’ dataset and tested on the ‘test’ dataset.

Confusion Matrix

The confusion matrix contains the results of the predictions for the 100 records in the test dataset (i.e., a dataset different from the train dataset).

A confusion matrix compares the ground truth against the predictions of the model. Here, the columns represent the ‘Predicted Values’ and the rows represent the ‘Actual Values’.
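
As a minimal sketch, a confusion matrix like this can be computed with scikit-learn; the `y_true` and `y_pred` arrays below are short hypothetical stand-ins, not the actual 100 test records.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for the ground-truth and predicted labels of the test records.
y_true = ["A", "A", "B", "C", "D", "B", "C", "D", "A", "C"]
y_pred = ["A", "D", "B", "C", "D", "D", "C", "A", "A", "C"]

labels = ["A", "B", "C", "D"]

# Rows = actual classes, columns = predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```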

Observations from the Confusion Matrix

  1. The diagonal values count the records that are correctly classified for each class.
  2. The off-diagonal values (the upper and lower triangles) count the misclassified records.

A perfect confusion matrix would have all the records on the diagonal and zeros in every other cell. But in real-world scenarios there will be imperfections, as in our example.

In the case of class ‘B’, one particular interpretation we can make from the matrix is that the model classified 17 records that originally belonged to class ‘B’ as class ‘D’. One possible explanation is that the dataset used to train the model contains records that were mislabeled during dataset preparation (i.e., there is human error).

Let’s take class ‘A’.

Considering the row, we can infer that out of the 28 records belonging to class ‘A’, 18 were correctly classified as class ‘A’ and (3+1+6) = 10 were misclassified as other classes, i.e. 10 records that originally belonged to class ‘A’ were predicted as some other class.

Considering the column, the total number of records predicted as class ‘A’ is (18+2+4+10) = 34. Out of them, 18 were correct, and the remaining 16 records predicted as class ‘A’ originally belonged to other classes.

  1. Records that are correctly classified as class ‘A’ are referred to as True Positives (TP) = 18.
  2. Records that are classified as class ‘A’ but originally belong to other classes are known as False Positives (FP) = 16.
  3. Records that originally belonged to class ‘A’ but were classified as other classes are known as False Negatives (FN) = 10.
  4. Records that neither belong to class ‘A’ nor are classified as class ‘A’ are known as True Negatives (TN) = 56.
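
The same four quantities can be read directly off a confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, with class ‘A’ at index 0; apart from the class-‘A’ totals, the per-class record counts, and the 17 B→D records mentioned earlier, the individual cell values are illustrative placeholders rather than figures from the text.

```python
import numpy as np

# Illustrative 4x4 confusion matrix (rows = actual, columns = predicted).
# The class-'A' row/column totals (28 actual, 34 predicted, 18 correct) and the
# 17 B-records-predicted-as-D match the text; the other cells are placeholders.
cm = np.array([
    [18,  3,  1,  6],   # actual A (28 records)
    [ 2,  4,  0, 17],   # actual B (23 records)
    [ 4,  0, 20,  0],   # actual C (24 records)
    [10,  0,  0, 15],   # actual D (25 records)
])

i = 0  # index of class 'A'
tp = cm[i, i]                  # correctly predicted as 'A'
fp = cm[:, i].sum() - tp       # predicted 'A' but actually another class
fn = cm[i, :].sum() - tp       # actually 'A' but predicted as another class
tn = cm.sum() - tp - fp - fn   # everything else

print(tp, fp, fn, tn)  # 18 16 10 56
```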

Precision

Precision is the proportion of correctly predicted positive records (TP) out of all the records identified as positive (TP + FP): Precision = TP / (TP + FP). For class ‘A’, that is 18 / (18 + 16) ≈ 0.53. In other words, it is a measure of the correctness of the predictions.

Recall

Recall is the proportion of correctly predicted positive records (TP) out of all actual positive records (TP + FN): Recall = TP / (TP + FN). For class ‘A’, that is 18 / (18 + 10) ≈ 0.64. In other words, it is a measure of the completeness or inclusiveness of the predictions.

We calculate the Precision & Recall values for each class.
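
As a sketch, scikit-learn can compute these per-class values directly from the label arrays; `y_true` and `y_pred` are again hypothetical stand-ins for the test-set labels.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical ground-truth and predicted labels (stand-ins for the test records).
y_true = ["A", "A", "B", "C", "D", "B", "C", "D", "A", "C"]
y_pred = ["A", "D", "B", "C", "D", "D", "C", "A", "A", "C"]
labels = ["A", "B", "C", "D"]

# Per-class precision and recall.
precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
for cls, p, r, s in zip(labels, precision, recall, support):
    print(f"class {cls}: precision={p:.2f}  recall={r:.2f}  (support={s})")

# Or everything at once, including the averaged values discussed in the Appendix.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```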

Analysis of Precision and Recall

Usually, we use precision and recall to analyze how well the model has performed on each class. There are four possible scenarios based on the precision and recall values.

Scenario (1) Both Precision & Recall are High

Here, both the precision and recall values are high. This is the scenario in the case of class ‘C’.

Class ‘C’ has high Precision and Recall values. From this, we can conclude that our classifier has learned the patterns in the records that belong to this class well and also classifies them correctly.

Scenario (2) Precision is High & Recall is Low

In this scenario, the precision is high and recall is low. This is the scenario in the case of class ‘B’.

💡 Note: I took class ‘B’ as an example here. In reality, class ‘B’ in our matrix has low precision and recall values.

This means that the records our classifier assigns to class ‘B’ are mostly correct, but there are many more records that actually belong to class ‘B’ and are either misclassified or left unclassified (the latter only in the case of a multi-label classification problem).

💡 Note: A multi-label classification problem is one in which one record can have more than one class assigned to it. They are known as labels.

A few interpretations can be made about our classifier and the data based on this scenario.

  1. There is a certain pattern present in the group of records that are correctly classified by our classifier.
  2. There are no significant patterns among the records that are either misclassified or unclassified (i.e. in the case of a multi-label classification problem).
  3. There are patterns common to both the correctly classified records and the remaining misclassified or unclassified records, but they are too complex for the model to identify.
  4. The threshold we set is too high, so some records receive a fairly high confidence for class ‘B’ that is still below the threshold and are therefore not assigned to it (i.e. in the case of a multi-label classification problem).

💡 Note: For a record, the classifier outputs the probabilities (also known as confidence levels) of it belonging to each class. In a multi-class classification problem, we assign the record the single class with the highest probability. In a multi-label classification problem, we assign the record every class whose confidence level is greater than the threshold.
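
To make the note concrete, here is a small sketch of both decision rules. The probability matrix and the 0.5 threshold are illustrative assumptions, not values from any actual model.

```python
import numpy as np

classes = np.array(["A", "B", "C", "D"])

# Illustrative per-class confidence levels output by some classifier for three records.
proba = np.array([
    [0.70, 0.10, 0.15, 0.05],
    [0.30, 0.45, 0.20, 0.05],
    [0.20, 0.40, 0.35, 0.05],
])

# Multi-class: assign each record the single class with the highest probability.
print(classes[proba.argmax(axis=1)])      # ['A' 'B' 'B']

# Multi-label: assign every class whose confidence exceeds a chosen threshold.
threshold = 0.5                            # assumed threshold
print([list(classes[row >= threshold]) for row in proba])  # [['A'], [], []]
```

Notice how the second and third records end up with no label at all under the threshold rule, which is exactly the ‘unclassified’ case described in point 4 above.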

Scenario (3) Precision is Low & Recall is High

In this scenario, the precision is low and recall is high. This is the scenario in the case of class ‘A’.

This means that our classifier correctly classifies most of the records that belong to class ‘A’, but it also misclassifies records that belong to other classes as class ‘A’.

Here too, based on this scenario, a few interpretations can be made.

  1. The records which belong to the class ‘A’ do not have a significant pattern among them.
  2. Class ‘A’ has low support (too few records) in our data.
  3. The threshold we set is too low, so records with a low confidence level for class ‘A’ are still assigned to it (i.e. in the case of a multi-label classification problem).

Scenario (4) Both Precision & Recall are Low

Here, both the precision and recall values are low. This is the scenario in the case of classes ‘D’ & ‘B’.

This means that the classifier performs poorly in identifying the patterns among the records that belong to these classes. Possible reasons include:

  1. The data doesn’t have enough records belonging to that particular class for the classifier to learn patterns.
  2. There is no significant pattern to learn.

Appendix

We can also calculate precision & recall for the classifier as a whole by aggregating the per-class precision & recall values into a single number through averaging.

It can be done in 2 ways in the case of multi-class classification and 3 ways in the case of multi-label classification.

  1. Micro Precision and Micro Recall
  2. Macro Precision and Macro Recall
  3. Sample Precision and Sample Recall (only in the case of multi-label classification)

Macro Precision & Recall

Macro averaging is the straightforward approach: take the unweighted mean of the per-class precision values and, separately, of the per-class recall values.

Micro Precision & Recall

Micro Precision is the ratio of the total number of correctly classified records (the sum of TP over all classes) to the total number of records predicted as positive (the sum of TP + FP over all classes), counting individual instances rather than classes.

Micro Recall is the ratio of the total number of correctly classified records (the sum of TP over all classes) to the total number of actual positives (the sum of TP + FN over all classes). In a single-label multi-class problem, this equals the overall accuracy.

Sample Precision & Recall

In the case of multi-label classification, precision & recall values can be calculated for each record (or) sample.

Sample precision is the ratio of correctly predicted labels to the number of labels predicted for a record, and sample recall is the ratio of correctly predicted labels to the actual number of labels for that record. These per-record values are then averaged over all records.
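
A brief sketch of all three averaging modes with scikit-learn; the binary indicator matrices below are hypothetical multi-label targets for five records and the four labels A–D, chosen only to illustrate the API.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical multi-label indicator matrices: 5 records x 4 labels (A, B, C, D).
Y_true = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])
Y_pred = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
])

# 'macro' averages per class, 'micro' pools all label decisions,
# 'samples' averages per record (multi-label only).
for avg in ("macro", "micro", "samples"):
    p = precision_score(Y_true, Y_pred, average=avg, zero_division=0)
    r = recall_score(Y_true, Y_pred, average=avg, zero_division=0)
    print(f"{avg:>7}: precision={p:.2f}  recall={r:.2f}")
```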
