Learning to rank. Metrics in machine learning tasks: what a quality metric means in machine learning

In machine learning tasks, metrics are used to assess the quality of models and to compare different algorithms; selecting and analyzing them is an essential part of a data scientist's work.

In this article, we'll look at some quality criteria in classification problems, discuss what's important when choosing a metric, and what can go wrong.

Metrics in classification problems

To demonstrate some useful sklearn functions and to visualize the metrics, we will use our dataset on telecom operator customer churn, which we met in the first article of the course.

Let's import the necessary libraries and look at the data:

import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# numpy, itertools, confusion_matrix and roc_curve are added here because they are used in the cells below
from sklearn.metrics import (confusion_matrix, precision_recall_curve, roc_curve,
                             classification_report)

df = pd.read_csv("../../data/telecom_churn.csv")

df.head(5)


Data preprocessing

# Map the binary columns
# and dummy-encode the State column (for simplicity, it is better not to do this for tree-based models)
d = {"Yes": 1, "No": 0}

df["International plan"] = df["International plan"].map(d)
df["Voice mail plan"] = df["Voice mail plan"].map(d)
df["Churn"] = df["Churn"].astype("int64")

le = LabelEncoder()
df["State"] = le.fit_transform(df["State"])

ohe = OneHotEncoder(sparse=False)
encoded_state = ohe.fit_transform(df["State"].values.reshape(-1, 1))
tmp = pd.DataFrame(encoded_state,
                   columns=["state " + str(i) for i in range(encoded_state.shape[1])])
df = pd.concat([df, tmp], axis=1)

Accuracy, precision and recall

Before moving on to the metrics themselves, we need to introduce an important concept for describing them in terms of classification errors: the confusion matrix (error matrix).
Suppose we have two classes and an algorithm that predicts, for each object, which class it belongs to; then the classification error matrix looks like this:

              y = 1                  y = 0
ŷ = 1    True Positive (TP)    False Positive (FP)
ŷ = 0    False Negative (FN)   True Negative (TN)

Here, $\hat{y}$ is the algorithm's prediction for an object, and $y$ is the true class label of that object.
Thus, classification errors are of two types: False Negative (FN) and False Positive (FP).

Training the algorithm and building the error matrix

X = df.drop("Churn", axis=1)
y = df["Churn"]

# Divide the sample into train and test; all metrics will be evaluated on the test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.33, random_state=42)

# Train a plain logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)


# Confusion-matrix plotting function from the sklearn documentation
def plot_confusion_matrix(cm, classes, normalize=False,
                          title="Confusion matrix", cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


font = {"size": 15}
plt.rc("font", **font)

cnf_matrix = confusion_matrix(y_test, lr.predict(X_test))
plt.figure(figsize=(10, 8))
plot_confusion_matrix(cnf_matrix, classes=["Non-churned", "Churned"],
                      title="Confusion matrix")
plt.savefig("conf_matrix.png")
plt.show()


Accuracy

An intuitive, obvious, yet almost unused metric is accuracy, the proportion of correct answers of the algorithm:

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

This metric is useless in problems with unequal classes, and this is easy to show with an example.

Let's say we want to evaluate the performance of a spam mail filter. We have 100 non-spam emails, 90 of which our classifier identified correctly (True Negative = 90, False Positive = 10), and 10 spam emails, 5 of which the classifier also identified correctly (True Positive = 5, False Negative = 5).
Then accuracy:

$$accuracy = \frac{5 + 90}{5 + 90 + 10 + 5} \approx 0.864$$

However, if we simply predict all emails as non-spam, we will get a higher accuracy:

$$accuracy = \frac{0 + 100}{0 + 100 + 0 + 10} \approx 0.909$$

At the same time, our model does not have any predictive power at all, since we initially wanted to identify spam emails. The transition from a common metric for all classes to individual indicators of class quality will help us overcome this.
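
As a sanity check, here is a minimal sketch that reproduces both accuracy values with sklearn; the label arrays are built synthetically from the counts in the example:

import numpy as np
from sklearn.metrics import accuracy_score

# 100 non-spam (label 0) and 10 spam (label 1) emails
y_true = np.array([0] * 100 + [1] * 10)

# the classifier from the example: 90 TN, 10 FP, 5 TP, 5 FN
y_pred = np.array([0] * 90 + [1] * 10 + [1] * 5 + [0] * 5)
print(accuracy_score(y_true, y_pred))        # ~0.864

# the trivial "everything is non-spam" classifier
y_all_ham = np.zeros_like(y_true)
print(accuracy_score(y_true, y_all_ham))     # ~0.909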

Precision, recall and F-measure

To assess the quality of the algorithm's work on each class separately, we introduce the metrics precision and recall:

$$precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN}$$

Precision can be interpreted as the proportion of objects called positive by the classifier that are actually positive, while recall shows what proportion of the objects of the positive class the algorithm has found.

It is precision that keeps us from simply assigning all objects to one class, since doing so would increase the number of False Positives. Recall demonstrates the algorithm's ability to detect a given class at all, while precision demonstrates its ability to distinguish this class from other classes.

As we noted earlier, there are two types of classification errors: False Positive and False Negative. In statistics, the first type of error is called a type I error, and the second type is called a type II error. In our task of determining subscriber churn, an error of the first type will be mistaking a loyal subscriber for a leaving one, since our null hypothesis is that none of the subscribers leave, and we reject this hypothesis. Accordingly, an error of the second type will be the “missing” of a leaving subscriber and the erroneous acceptance of the null hypothesis.

Precision and recall, unlike accuracy, do not depend on the class ratio and therefore are applicable in conditions of unbalanced samples.
Often in real practice, the task is to find the optimal (for the customer) balance between these two metrics. A classic example is the customer churn problem.
Obviously, we cannot find all churning clients and only them. But having decided on a strategy and a budget for customer retention, we can select suitable thresholds for precision and recall. For example, we might focus on retaining only high-yield customers, or those who are more likely to churn, because our call-center resources are limited.

Typically, when optimizing an algorithm's hyperparameters (for example, with grid search via GridSearchCV), a single metric is used, whose improvement we expect to see on the test sample.
There are several ways to combine precision and recall into a single quality criterion. The F-measure (in general, $F_\beta$) is the harmonic mean of precision and recall:

$$F_\beta = (1 + \beta^2)\,\frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$$

Here $\beta$ determines the weight of precision in the metric; with $\beta = 1$ this is the harmonic mean (with a factor of 2, so that $F_1 = 1$ when precision = 1 and recall = 1).
The F-measure reaches its maximum when recall and precision are equal to one, and is close to zero if one of the arguments is close to zero.
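
For completeness, here is a short sketch (assuming the lr model and the X_test, y_test split from the cells above) that computes these metrics directly; fbeta_score takes the beta weight explicitly:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_pred = lr.predict(X_test)
print("Precision: %.3f" % precision_score(y_test, y_pred))
print("Recall:    %.3f" % recall_score(y_test, y_pred))
print("F1:        %.3f" % f1_score(y_test, y_pred))
# beta < 1 shifts the weight towards precision, beta > 1 towards recall
print("F0.5:      %.3f" % fbeta_score(y_test, y_pred, beta=0.5))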
sklearn has a handy function, sklearn.metrics.classification_report, which returns the recall, precision and F-measure for each class, as well as the number of instances of each class.

report = classification_report(y_test, lr.predict(X_test),
                               target_names=["Non-churned", "Churned"])
print(report)

       class  precision    recall  f1-score   support

 Non-churned       0.88      0.97      0.93       941
     Churned       0.60      0.25      0.35       159

 avg / total       0.84      0.87      0.84      1100

It should be noted here that in problems with unbalanced classes, which prevail in real practice, it is often necessary to resort to techniques for artificially modifying the dataset to equalize the class ratio. There are many of them, and we will not touch on them here; you can look at some methods and choose the one that suits your task.

AUC-ROC and AUC-PR

When converting the algorithm's real-valued output (usually a class-membership probability; see SVM separately) into a binary label, we must choose a threshold at which 0 becomes 1. A threshold of 0.5 seems natural, but it does not always turn out to be optimal, for example, in the above-mentioned case of class imbalance.

One way to evaluate the model as a whole, without being tied to a specific threshold, is AUC-ROC (or ROC AUC), the Area Under the Receiver Operating Characteristic Curve. This curve is a line from (0, 0) to (1, 1) in True Positive Rate (TPR) and False Positive Rate (FPR) coordinates:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

TPR is already known to us: it is recall. FPR shows what proportion of the objects of the negative class the algorithm predicted incorrectly. In the ideal case, when the classifier makes no errors (FPR = 0, TPR = 1), the area under the curve equals one; at the other extreme, when the classifier outputs class probabilities at random, the AUC-ROC tends to 0.5, since the classifier then produces the same proportion of true and false positives.
Each point on the graph corresponds to the choice of a certain threshold. The area under the curve shows the quality of the algorithm (more is better); in addition, the steepness of the curve itself matters: we want to maximize TPR while minimizing FPR, which means the curve should ideally tend towards the point (0, 1).

ROC curve drawing code

sns.set(font_scale=1.5)
sns.set_color_codes("muted")

plt.figure(figsize=(10, 8))
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1], pos_label=1)
lw = 2
plt.plot(fpr, tpr, lw=lw, label="ROC curve")
plt.plot([0, 1], [0, 1])              # the diagonal of a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.savefig("ROC.png")
plt.show()


The AUC-ROC criterion is robust to unbalanced classes (spoiler: alas, not everything is so simple) and can be interpreted as the probability that a randomly selected positive object will be ranked higher by the classifier (will receive a higher probability of being positive) than a randomly selected negative object.
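
A minimal sketch computing ROC AUC for our churn model (assuming lr, X_test and y_test from the cells above):

from sklearn.metrics import roc_auc_score

# probabilities of the positive class, not hard labels, go into ROC AUC
print("ROC AUC: %.3f" % roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))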

Consider the following problem: We need to select 100 relevant documents from 1 million documents. We developed two algorithms:

  • Algorithm 1 returns 100 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 10/999,900 ≈ 0.00001.
  • Algorithm 2 returns 2000 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 1910/999,900 ≈ 0.0019.

Most likely, we would choose the first algorithm, which produces very few False Positives compared to its competitor. But the difference in False Positive Rate between these two algorithms is extremely small, only about 0.0019. This is a consequence of the fact that AUC-ROC measures the proportion of False Positives relative to True Negatives, and in tasks where the second (larger) class is not as important to us, it may not give an entirely adequate picture when comparing algorithms.

In order to correct the situation, let's return to completeness and accuracy:

  • Algorithm 1: precision = 90/100 = 0.9, recall = 90/100 = 0.9.
  • Algorithm 2: precision = 90/2000 = 0.045, recall = 90/100 = 0.9.

Here a significant difference between the two algorithms is already noticeable: a difference of 0.855 in precision!
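
The numbers above are easy to verify with a few lines of arithmetic taken straight from the problem statement:

total_docs, relevant = 1_000_000, 100

# Algorithm 1: returns 100 documents, 90 relevant
tp1, fp1 = 90, 10
# Algorithm 2: returns 2000 documents, 90 relevant
tp2, fp2 = 90, 1910

negatives = total_docs - relevant              # 999,900 non-relevant documents
fpr1, fpr2 = fp1 / negatives, fp2 / negatives
prec1, prec2 = tp1 / (tp1 + fp1), tp2 / (tp2 + fp2)

print("FPR difference:       %.4f" % (fpr2 - fpr1))    # ~0.0019
print("Precision difference: %.3f" % (prec1 - prec2))  # 0.855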

Precision and recall are also used to plot a curve and, similar to AUC-ROC, find the area under it.
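
A sketch of the PR curve and the area under it, again assuming lr, X_test and y_test from above; average_precision_score is a step-wise alternative to trapezoidal integration:

from sklearn.metrics import precision_recall_curve, auc, average_precision_score

probs = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

plt.figure(figsize=(10, 8))
plt.plot(recall, precision, lw=2)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve")
plt.show()

print("AUC-PR (trapezoid):            %.3f" % auc(recall, precision))
print("Average precision (step-wise): %.3f" % average_precision_score(y_test, probs))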


It can be noted here that on small datasets the area under the PR curve may be overly optimistic because it is calculated by the trapezoidal rule, but usually in such problems there is enough data. For more information on the relationship between AUC-ROC and AUC-PR, please go here.

Logistic Loss

The logistic loss function stands somewhat apart; it is defined as:

$$logloss = -\frac{1}{\ell}\sum_{i=1}^{\ell}\bigl(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr)$$

where $\hat{y}_i$ is the algorithm's response on the $i$-th object, $y_i$ is the true class label of the $i$-th object, and $\ell$ is the sample size.

The mathematical interpretation of the logistic loss function has already been discussed in detail in the post about linear models.
This metric rarely appears in business requirements, but it often comes up in Kaggle competitions.
Intuitively, one can think of logloss minimization as maximizing accuracy by penalizing incorrect predictions. However, it should be noted that logloss heavily penalizes the classifier for being confident in a wrong answer.

Let's look at an example:

def logloss_crutch(y_true, y_pred, eps=1e-15):
    # clip the prediction away from 0 and 1 so that the logarithm is defined
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print("Logloss with uncertain classification %f" % logloss_crutch(1, 0.5))
>> Logloss with uncertain classification 0.693147

print("Logloss with confident classification and correct answer %f" % logloss_crutch(1, 0.9))
>> Logloss with confident classification and correct answer 0.105361

print("Logloss with confident classification and WRONG answer %f" % logloss_crutch(1, 0.1))
>> Logloss with confident classification and WRONG answer 2.302585
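
In practice there is no need for such a crutch: sklearn ships a ready-made implementation. A short sketch on the same churn model (assuming lr, X_test and y_test from the cells above):

from sklearn.metrics import log_loss

# log_loss expects probabilities of the positive class (or a full probability matrix)
print("Logloss: %.4f" % log_loss(y_test, lr.predict_proba(X_test)[:, 1]))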

Note how dramatically logloss increased with an incorrect answer and a confident classification!
Consequently, an error on a single object can lead to a significant deterioration of the overall error on the sample. Such objects are often outliers, which must be remembered: filter them out or consider them separately.
Everything falls into place if you draw a logloss graph:
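
For reference, a small sketch that draws this curve for ground truth = 1, reusing the logloss_crutch helper and the imports from above:

x = np.linspace(0.001, 0.999, 1000)
plt.figure(figsize=(10, 8))
plt.plot(x, [logloss_crutch(1, p) for p in x])
plt.xlabel("Predicted probability of class 1")
plt.ylabel("Logloss")
plt.title("Logloss for ground truth = 1")
plt.show()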


It can be seen that the closer the algorithm's answer is to zero when the ground truth is 1, the higher the error value and the steeper the curve grows.

Let's summarize:

  • In the case of multi-class classification, you need to carefully monitor the metrics of each class and follow the logic of the business task, not metric optimization
  • In the case of unequal classes, you need to select a balance of classes for training and a metric that will correctly reflect the quality of classification

Thanks to mephistopheies and madrugado for their help in preparing the article.

In the process of preparing a task for the entrance test of the GoTo summer school, we discovered that there is practically no good description of the main ranking metrics in Russian (the task concerned a special case of the ranking problem: building a recommendation algorithm). We at E-Contenta actively use various ranking metrics, so we decided to correct this misunderstanding by writing this article.

The ranking task now comes up everywhere: sorting web pages for a given search query, personalizing a news feed, recommending videos, products, music... In a word, the topic is hot. There is even a special direction in machine learning that studies ranking algorithms capable of learning from data: learning to rank. To choose the best one from the variety of algorithms and approaches, you need to be able to evaluate their quality quantitatively. The most common ranking quality metrics are discussed below.

Briefly about the ranking task

Ranking is the task of sorting a set of elements according to their relevance. Most often, relevance is understood with respect to some object. For example, in the information retrieval problem the object is a query, the elements are all kinds of documents (links to them), and relevance is how well a document matches the query. In the recommendation task, the object is the user, the elements are pieces of recommended content (products, videos, music), and relevance is the probability that the user will use (buy/like/view) this content.

Formally, consider $N$ objects and $M$ elements. The result of a ranking algorithm applied to an object is a mapping that assigns each element a weight characterizing the degree of its relevance to the object (the higher the weight, the more relevant the element). The set of weights thereby defines a permutation of the set of elements (we assume the set of elements is ordered), obtained by sorting them in descending order of weight.

To evaluate the quality of ranking, we need some "gold standard" with which the algorithm's results can be compared. Let $R(e)$ denote the reference relevance function characterizing the "real" relevance of the elements for the given object ($R(e) = 1$ means the element is a perfect match, $R(e) = 0$ means it is completely irrelevant), and let the corresponding permutation be the one sorted by descending $R(e)$. There are two main ways to obtain the reference relevance:

1. From historical data. For example, in content recommendations, you can take the user's views (likes, purchases) and assign weight 1 to the viewed items ($R(e) = 1$) and 0 to all the others.
2. From expert assessment. For example, in a search task, for each query you can involve a team of assessors who manually evaluate the relevance of documents to the query.

It is worth noting that when the reference relevance takes only the extreme values 0 and 1, the permutation is usually not considered, and only the set of relevant elements (those with $R(e) = 1$) is taken into account.

The purpose of a ranking quality metric is to determine how well the relevance scores obtained by the algorithm, and the corresponding permutation, match the true relevance values. Let's look at the main metrics.

Mean average precision

Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. To understand how it works, let's start with the “basics”.

Note: the "*precision" metric is used in binary problems, where it takes only two values: 0 and 1.

Precision at K

Precision at K (p@K), precision on the top K elements, is a basic ranking quality metric for a single object. Suppose our ranking algorithm has produced relevance scores for each element. Selecting the K elements with the largest scores, we can compute the proportion of relevant ones among them. This is exactly what precision at K does:

$$p@K = \frac{1}{K}\sum_{k=1}^{K} R(e_k)$$

Note: here $e_k$ denotes the element that ended up in the $k$-th position after the permutation. So $e_1$ is the element with the largest score, $e_2$ the element with the second largest, and so on.
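
A minimal reference implementation of p@K (pure NumPy, binary relevance labels already sorted by the algorithm's score; the function name is ours):

import numpy as np

def precision_at_k(relevance, k):
    """relevance: binary labels of the ranked list, sorted by predicted score (descending)."""
    relevance = np.asarray(relevance)[:k]
    return relevance.sum() / k

print(precision_at_k([1, 0, 1, 0, 0], k=3))  # 2 relevant among the top 3 -> 0.666...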

Average precision at K

Precision at K is easy to understand and implement, but it has an important drawback: it does not take the order of elements within the "top" into account. So, if out of ten elements we guessed only one, then it does not matter whether it was in the first place or the last: in any case p@10 = 0.1. Yet the first option is obviously much better.

This drawback is mitigated by the ranking metric average precision at K (ap@K), which equals the sum of p@k over the indices k from 1 to K at which the elements are relevant, divided by K:

$$ap@K = \frac{1}{K}\sum_{k=1}^{K} R(e_k)\, p@k$$

So, if out of three elements only the one in last place turned out to be relevant, then $ap@3 = \frac{1}{3}\cdot\frac{1}{3} = \frac{1}{9} \approx 0.11$; if we guessed only the one in first place, then $ap@3 = \frac{1}{3} \approx 0.33$; and if all of them are relevant, then $ap@3 = 1$.
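
A sketch of ap@K, reusing precision_at_k from above, that reproduces the three values from this example:

import numpy as np

def average_precision_at_k(relevance, k):
    relevance = np.asarray(relevance)[:k]
    hits = [precision_at_k(relevance, i + 1) for i in range(k) if relevance[i]]
    return sum(hits) / k

print(average_precision_at_k([0, 0, 1], k=3))  # 1/9 ~ 0.11
print(average_precision_at_k([1, 0, 0], k=3))  # 1/3 ~ 0.33
print(average_precision_at_k([1, 1, 1], k=3))  # 1.0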

Now we can handle map@K too.

Mean average precision at K

Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. In p@K and ap@K, ranking quality is assessed for a single object (user, search query). In practice, there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of map@K is to compute ap@K for each object and average:

$$map@K = \frac{1}{N}\sum_{i=1}^{N} ap@K_i$$

Note: this idea is quite logical if we assume that all users are equally needed and equally important. If this is not the case, then instead of simple averaging, you can use weighted averaging, multiplying ap@K of each object by the weight corresponding to its “importance”.
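
And map@K on top of it: a sketch that simply averages ap@K over all objects, reusing the helpers above (each inner list is one user's ranked relevance labels):

import numpy as np

def mean_average_precision_at_k(relevance_lists, k):
    return np.mean([average_precision_at_k(r, k) for r in relevance_lists])

users = [[0, 0, 1], [1, 0, 0], [1, 1, 1]]
print(mean_average_precision_at_k(users, k=3))  # (1/9 + 1/3 + 1) / 3 ~ 0.48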

Normalized Discounted Cumulative Gain

Normalized discounted cumulative gain (nDCG)- another common ranking quality metric. As with map@K, let's start with the basics.

Cumulative Gain at K

Let's again look at one object and the K elements with the largest scores. Cumulative gain at K (CG@K) is a basic ranking metric that uses a simple idea: the more relevant the elements in this top, the better:

$$CG@K = \sum_{k=1}^{K} R(e_k)$$

This metric has obvious disadvantages: it is not normalized and does not take into account the position of relevant elements.

Note that, unlike p@K, CG@K can also be used in the case of non-binary reference relevance values.

Discounted Cumulative Gain at K

Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that takes the order of elements in the list into account by multiplying the relevance of an element by a weight equal to the inverse logarithm of its position number:

$$DCG@K = \sum_{k=1}^{K} \frac{2^{R(e_k)} - 1}{\log_2(k + 1)}$$

Note: if $R(e_k)$ takes only the values 0 and 1, then $2^{R(e_k)} - 1 = R(e_k)$, and the formula takes a simpler form:

$$DCG@K = \sum_{k=1}^{K} \frac{R(e_k)}{\log_2(k + 1)}$$

The use of the logarithm as a discounting function can be explained by the following intuitive consideration: from a ranking point of view, positions at the beginning of the list differ much more than positions at the end of the list. So, in the case of a search engine, there is a whole abyss between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of search results), while there is not much difference between positions 101 and 111: few people reach them. These subjective considerations are nicely expressed by the logarithm: the discount weight drops from $1/\log_2 2 = 1$ at position 1 to about 0.28 at position 11, a difference of roughly 0.72, while between positions 101 and 111 it changes by only about 0.003.

Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but only aggravates the lack of normalization: if $R(e)$ varies within $[0, 1]$, DCG@K takes values on a segment whose bounds are not immediately obvious. The following metric is designed to solve this problem.

Normalized Discounted Cumulative Gain at K

As you can guess from the name, normalized discounted cumulative gain at K (nDCG@K) is nothing more than a normalized version of DCG@K:

$$nDCG@K = \frac{DCG@K}{IDCG@K}$$

where IDCG@K is the maximum ("I" for ideal) value of DCG@K. Since we agreed that $R(e)$ takes values in $[0, 1]$, nDCG@K takes values in $[0, 1]$ as well.

Thus, nDCG@K inherits from DCG@K the accounting for element positions and, at the same time, takes values in the range from 0 to 1.
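
Below is a sketch of DCG@K and nDCG@K under the exponential-gain form of the formula above (graded relevance labels of the list, already sorted by the algorithm's score; function names are ours):

import numpy as np

def dcg_at_k(relevance, k):
    relevance = np.asarray(relevance, dtype=float)[:k]
    positions = np.arange(1, len(relevance) + 1)
    return np.sum((2 ** relevance - 1) / np.log2(positions + 1))

def ndcg_at_k(relevance, k):
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)   # DCG of the ideal ordering
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))  # ~0.95 for this toy ranking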

Note: by analogy with map@K, one can compute nDCG@K averaged over all objects.

Mean reciprocal rank

Mean reciprocal rank (MRR) is another frequently used ranking quality metric. It is given by the following formula:

$$MRR = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{rank_i}$$

where $1/rank_i$, the reciprocal rank for the $i$-th object, is a very simple quantity: the inverse of the rank of the first correctly guessed element.

Mean reciprocal rank varies over the range [0, 1] and takes the positions of elements into account. Unfortunately, it does so only for one element, the first correctly predicted one, paying no attention to all subsequent ones.
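
A sketch of MRR over several objects (binary relevance labels of each ranked list; names are ours):

import numpy as np

def mean_reciprocal_rank(relevance_lists):
    rr = []
    for rel in relevance_lists:
        hits = np.flatnonzero(rel)                 # positions of relevant elements (0-based)
        rr.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return np.mean(rr)

print(mean_reciprocal_rank([[0, 0, 1], [1, 0, 0], [0, 1, 0]]))  # (1/3 + 1 + 1/2) / 3 ~ 0.61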

Metrics based on rank correlation

Separately, it is worth highlighting ranking quality metrics based on one of the rank correlation coefficients. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their rank (order). Let's consider the two most common rank correlation coefficients: Spearman's and Kendall's.

Kendall's rank correlation coefficient

The first of these, Kendall's correlation coefficient, is based on counting concordant (and discordant) pairs in the two permutations: pairs of elements to which the permutations assigned the same (respectively, different) order:

$$\tau = \frac{P - Q}{\frac{1}{2}n(n - 1)}$$

where P is the number of concordant pairs, Q is the number of discordant pairs, and n is the number of elements.

Spearman's rank correlation coefficient

The second, Spearman's rank correlation coefficient, is essentially nothing more than the Pearson correlation computed on the rank values. There is a fairly convenient formula that expresses it directly in terms of ranks:

$$\rho = 1 - \frac{6\sum_{i} d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks assigned to element $i$ by the two permutations, and $n$ is the number of elements.
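
Both coefficients are available in SciPy, so in practice they are rarely implemented by hand; a quick sketch on two toy score vectors:

from scipy.stats import kendalltau, spearmanr

algorithm_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
true_relevance   = [1.0, 0.6, 0.8, 0.2, 0.0]

tau, _ = kendalltau(algorithm_scores, true_relevance)
rho, _ = spearmanr(algorithm_scores, true_relevance)
print("Kendall tau:  %.3f" % tau)
print("Spearman rho: %.3f" % rho)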

Metrics based on rank correlation have a drawback that is already known to us: they do not take into account the position of elements (even worse than p@K, since the correlation is calculated over all elements, and not over the K elements with the highest rank). Therefore, in practice they are used extremely rarely.

Metrics based on cascading behavior model

Up to this point, we have not delved into how the user (from here on, we consider the special case where the object is a user) examines the elements offered to him. In fact, we implicitly assumed that viewing each element is independent of viewing other elements, a kind of "naivety". In practice, elements are often viewed by the user one by one, and whether the user views the next element depends on his satisfaction with the previous ones. Consider an example: in response to a search query, the ranking algorithm offered the user several documents. If the documents in positions 1 and 2 turned out to be extremely relevant, then the probability that the user will view the document in position 3 is small, because he will be quite satisfied with the first two.

Models of user behavior in which the user examines the offered elements sequentially and the probability of viewing an element depends on the relevance of the previous ones are called cascade models.

Expected reciprocal rank

Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the following formula:

$$ERR = \sum_{r=1}^{n}\frac{1}{r}\,P(\text{the user stops at position } r)$$

where rank is understood in order of decreasing score. The most interesting part of this metric is the probabilities, which are computed using the assumptions of the cascade model:

$$P(\text{the user stops at position } r) = p_r \prod_{i=1}^{r-1}(1 - p_i)$$

where $p_i$ is the probability that the user will be satisfied with the object at rank $i$. These probabilities are computed from the reference relevance values. Since in our case the relevance is binary, we can use a simple option:

$$p_i = R(e_i),$$

which can be read as: the true relevance of the element at position $i$.
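
A sketch of ERR that takes the per-position satisfaction probabilities as input (the helper name is ours); with the simple binary option above, these probabilities are just the reference relevance labels of the ranked list:

def expected_reciprocal_rank(satisfaction_probs):
    """satisfaction_probs[i]: probability that the user is satisfied by the element at position i+1."""
    err, not_satisfied_yet = 0.0, 1.0
    for r, p_r in enumerate(satisfaction_probs, start=1):
        err += not_satisfied_yet * p_r / r
        not_satisfied_yet *= 1 - p_r
    return err

# the simple binary option: p_i equals the true relevance of the element at position i
print(expected_reciprocal_rank([0, 1, 1]))  # 0.5: the user is satisfied at position 2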

Often in the practice of a systems analyst who writes an FRD (functional requirements document), there are things that cannot be formalized. Examples are requirements like:

  • The application must run quickly
  • The application should consume little traffic
  • The video material must be of high quality.

Such requirements, when written into the FRD "as is", are a terrible source of problems later. Formalizing such requirements is a constant headache for the analyst. Typically, an analyst solves a problem in two steps: first, an “equivalent” formal requirement is put forward, then in the process of communication (with a customer, subject matter expert, etc.) it is proven that such a formal requirement can replace the original requirement. Generally speaking, the requirement we obtained is not functional; it describes not “what” the system should be able to do, but “how to do it.” In this case, “how to do” must be formulated with a specific qualitative characteristic.

That was the preamble to the thesis that a systems analyst must have a good command of the mathematical apparatus and, at the same time, be able to explain the "mathematics" to the customer. Now let's look at an example.

About the classification problem

Let's say we're writing an FRD for a contextual advertising system similar to Amazon Omakase. One of the modules of our future system will be a context analyzer:

The analyzer takes the text of a web page as input and performs contextual analysis of it. How it does this is of no particular interest to us; what matters is that at the output we get a set of product categories (from a predetermined set). Based on these categories, we can then display banners, product links (like Amazon), and so on. For us, the analyzer is still a black box to which we can ask a question (in the form of document text) and receive an answer.

The customer wants the analyzer to be "good at detecting context". We need to formulate what this requirement means. First, let's talk about context as such, i.e. the very set of categories returned by the analyzer. We can define this as a classification problem, in which a document (web page) is assigned a set of classes from a predetermined set; in our case, the classes are product categories. The classification problem is quite common in text processing (for example, spam filters).

Evaluation Metrics

Let's look at evaluation metrics applicable to the classification problem. Suppose we know the correct categories for a certain number of documents. Let's group the responses of our hypothetical analyzer as follows:

  • True positives - categories that we expected to see and that were indeed returned
  • False positives - categories that should not be in the output, but the analyzer erroneously returned them
  • False negatives - categories that we expected to see, but the analyzer did not identify them
  • True negatives - categories that should not be in the output, and they are indeed, quite correctly, absent from the analyzer's output.

Let's call the set of documents (web pages) for which we know the correct answers a test sample. If we count the number of hits for each category (counting hits by document-category pairs), we obtain the canonical table of the distribution of answers:

                          expected          not expected
classifier: positive        tp                   fp
classifier: negative        fn                   tn

The left column of the table contains the "correct" document-category combinations (those we expect to see in the output), the right column the incorrect ones. The top row of the table contains the positive responses of the classifier, the bottom row the negative ones (in our case, the absence of a category in the response). If the total number of document-category pairs is N, then it is easy to see that

$$tp + fp + fn + tn = N$$

In principle, one could now write the customer's requirement as $fp + fn = 0$ (the number of incorrect answers is zero) and stop there. However, in practice such systems do not exist, and the analyzer will, of course, make errors relative to the test sample. The accuracy metric helps us understand the percentage of errors:

$$accuracy = \frac{tp + tn}{N}$$

In the numerator we see the diagonal of the matrix - the total number of correct answers, which is divided by the total number of questions. For example, an analyzer that gives 9 correct answers out of 10 possible has an accuracy of 90%.

The F1 metric

A simple example of the inapplicability of accuracy metrics is the task of identifying a shoe brand. Let's say we want to count the number of mentions of shoe brands in the text. Consider a classification problem whose goal is to determine whether a specified entity is a shoe brand (Timberland, Columbia, Ted Baker, Ralph Lauren, etc.). In other words, we divide the entities in the text into two classes: A - Shoe brand, B - Everything else.

Now consider a degenerate classifier that simply returns class B ("everything else") for any entity. For this classifier, the number of true positive answers will be 0. Generally speaking, let's think: how often do we encounter shoe brands when reading text on the Internet? It turns out, oddly enough, that in general 99.9999% of the words in a text are not shoe brands. Let's build the response distribution matrix for a sample of 100,000 entities, assuming 10 of them really are shoe brands:

                          shoe brand (A)    everything else (B)
classifier: class A           tp = 0              fp = 0
classifier: class B           fn = 10             tn = 99,990

Let's calculate its accuracy: 99,990 / 100,000 = 99.99%! So we have easily built a classifier that essentially does nothing, yet has a huge percentage of correct answers. At the same time, it is completely clear that we have not solved the shoe brand identification task. The fact is that the correct entities in our text are heavily "diluted" by other words that mean nothing for the classification. Given this example, it is quite understandable to want to use other metrics. For example, the value tn here is clearly "junk": it nominally counts correct answers, but its sheer size "suppresses" the contribution of tp (which is what matters to us) in the accuracy formula.

Let us define the precision measure (P) as:

$$P = \frac{tp}{tp + fp}$$

As is easy to see, the precision measure characterizes how many of the positive answers received from the classifier are correct. The higher the precision, the lower the number of false hits.

The precision measure, however, does not tell us whether the classifier returned all the correct answers. For this there is the so-called recall measure (R):

$$R = \frac{tp}{tp + fn}$$

The recall measure characterizes the classifier's ability to "guess" as many of the expected positive answers as possible. Note that false positive responses do not affect this metric at all.

Precision and Recall give a fairly comprehensive description of the classifier, "from different angles". Usually, when building systems of this kind, one constantly has to balance between these two metrics. If you try to increase Recall by making the classifier more "optimistic", Precision drops because of the growing number of false positives. If you tweak the classifier to be more "pessimistic", for example by filtering the results more strictly, then Precision grows at the cost of a simultaneous drop in Recall, because some correct answers are rejected. Therefore, it is convenient to characterize the classifier with a single value, the so-called F1 metric:

$$F_1 = \frac{2PR}{P + R}$$

In fact, it is simply the harmonic mean of P and R. The F1 metric reaches its maximum of 1 (100%) when P = R = 100% (it is easy to see that for our degenerate classifier F1 = 0). The F1 value is one of the most common metrics for systems of this kind, and it is F1 that we will use to formulate the threshold quality of our analyzer in the FRD.

There are two main approaches to computing F1 for a classification problem.

  • Total F1: the results for all classes are summed into one single contingency table, from which the F1 metric is then calculated.
  • Average F1: a separate contingency matrix and a separate F1 value are computed for each class, and then the simple arithmetic mean over all classes is taken.

Why is the second method needed? Sample sizes for different classes can vary greatly: for some classes we may have very few examples, for others many. As a result, the metrics of one "large" class, once merged into one common table, will drown out all the others. In a situation where we want to evaluate the quality of the system more or less uniformly across all classes, the second option is better suited.
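
In scikit-learn terms, these two approaches roughly correspond to micro- and macro-averaged F1; a small sketch on toy labels:

from sklearn.metrics import f1_score

y_true = ["brand", "other", "other", "brand", "other", "other", "other"]
y_pred = ["brand", "other", "brand", "other", "other", "other", "other"]

print("Total (micro) F1:   %.3f" % f1_score(y_true, y_pred, average="micro"))  # ~0.714
print("Average (macro) F1: %.3f" % f1_score(y_true, y_pred, average="macro"))  # 0.650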

Training and test set

Above we considered classification on a single sample, for which we know all the answers. If you apply this to the context parser we're trying to describe, things get a little more complicated.

First of all, we must fix the product categories. A situation in which we guarantee some F1 value while the set of classes can expand indefinitely is practically a dead end. Therefore, it is additionally stipulated that the set of categories is fixed.

We calculate the F1 value on a given sample that is known in advance. This sample is usually called the training sample. However, we do not know how the classifier will behave on data unknown to us. For these purposes a so-called test sample is used, sometimes called the golden set. The difference between the training and test samples is purely nominal: given a set of examples, we can split it into a training and a test part however we like. But for self-learning systems, forming a correct training sample is critical: badly chosen examples can strongly affect the quality of the system.

A typical situation is when a classifier shows good results on the training set and fails completely on the test set. If our classification algorithm is based on machine learning (i.e. depends on the training sample), we can evaluate its quality with a more complex "floating" scheme. We split all the available examples into, say, 10 parts. We hold out the first part and train the algorithm on the remaining 90% of the examples; the held-out part is then used as a test sample to compute an F1 value. Then we hold out the second part and train on the rest, obtaining another F1 value, and so on. As a result we get 10 F1 values, whose arithmetic mean becomes the final result. I repeat that this method (also called cross-validation) only makes sense for algorithms based on machine learning.
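
In scikit-learn terms, the "floating" scheme above is essentially k-fold cross-validation and is a one-liner; a sketch assuming some classifier and a labeled dataset X, y (hypothetical names):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()                      # any classifier with fit/predict would do
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")
print("Average F1 over 10 folds: %.3f +/- %.3f" % (scores.mean(), scores.std()))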

Returning to writing the FRD, we notice that our situation is much worse. We have a potentially unlimited set of input data (all web pages on the Internet), and there is no way to evaluate the context of a page other than by human judgement. Thus, our sample can only be formed manually and depends heavily on the whims of the compiler (the decision of whether to assign a page to a particular category is made by a person). We can estimate F1 on the examples known to us, but there is no way to find out F1 for all Internet pages. Therefore, for potentially unlimited data sets (such as web pages, of which there are countless numbers), an evaluation on a manually checked random sample is sometimes used: a certain number of examples (pages) are chosen at random, and for them an operator (a person) compiles the correct set of categories (classes). We can then test the classifier on these selected examples. Further, assuming that the chosen examples are typical, we can approximately estimate the precision of the algorithm. At the same time, we cannot estimate Recall (it is unknown how many correct answers lie outside the chosen examples), and therefore we cannot compute F1 either.

Thus, if we want to know how the algorithm behaves on all possible input data, the best we can estimate in this situation is an approximate Precision value. If everyone agrees to use a predetermined fixed sample, then the average F1 value can be calculated on that sample.

What do we end up with?

In the end we will have to do the following:

  1. Fix the training sample. The training sample will be built based on the customer's ideas of the "correct" context.
  2. Fix the set of categories for our analyzer. We can't calculate F1 over an indefinite set of classes, can we?
  3. Formulate the requirement as: the analyzer must determine the context with an average F1 value of at least 80% (for example).
  4. Explain this to the customer.

As you can see, writing an FRD for such a system is not easy (especially the last point), but it is possible. As for the threshold value of F1, in such cases you can start from the F1 values achieved in similar classification problems.

This chapter presents popular methods for assessing the quality of a classification model that are also used in other works on this topic, together with their description and a justification of the metrics chosen for this assessment.

Quality assessment metrics

Overall accuracy

This metric is one of the simplest and at the same time most universal metrics for assessing the quality of classification algorithms. Its value is calculated as the proportion of correctly classified objects out of the total number of objects in the sample. The metric is popular due to its simplicity and its straightforward extension to any number of classes. Its main disadvantage is that it assigns the same weight to all documents, which may be inappropriate if the documents in the training set are strongly skewed towards one or more classes. The metric can take a high value while the classifier performs extremely poorly within a single class, and the metric does not signal this in any way.

Precision, recall and F-measure

Metrics such as precision and recall first became widespread in assessing the quality of information retrieval systems. The precision of the system within a single class is the proportion of objects actually belonging to the class among all objects assigned to that class by the system. Recall is the proportion of objects found by the classifier that belong to the class relative to all objects of that class. Table 1 is the contingency table of an individual class, where TP (true positive) is a true positive decision, TN (true negative) is a true negative decision, FP (false positive) is a false positive decision, and FN (false negative) is a false negative decision.

Table 1 - Object class contingency table

                         belongs to the class    does not belong to the class
assigned to the class            TP                           FP
not assigned to the class        FN                           TN

Thus, precision and recall are calculated as:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

The F-measure combines information about the precision and recall of the algorithm being evaluated. It is calculated as the harmonic mean of precision and recall:

$$F = 2\,\frac{Precision \cdot Recall}{Precision + Recall}$$

Because the F-measure is calculated separately for each class, it is convenient for finding and analyzing specific errors of the algorithm and for evaluating classification with several classes. In the case of a large number of classes, however, a characteristic is needed that aggregates recall and precision over all classes and describes the overall behavior of the system. In this work, the following aggregated values are used for this purpose: macro precision, calculated as the arithmetic mean of the precision over all classes; macro recall, calculated as the arithmetic mean of the recall over all classes; and the macro F-measure (Macro F-score), the harmonic mean of the two.

Cross-validation

One of the most common methods for thorough testing and quality assessment of various machine learning algorithms is cross-validation. For an independent sample, this method provides an unbiased estimate of the error probability, in contrast to the mean error on the training sample, which can be a biased estimate of the error probability due to overfitting. Another advantage of this procedure is the ability to estimate the error probability of an algorithm when no control sample specially designed for testing is available.

Let $X$ be a set of feature descriptions of objects on which a finite sample of precedents $X^m = \{(x_1, y_1), \dots, (x_m, y_m)\}$ is given, where $Y$ is a finite set of classes. A learning method maps an arbitrary sample of precedents to an algorithm $a$. Then the quality of algorithm $a$ on an arbitrary sample of precedents $X^m$ is assessed using the quality functional:

$$Q(a, X^m) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(a(x_i), y_i),$$

where $\mathcal{L}$ is some non-negative loss function that returns the magnitude of the algorithm's error on object $x_i$ given the correct class label $y_i$.