Evaluation

The solutions will be compared using the area under the receiver operating characteristic curve (ROC AUC) and balanced accuracy (BACC). ROC AUC will be used to rank the submissions, whereas BACC will serve as an additional measure. The metrics will be computed with roc_auc_score and balanced_accuracy_score from the scikit-learn (sklearn) Python library. An example script can be found here.
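
A minimal sketch of the metric computation with scikit-learn is shown below; the labels and predicted probabilities are illustrative, and the 0.5 threshold used to derive hard labels for BACC is an assumption for the example, not part of the official scoring.

```python
# Minimal sketch of the metric computation with scikit-learn.
# The labels, probabilities, and 0.5 threshold are illustrative only.
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])              # ground-truth labels
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.9, 0.3])  # predicted probability of the positive class

roc_auc = roc_auc_score(y_true, y_prob)                # uses the probabilities directly
bacc = balanced_accuracy_score(y_true, y_prob >= 0.5)  # needs hard labels, here thresholded at 0.5

print(f"ROC AUC: {roc_auc:.3f}")
print(f"BACC:    {bacc:.3f}")
```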

ROC AUC:

The performance of a classifier can be assessed with a ROC curve by sweeping a range of thresholds over the classifier's probabilistic output and computing the true positive rate (sensitivity) and false positive rate (1 − specificity) at each threshold. ROC AUC is the area under this curve and equals the probability that a randomly chosen positive sample receives a higher predicted probability of being positive than a randomly chosen negative sample [Fawcett 2006].
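
The sketch below illustrates this probabilistic interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score (ties counted as 0.5) matches the value returned by roc_auc_score. The synthetic data are for illustration only.

```python
# Sketch of the probabilistic interpretation of ROC AUC: the fraction of
# (positive, negative) pairs where the positive sample scores higher
# (ties counted as 0.5) equals the area under the ROC curve.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=200), 0, 1)

pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
diffs = pos[:, None] - neg[None, :]                       # all positive-negative score differences
pairwise_auc = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size

print(pairwise_auc, roc_auc_score(y_true, y_prob))        # the two values agree
```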

Balanced accuracy:

Balanced accuracy (BACC) normalizes for class size and is useful when the dataset is imbalanced [Brodersen et al. 2010]. The probabilistic nature of the predictions is not utilized, because only the most likely class for each sample is considered. BACC is calculated with the equation:

\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),

where TP = number of true positives, FP = number of false positives, TN = number of true negatives, and FN = number of false negatives. True positives are data points with true label i that are correctly classified as such, while false negatives are data points with true label i that are incorrectly classified into a different class j ≠ i. True negatives and false positives are defined analogously.
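
For completeness, the sketch below writes this formula out from a confusion matrix and checks it against sklearn's balanced_accuracy_score; the example labels are illustrative only.

```python
# Sketch of the BACC formula computed from the confusion matrix and
# checked against balanced_accuracy_score; the example data are illustrative.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
bacc_manual = 0.5 * (tp / (tp + fn) + tn / (tn + fp))   # mean of sensitivity and specificity

print(bacc_manual, balanced_accuracy_score(y_true, y_pred))  # both give 0.5
```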

Missing data:

A prediction must be submitted for every sample. If a prediction is missing, it is treated as an incorrect prediction made with probability 1.0.
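
As an illustration of this rule, the sketch below replaces a missing prediction with a maximally confident wrong prediction (probability 1.0 for the incorrect class). The helper function and the NaN convention for marking missing values are assumptions made for the example, not part of the official evaluation code.

```python
# Sketch of how a missing prediction could be penalized under this rule:
# it is replaced by a maximally confident wrong prediction. The function
# name and NaN convention are illustrative assumptions.
import numpy as np

def fill_missing(y_true, y_prob):
    """Replace NaN predictions with a wrong prediction made at probability 1.0."""
    y_prob = np.asarray(y_prob, dtype=float).copy()
    missing = np.isnan(y_prob)
    # For a missing entry, the predicted probability of the positive class
    # becomes 0.0 when the true label is positive and 1.0 when it is negative.
    y_prob[missing] = 1.0 - np.asarray(y_true)[missing]
    return y_prob

y_true = np.array([1, 0, 1])
y_prob = np.array([0.9, np.nan, np.nan])
print(fill_missing(y_true, y_prob))  # [0.9, 1.0, 0.0]
```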

References