# New Weighted Evaluation Metric for Semantic Segmentation Algorithms (Part 2)

In the previous post, we discussed various aspects of evaluating segmentation algorithms using different metrics, and we concluded that the existing metrics treat abnormality clusters of all sizes the same. However, we want to emphasize the abnormal pixels that are difficult for doctors to identify, which are usually small in size and isolated from the rest of the abnormal pixels. We would therefore judge an algorithm as performing better only if it detects small, isolated abnormalities along with the obvious clusters.

Our derivation is based on the understanding that AI algorithms will be used in healthcare to assist doctors, and so they should be able to detect abnormalities that are hard to find even for trained doctors. We therefore propose a new evaluation metric, essentially a modification of the existing metrics, in which importance is given to smaller, isolated, non-obvious abnormalities. The new metric should judge an algorithm as performing badly if it detects only the large clusters that are clearly visible to doctors while missing the small, isolated abnormality clusters.

Segmentation in the medical domain involves detecting unhealthy regions in a given patient image. The proportion of unhealthy regions in the image varies with the type of disease, but in the medical domain it is typically small. The challenge for segmentation algorithms is therefore to identify those small, isolated segments along with the obvious, large unhealthy regions. With current evaluation metrics such as accuracy, sensitivity, specificity, precision, area under the ROC curve, and area under the precision-recall curve, it is very difficult to evaluate the performance of a segmentation algorithm because of the imbalance in the data set: there are far more true negatives (TN) than any of the other outcomes.
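To make this imbalance problem concrete, here is a small toy example (not from the post) showing how plain accuracy rewards a model that never finds a single abnormal pixel:

```python
# Toy illustration of why accuracy misleads on imbalanced segmentation data:
# with ~1% abnormal pixels, a model that predicts "healthy" everywhere
# still scores ~99% accuracy while detecting nothing.
ground_truth = [1] * 10 + [0] * 990   # 1% abnormal pixels
prediction   = [0] * 1000             # predicts every pixel as healthy

tp = sum(g == 1 and p == 1 for g, p in zip(ground_truth, prediction))
tn = sum(g == 0 and p == 0 for g, p in zip(ground_truth, prediction))
accuracy = (tp + tn) / len(ground_truth)
sensitivity = tp / sum(ground_truth)

print(accuracy)     # 0.99 -- looks excellent
print(sensitivity)  # 0.0  -- finds no abnormal pixels at all
```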

We should penalize an algorithm more heavily for ignoring smaller and isolated clusters, and the penalty should depend on the proportionate area of each cluster; none of the traditional metrics does this. Every metric we consider is derived from the confusion matrix, which is the building block for all the other evaluation metrics. The equations for these metrics are given below.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

### Derivation of weighted evaluation metrics

Let us consider the synthetic image below for illustration. It contains the ground truth and the prediction of our segmentation algorithm. There are two large and two small exudate clusters in the ground-truth image. The segmentation algorithm predicts both large exudate clusters and one small exudate cluster; these pixels are counted as true positives (TP). It fails to identify one small exudate segment, whose pixels are counted as false negatives (FN). It also wrongly predicts a small segment of healthy pixels as exudate; these are counted as false positives (FP). The remaining pixels are counted as true negatives (TN), since they are healthy regions that the algorithm also predicts as non-exudate.

Since we want algorithms to predict the smaller, isolated exudate segments along with the large clusters, we want to give higher weight to smaller clusters compared to larger ones. This leaves two unknowns:

- What should be the threshold value dividing smaller and larger clusters?
- What weight should be given to smaller and larger clusters?

In any healthcare problem, there is no standard size or shape of abnormality. Hence we use the inverse ratio of the area of each abnormal segment to the area of the largest abnormal segment in the ground truth (GT). **If smaller, isolated segments are predicted correctly or incorrectly, the corresponding TP and FN counts are amplified accordingly.**

$$w_i = \left(\frac{A_{\max}}{A_i}\right)^{1/\alpha} \tag{4}$$

where $A_i$ is the area of abnormal cluster $i$ in the ground truth, $A_{\max}$ is the area of the largest ground-truth cluster, and $\alpha$ controls how strongly smaller clusters are emphasized.
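The original equation was published as an image that has not survived here, so the exact form below is an assumption: we take the weight of a cluster to be the inverse area ratio to the largest ground-truth cluster, sharpened by alpha, i.e. `w_i = (A_max / A_i) ** (1 / alpha)`, which matches the stated behavior (alpha = 1 gives the plain inverse ratio; larger alpha pushes all weights toward 1).

```python
# Hedged sketch of the cluster-weight idea in equation (4); the exponent
# 1/alpha is our assumption, not confirmed by the original post.
def cluster_weight(area, max_area, alpha=1.0):
    """Weight for a ground-truth cluster of the given pixel area."""
    assert area > 0 and alpha > 0
    return (max_area / area) ** (1.0 / alpha)

areas = [500, 480, 20, 5]          # two large, two small exudate clusters
max_area = max(areas)
print([round(cluster_weight(a, max_area, alpha=1.0), 2) for a in areas])
# [1.0, 1.04, 25.0, 100.0] -- smaller clusters get much larger weights
```

Note how, with this form, increasing alpha flattens the weights toward 1, reproducing the behavior described later for large alpha.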

In the traditional computation of the confusion matrix, we iterate over each pixel in the ground truth and compare it with the corresponding pixel in the segmentation prediction. Based on this comparison, we accumulate the TP, TN, FP, and FN values of the confusion matrix.
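The traditional pixel-wise computation described above can be sketched in a few lines:

```python
# Minimal pixel-wise confusion matrix: iterate over every ground-truth
# pixel and compare it with the corresponding predicted pixel.
def confusion_matrix(gt, pred):
    """gt, pred: flat lists of 0 (healthy) / 1 (abnormal) pixel labels."""
    tp = tn = fp = fn = 0
    for g, p in zip(gt, pred):
        if g == 1 and p == 1:
            tp += 1
        elif g == 1 and p == 0:
            fn += 1
        elif g == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn

gt   = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(gt, pred))  # (2, 2, 1, 1)
```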

When computing our new weighted confusion matrix, however, each comparison is weighted by the area of the cluster it belongs to, so that a proportionate value is added to weighted_TP, weighted_TN, weighted_FP, and weighted_FN. Once the weighted confusion matrix is calculated, we can proceed to compute sensitivity, specificity, precision, accuracy, F1-score, and so on from it.

$$\text{weighted\_TP} = \sum_{i} w_i \cdot TP_i \tag{5}$$

$$\text{weighted\_FN} = \sum_{i} w_i \cdot FN_i \tag{6}$$

$$\text{weighted\_FP} = FP \tag{7}$$

$$\text{weighted\_TN} = TN \tag{8}$$

where $TP_i$ and $FN_i$ are the true-positive and false-negative pixel counts within ground-truth cluster $i$; FP and TN pixels lie outside every ground-truth cluster and keep unit weight. The usual metrics then follow from the weighted confusion matrix:

$$\text{weighted sensitivity} = \frac{\text{weighted\_TP}}{\text{weighted\_TP} + \text{weighted\_FN}} \tag{9}$$

$$\text{weighted specificity} = \frac{\text{weighted\_TN}}{\text{weighted\_TN} + \text{weighted\_FP}} \tag{10}$$

$$\text{weighted precision} = \frac{\text{weighted\_TP}}{\text{weighted\_TP} + \text{weighted\_FP}} \tag{11}$$

$$\text{weighted accuracy} = \frac{\text{weighted\_TP} + \text{weighted\_TN}}{\text{weighted\_TP} + \text{weighted\_TN} + \text{weighted\_FP} + \text{weighted\_FN}} \tag{12}$$

The alpha value can range from just above 0 to infinity, but ideally it lies between 0 and 1. Set it to 0.1 if more importance has to be given to smaller clusters; as alpha increases, the importance of smaller clusters approaches that of larger clusters. Following is the pseudo-code.
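The pseudo-code in the original post was an image; the sketch below is our reconstruction of it, not the authors' code. It assumes 4-connected clusters, the weight form `(A_max/A_i)**(1/alpha)` from the discussion of equation (4), and that FP and TN pixels (which belong to no ground-truth cluster) keep unit weight.

```python
# Reconstruction sketch of the weighted confusion matrix (assumptions noted
# in the lead-in; binary 2-D masks of 0/1 pixel labels).
from collections import deque

def label_clusters(mask):
    """Label 4-connected clusters of 1s; returns dict label -> set of pixels."""
    h, w = len(mask), len(mask[0])
    seen, clusters, next_label = set(), {}, 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1 and (y, x) not in seen:
                q = deque([(y, x)])
                seen.add((y, x))
                clusters[next_label] = {(y, x)}
                while q:                     # breadth-first flood fill
                    cy, cx = q.popleft()
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            clusters[next_label].add((ny, nx))
                            q.append((ny, nx))
                next_label += 1
    return clusters

def weighted_confusion_matrix(gt, pred, alpha=1.0):
    clusters = label_clusters(gt)
    max_area = max(len(c) for c in clusters.values())
    w_tp = w_tn = w_fp = w_fn = 0.0
    cluster_pixels = set()
    for pixels in clusters.values():
        weight = (max_area / len(pixels)) ** (1.0 / alpha)  # assumed form
        for (y, x) in pixels:
            cluster_pixels.add((y, x))
            if pred[y][x] == 1:
                w_tp += weight               # amplified true positive
            else:
                w_fn += weight               # amplified false negative
    for y in range(len(gt)):                 # pixels outside GT clusters
        for x in range(len(gt[0])):
            if (y, x) not in cluster_pixels:
                if pred[y][x] == 1:
                    w_fp += 1.0
                else:
                    w_tn += 1.0
    return w_tp, w_tn, w_fp, w_fn

# One large 2x2 cluster detected, one isolated pixel missed:
gt = [[1, 1, 0, 0],
      [1, 1, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 1]]
pred = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
w_tp, w_tn, w_fp, w_fn = weighted_confusion_matrix(gt, pred, alpha=1.0)
print(w_tp / (w_tp + w_fn))  # 0.5, vs. a traditional sensitivity of 0.8
```

Missing the single isolated pixel costs as much as missing the whole large cluster here, which is exactly the emphasis the weighted metric is designed to give.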

Hopefully the pseudo-code is self-explanatory. Since we are evaluating segmentation algorithms, we have to compare the performance of different algorithms solving the same problem. Later in this post, we will see examples of different algorithms predicting hard exudates and compare their performance with respect to the weighted evaluation. But first, for illustration, let's take a simple synthetic ground truth and predictions to see how the weighted evaluation metric rewards the detection of smaller, isolated regions. Following is a comparative study of five algorithms' predictions for a single input image, with its corresponding ground truth used to calculate both the traditional and weighted confusion matrices. As you can see, the first prediction identifies only the larger cluster, so according to sensitivity, 60% of the abnormal (TP) pixels are identified, but its weighted sensitivity credits it with only 47.3%. The second prediction identifies the smaller, isolated pixels but fails to identify the larger clusters; in this case, the weighted sensitivity metric says this algorithm performed better than the first one. The calculation below is done with the alpha value kept at 1.

We used the dataset provided by the IDRiD (Indian Diabetic Retinopathy Image Dataset) competition, but by the time we started, the competition had already closed, so we ended up using only the dataset and trained our own model to predict hard exudates. Hard exudates on average constitute less than 1% of the entire image, so processing and feeding the entire image directly to the neural network was not fruitful: our model was not able to learn such minute details from the existing features. Hence we used patch-based processing, where we first divide the image into small 40×40 patches and then feed these patches to the CNN model to train it for hard exudate prediction.
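The patch-splitting step can be sketched as follows (a hedged illustration, not the authors' preprocessing code; it assumes non-overlapping patches and drops any partial border patches):

```python
# Split an image into non-overlapping patch x patch tiles, as in the
# 40x40 patch-based preprocessing described above.
def extract_patches(image, patch=40):
    """image: 2-D list (H x W); returns a list of patch-sized 2-D lists."""
    h, w = len(image), len(image[0])
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append([row[x:x + patch] for row in image[y:y + patch]])
    return patches

image = [[0] * 120 for _ in range(80)]   # a dummy 80x120 "image"
patches = extract_patches(image)
print(len(patches))                      # 6 patches (2 rows x 3 columns)
```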

We used UNet as the base architecture and tuned the hyperparameters to suit patch-based exudate detection on the IDRiD dataset. We were able to achieve the following performance metrics.

Following are some of the outputs generated by the model we trained, along with their corresponding traditional and weighted evaluation metric values.

In conclusion, a weighted metric should be used for evaluating semantic segmentation algorithms, especially in the medical domain, where there is imbalance in both the data set and the classes. We are considering an important aspect of evaluation: rewarding algorithms that can identify small segments isolated from the rest of the clusters. This evaluation metric gives a better understanding of how well semantic segmentation algorithms are performing. So instead of computing the confusion matrix by the traditional method, it should be computed with the ratio of cluster areas taken into account, so that the evaluation of algorithms makes sense in the medical domain, especially where the important segments to be detected are small and isolated.