New Weighted Evaluation Metric for Semantic Segmentation Algorithms P1

In this post, we will understand why healthcare AI needs a new evaluation metric, but before jumping in directly, let's set some context: the what, where, and how. The use of AI in healthcare has increased substantially lately because it can help doctors diagnose patients better. It can surface insightful signals that doctors didn't even think of using for diagnosis. Many medical institutions are collaborating with big companies and start-ups to tackle life-threatening diseases using AI; together they collate data and use it to train AI models that assist doctors. That said, it doesn't mean AI is going to replace doctors; AI is nowhere near that stage, but it is evolving fast. You can find Google CEO Sundar Pichai talking about this in his Google I/O keynote speech.

AI does not just tell doctors whether a person has abnormalities by looking at X-rays or CT scans; it can pinpoint, pixel by pixel, exactly where in the scans those abnormalities are, so that doctors can focus their analysis on those areas. In this post, we will focus on the AI algorithms (segmentation) that identify abnormalities in scans, how those algorithms are evaluated, and with what metrics.

What are segmentation algorithms?

Algorithms that identify desired pixels or regions in images are called segmentation algorithms. In image processing, we broadly categorize them into semantic segmentation algorithms and instance segmentation algorithms; here we focus on semantic segmentation and its usage in the healthcare domain. In semantic segmentation, we classify each pixel in an image into one of several classes. In the medical domain, we usually classify each pixel as either containing an abnormality or being healthy.
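To make that concrete, here is a minimal sketch (Python/NumPy, with made-up labels and a toy image size) of what a semantic segmentation output looks like: a mask that assigns every pixel exactly one class.

```python
import numpy as np

# Toy 6x6 "scan" output from a hypothetical semantic segmentation model:
# every pixel gets exactly one class label. Here 0 = healthy tissue,
# 1 = abnormality (e.g., an exudate pixel). Values are invented.
prediction = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

print("abnormal pixels:", int(prediction.sum()))       # 5
print("abnormal fraction:", float(prediction.mean()))  # ~0.14
```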

Throughout this post, let's discuss this with respect to an eye complication called diabetic retinopathy (DR). DR is an eye condition caused by diabetes and is a leading cause of blindness. It results from damage to the blood vessels of the light-sensitive tissue at the back of the eye (the retina). Ophthalmologists diagnose it by analyzing retinal scans of the eye called retinal fundus images, in which it is important to identify abnormalities such as exudates (hard and soft), hemorrhages, etc. This is where segmentation algorithms come into the picture: we use them to identify abnormalities in retinal scans. The following images showcase segmentation of abnormalities in retinal scans.

The job of ophthalmologists is to identify such abnormalities in scans and diagnose patients appropriately. But due to low-quality equipment or a lack of trained ophthalmologists, many patients miss out on very important diagnoses, which is where AI can step in and assist doctors. If the resolution is good enough, doctors can identify abnormalities like those in the 1st image of the stack below, but most of the critical abnormalities are very minute, isolated, and small in cluster size.

Hence it is very important for algorithms to predict the segments that are very hard for ophthalmologists to identify, along with the normally identifiable abnormal segments. If an algorithm only finds what is already clearly visible to doctors and misses what is nonobvious, then the algorithm/AI is not of much assistance to doctors.

AI algorithms, especially in healthcare, should be very cautious and give more importance to detecting abnormalities that are nonobvious or undetectable even by experienced doctors. Hence we should tune our algorithms so that, along with detecting large, obvious abnormality clusters, they also detect the nonobvious ones, which tend to be small and isolated in nature.

Evaluation of Segmentation Algorithms

Generally, when we build segmentation algorithms, we need to understand how well the model is training. We start by splitting the dataset into train, validation, and test sets. While the model trains, at each epoch we calculate the error and other evaluation metrics to understand whether it is overfitting or underfitting, and then we adjust/tune our hyperparameters to make it fit just right; we may even perform different forms of cross-validation. All this is well and good, but are these evaluation metrics giving us the correct intuition about our algorithm's performance? Does >~97% accuracy mean the model is performing as per the expectations we set above? To answer that, let's understand what kinds of evaluation metrics we use in the medical/healthcare domain and how we judge the performance of these segmentation algorithms.
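As a minimal sketch of the splitting step (using scikit-learn; the arrays, names, and split ratios here are illustrative stand-ins, not from a real pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for 100 scans (X) and their pixel-level label masks (y); a real
# pipeline would load fundus images and ground-truth masks instead.
X = np.random.rand(100, 64, 64)
y = np.random.randint(0, 2, size=(100, 64, 64))

# Hold out a test set first, then split the rest into train/validation
# (roughly 70/15/15 overall; the ratios are an arbitrary choice here).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```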

Different types of evaluation metrics

There are different ways of evaluating how well our segmentation algorithms perform, such as accuracy, sensitivity, specificity, precision, F-score, Dice, Jaccard similarity, etc. We use different metrics in different scenarios and for different problem statements; it depends purely on what we are trying to achieve and what our objective function is. If we want to reduce false positives (FP), we use precision; similarly, if we want to reduce false negatives (FN), we use recall/sensitivity. If both FP and FN matter equally, we use the F1-score; more generally, in the F-beta score, if the impact of FP is greater than that of FN we set beta below 1 (e.g., 0.5), and if FN matter more we set beta above 1 (typically in the range 1~10), and so on. Let's not focus on where and when each of these metrics is used; rather, let's focus on why they fall short when evaluating segmentation algorithms in healthcare.
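For reference, here is a hedged sketch of these confusion-based metrics in Python/NumPy (the function name and toy masks are my own, purely for illustration):

```python
import numpy as np

def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Confusion-based metrics over flat binary masks (1 = abnormal pixel)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-beta weights recall beta times as much as precision:
    # beta < 1 penalizes FP more, beta > 1 penalizes FN more, beta = 1 is F1.
    b2 = beta ** 2
    fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if (precision + recall) else 0.0)
    return precision, recall, fbeta

# 8 pixels: the model finds 2 of 3 abnormal pixels and raises 1 false alarm.
print(precision_recall_fbeta([1, 1, 1, 0, 0, 0, 0, 0],
                             [1, 1, 0, 1, 0, 0, 0, 0], beta=0.5))
```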

But what is the problem?

But what is the problem with these metrics? Many papers and journal articles benchmark their datasets and algorithms with them, and yet I am saying algorithms could be evaluated in a more intuitive way. To see why, let's understand the fundamental problems in healthcare segmentation.

Usually, in healthcare, sensitivity and specificity are used to measure segmentation performance. Sensitivity (recall) is defined as the number of correctly classified positive examples divided by the total number of positive examples, i.e., TP / (TP + FN). Specificity is defined as the proportion of actual negatives that are predicted as negative, i.e., TN / (TN + FP). But as you can see from the figure below, not all exudate pixels are properly detected by our segmentation algorithm, yet when we calculate the sensitivity and specificity values, they turn out to be 0.642 and 0.999 respectively. This seems reasonably good, but the actual prediction is bad. This is mainly because the large exudate cluster contributes a high true positive count, which inflates the sensitivity, and TN is so dominant that specificity is also very high.
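To make this failure mode concrete, here is a minimal sketch (Python/NumPy; the masks, cluster sizes, and positions are invented, so the numbers differ from the 0.642/0.999 in the figure) of how a large cluster's TP count and a dominant TN pool inflate both metrics even when a small cluster is missed entirely:

```python
import numpy as np

def sensitivity_specificity(gt_mask, pred_mask):
    """Pixel-wise sensitivity and specificity for binary masks (True = exudate)."""
    gt = np.asarray(gt_mask, dtype=bool)
    pred = np.asarray(pred_mask, dtype=bool)
    tp = np.sum(gt & pred)
    fn = np.sum(gt & ~pred)
    tn = np.sum(~gt & ~pred)
    fp = np.sum(~gt & pred)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 1000x1000 scan: one large 60x60 exudate cluster plus a
# small 8x8 isolated cluster (say, near the optic disc) that the model misses.
gt = np.zeros((1000, 1000), dtype=bool)
gt[100:160, 100:160] = True    # large, obvious cluster (3600 px)
gt[500:508, 500:508] = True    # small, clinically critical cluster (64 px)
pred = np.zeros_like(gt)
pred[100:160, 100:160] = True  # the model only finds the large cluster

sens, spec = sensitivity_specificity(gt, pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
# sensitivity=0.983, specificity=1.000, despite missing the critical cluster
```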

In the diagram below, the 1st image is the input retinal scan and the 2nd image is the ground truth or label (hard exudates). In the 3rd image, I have overlaid the input image and the label to show you exactly where the abnormalities are. The problems are as follows:

  1. On average, the proportion of hard exudates is <1%. That means there is a huge class imbalance.
  2. If the segmentation algorithm predicts the pixels in the larger cluster but fails to predict a smaller, isolated abnormality cluster, the overall sensitivity value will still be high, because the larger predicted cluster contributes most of the TP. Intuitively, though, it has failed to predict the most critical segments. For example, in the above image, if the algorithm fails to predict abnormalities near the optic disc (the white oval-shaped region), it is not of much use to doctors, because ignoring that abnormality can cause blindness in the patient.
  3. If we calculate the accuracy of even a worst-performing algorithm, for the image below it will still be ~99.9%, because TN dominates TP when calculating accuracy in most healthcare scenarios (see the sketch after this list). That's why we don't use accuracy in healthcare.
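A quick sketch of point 3 (Python/NumPy, with an invented imbalance ratio of 0.1%): an algorithm that detects nothing at all still scores ~99.9% accuracy, because the TN count swamps everything else.

```python
import numpy as np

# Hypothetical worst case: the model predicts "healthy" for every pixel of
# a 1000x1000 scan in which only 0.1% of pixels are hard exudates.
n_pixels = 1000 * 1000
gt = np.zeros(n_pixels, dtype=bool)
gt[:1000] = True                       # 0.1% abnormal pixels
pred = np.zeros(n_pixels, dtype=bool)  # detects nothing at all

accuracy = np.mean(gt == pred)
print(f"accuracy = {accuracy:.4f}")    # 0.9990 while finding zero abnormalities
```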

All I am trying to say is that, intuitively, existing evaluation metrics don't portray the actual performance of segmentation algorithms in the way needed for assisting doctors. If a doctor uses a segmentation algorithm that claims 99% accuracy and >80% sensitivity, that algorithm may or may not be helpful to them, because it might only detect the abnormal segments that they could have identified easily themselves.

Evaluation metrics in the healthcare domain should say a segmentation algorithm is performing well only if it detects the nonobvious, small, and isolated clusters that doctors have a high chance of missing, along with the obvious clusters. While I was working on my MTech thesis under the guidance of Prof. Neelam Sinha, we faced exactly this problem when evaluating segmentation algorithms: what the metric said and what the algorithms predicted had no correlation. Even when the metric said performance was good, in reality it wasn't! So we came up with a new evaluation metric that portrays the performance of segmentation algorithms more intuitively. We will see how we derived it and how it solves these problems in the upcoming P2 post.
