K-Means clustering of MNIST dataset

Update as of 16th Jan 2020: Improved clustering accuracy to 98.5% using the simpler CNN model described on the Keras examples page. Also, instead of extracting the feature vector from the final activation layer, we should take the output of the layer just before the final activation layer.

In this post, we will cluster the MNIST digits dataset using the K-Means algorithm, with accuracy close to 90%. Along the way we will look at different aspects of extracting features from images, and see how such features compare to traditional ones when fed to the K-Means algorithm. The complete code used in this post for extracting features and feeding them to the K-Means algorithm for clustering is available at my GitHub link. We will also see how the elbow method is used to determine K. Think of it as understanding how the Hello World! program of machine learning works. Readers familiar with the fundamentals, such as clustering and the K-Means algorithm, can skip to the implementation part.

Content

  • What is clustering
  • K-Means algorithm
  • MNIST dataset
  • Feature extraction
  • K-Means clustering
  • Evaluating clustering

What is clustering?

In simple words, clustering is the grouping of similar objects, and it's a type of unsupervised learning in machine learning. For example, the WhatsApp folder on your phone will have a mixture of all kinds of images; if you want to separate images of documents into one cluster/subfolder, images of festival celebrations into another cluster/subfolder, and so on, you should be able to do it using unsupervised learning.

Unsupervised learning means you have a bunch of data in any format, such as images, text, videos, or documents, and you want to group it by similarity, so the algorithm learns the similarity by observing the given input and clusters it accordingly. All such algorithms partition the data into groups of similar objects by discovering structure or patterns in the data. A good clustering algorithm should give you minimal similarity across clusters and maximum similarity within a cluster. There are different types of clustering algorithms, such as K-Means, Mean-shift, DBSCAN, and hierarchical agglomerative and divisive clustering.

K-Means algorithm

It is an unsupervised clustering algorithm that partitions the given data into K clusters. Following is the algorithm (a minimal sketch of these steps in code appears after the list):

  1. Choose K random points as cluster centers or cluster means.
  2. For all the N data points, assign each data point x_i to one of the K clusters, i.e. the cluster whose center is closest to the data point. Over the K cluster means m_k:

                C(i) = argmin_k ||x_i − m_k||², i = 1, …, N

  3. Update each cluster center by taking the average of the points within that cluster.
  4. Repeat steps 2 and 3 until convergence, i.e. until the cluster means no longer change.
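
Here is a minimal NumPy sketch of those four steps, just to make the loop concrete (the actual clustering later in this post uses sklearn's KMeans):

import numpy as np

def kMeansSketch(X, K, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial cluster means
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each point to the cluster whose mean is closest
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Step 3: move each mean to the average of the points assigned to it
        # (assumes no cluster ends up empty, which a robust version must handle)
        newMeans = np.array([X[assignments == k].mean(axis=0) for k in range(K)])
        # Step 4: stop once the means no longer change
        if np.allclose(newMeans, means):
            break
        means = newMeans
    return assignments, means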

MNIST dataset

Most machine learning enthusiasts start with this dataset; it consists of handwritten digits in image format. It has a training set of 60,000 images and a test set of 10,000. Each image is a 28 x 28 grayscale image, which flattens to a 1 x 784 vector. It is easy for beginners to work with, as we can achieve more than 95% accuracy using CNN models.
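
If you want to poke at the dataset yourself, a quick way to load it (assuming Keras is installed) is:

from keras.datasets import mnist

(trainData, trainLabels), (testData, testLabels) = mnist.load_data()
print(trainData.shape, testData.shape)  # (60000, 28, 28) (10000, 28, 28)
# Flattening each 28 x 28 image gives the 1 x 784 vectors mentioned above
flatTrain = trainData.reshape(len(trainData), 784)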

Feature extraction

Before we begin clustering handwritten digits, let's understand how it is done for a simpler case with a 2-dimensional dataset. Consider that we randomly sample people from 3 different countries and plot their weight and height as follows:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer versions

# Generate 500 synthetic 2-D points around 3 centers
features, label = make_blobs(n_samples=500, centers=3,
                             cluster_std=0.55, random_state=0)
plt.figure(figsize=(10, 6))
plt.scatter(features[:, 0], features[:, 1], s=50);

Assuming you are just given the feature values (weight and height) of these people and asked to cluster/group them according to the pattern we observe, we use weight and height as the 2 features in the K-Means clustering algorithm. The sklearn library has an implementation of the KMeans algorithm, and all you need to do is feed it the list of weights and heights as features; it will take care of labeling the given dataset. Following is the code snippet and output of the clustering.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(features)
kmeansLabels = kmeans.predict(features)
plt.scatter(features[:, 0], features[:, 1], c=kmeansLabels, s=50, cmap='viridis')

# Mark the learned cluster centers in red
clusterCenters = kmeans.cluster_centers_
plt.scatter(clusterCenters[:, 0], clusterCenters[:, 1], c='red', s=200, alpha=0.8);

It's amazing, right? But what about images; what are we going to use as features? I would think of using the pixel intensity values, or using gradients to detect edges and find the shape of the digits in our case. Or I could use SIFT (Scale-Invariant Feature Transform) or HOG (Histogram of Oriented Gradients) features of the given digit images. But I took advantage of having an amazing professor and colleagues to consult, and learned that the current state-of-the-art technique in machine learning for this kind of image problem is to use CNNs (convolutional neural networks). Maybe you can try traditional image processing methods and let me know about the results in the comment section.
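
For reference, here is a sketch of one such traditional feature, HOG descriptors via scikit-image; the parameter values here are illustrative assumptions, not tuned for MNIST:

from skimage.feature import hog

def hogFeatures(image28x28):
    # 9 orientation bins over 7 x 7 pixel cells; yields one descriptor per image
    return hog(image28x28, orientations=9,
               pixels_per_cell=(7, 7), cells_per_block=(2, 2))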

So I went ahead and learned a bit about what exactly convolutional neural networks are and how they use kernels or feature maps to extract very abstract and useful features. For my experiments, I found a very easy-to-use and helpful reference by Adrian Rosebrock at this pyimagesearch link. He used LeNet to train a model to classify the MNIST dataset and reports 98.49% accuracy after only 20 epochs. But he explains how to classify the MNIST dataset, which differs from what we are doing.

So in this post, I am not going to dive deep into how LeNet can accurately classify the MNIST dataset, but will use his code for our purpose of clustering the MNIST dataset by extracting features from the model trained with LeNet. Let's understand how convolutional neural networks work in some other post, which I will keep updated; for now, you can refer to the above-mentioned link for a quick understanding.

In our previous example, we used weight and height as features for clustering with the KMeans algorithm; now we are going to use the feature vectors extracted by a CNN model to cluster the MNIST dataset. Assuming you have a trained model that understands the given MNIST dataset well and outputs classification probabilities in its last layer, we will use the layer just before the output layer to extract features, because its activations best describe the digit in the input image. Image source: link

Following is the code snippet for extracting feature vectors from the LeNet model, which was trained on the MNIST dataset of 70,000 images with an accuracy of 95.81%.

from keras import backend as K

# Build a backend function that maps the network input to the output of layer 9
lastLayerOp = K.function([model.layers[0].input],
                         [model.layers[9].output])

In the LeNet architecture diagram above, the convolution and activation layers appear combined, but they are actually added separately, so that architecture has layers like below.

  1. Convolution layer with 20 feature maps of size (5 x 5)
  2. ReLU activation layer
  3. Max pooling layer
  4. Convolution layer with 50 feature maps of size (5 x 5)
  5. ReLU activation layer
  6. Max pooling layer
  7. Flattening layer
  8. Dense fully connected layer
  9. ReLU activation layer
  10. Dense fully connected layer
  11. Softmax activation layer

The 11th layer is the output layer, which has 10 neurons producing the probability of the given image belonging to each class. This probability is computed from the output of the previous layer, which is the one we are interested in; that's why you see model.layers[9].output (layers are zero-indexed, so index 9 is the 10th layer, the dense layer just before the softmax).
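
Note that the same features can also be obtained with the Keras Model API instead of a backend function; a sketch, assuming the same trained model and input data:

from keras.models import Model

# Truncated model mapping the input image to the output of layer 9
featureExtractor = Model(inputs=model.input, outputs=model.layers[9].output)
featureVectors = featureExtractor.predict(data)  # one 10-dimensional vector per image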

import time
import pandas as pd

def dumpFeatures(layeredOutput, data, dataLabels, isTrain):
    # Extract a feature vector per image and dump them to CSV in chunks
    chunkSize = 5000
    strName = "Train" if isTrain else "Test"
    titleRow = ["Index", "Label", "Data", "Feature Vector"]
    startTime = time.time()
    totalTime = startTime
    featureDataList = list()
    for idx in range(len(data)):
        x = [data[idx]]
        # Run the image through the network up to layer 9
        featureVector = layeredOutput([x])[0]
        dataList = [idx, dataLabels[idx].tolist(), (data[idx] * 255.0).tolist(), featureVector.tolist()[0]]
        featureDataList.append(dataList)
        # Flush every chunkSize images to its own CSV file
        if (idx + 1) % chunkSize == 0:
            print("Extracted %d %s data features in [%.3f seconds]" % (idx + 1, strName, time.time() - startTime))
            startTime = time.time()
            fileName = outputDir + 'MNIST_' + strName + '_FV_' + str(idx + 1) + ".csv"
            df = pd.DataFrame(featureDataList, columns=titleRow)
            df.to_csv(fileName)
            featureDataList.clear()
    # Dump any remaining features (the guard avoids overwriting the last chunk with an empty file)
    if featureDataList:
        fileName = outputDir + 'MNIST_' + strName + '_FV_' + str(idx + 1) + ".csv"
        df = pd.DataFrame(featureDataList, columns=titleRow)
        df.to_csv(fileName)
    print("Extracted total of %d %s data features in [%.3f seconds]" % (len(data), strName, time.time() - totalTime))

I am taking the features learned by LeNet and dumping them into CSV files for later processing, in chunks of 5,000 images per file for all 70,000 images. You can take these extracted features and feed them to any kind of machine learning algorithm for your own purpose, just as we are using them here to solve the clustering problem. Following is the sample format in which I am storing the information.

One sample feature vector representing an image is [2.1632354259490967, 0.6524435877799988, 2.5808801651000977, 4.8095502853393555, -10.707579612731934, 8.62977123260498, -3.723630905151367, -7.045200347900391, -2.6436192989349365, 0.665986955165863]. It's an array of 10 float values; I am also storing the image data and its true label for evaluating the results from KMeans.

K-Means clustering of MNIST data

As I was saying, we can use these feature vectors for any kind of machine learning task, such as classification, clustering, or finding the similarity between given images, but let's focus on using them to cluster the images into 10 groups, because we know the given 70,000 images belong to 10 digits. For an unknown dataset we would need to find K first, so we will discuss one of the methods for doing that, called the elbow method.

Let's first read the dumped feature vectors and feed them to the K-Means algorithm.

from os import listdir
from os.path import isfile, join
import numpy as np
import pandas as pd

def getFeatureVectors(featureDir):
    # Read the dumped CSV chunks back into labels, image data and feature vectors
    featureFiles = [join(featureDir, file) for file in listdir(featureDir) if isfile(join(featureDir, file))]
    featureFiles.sort()
    titleRow = ["Index", "Label", "Data", "Feature Vector"]
    labelList = list()
    characterDataList = list()
    featureVectorList = list()
    for file in featureFiles:
        print("Parsing %s" % (file))
        data = pd.read_csv(file)
        dataFrame = pd.DataFrame(data, columns=titleRow)
        for i in range(len(dataFrame)):
            # The lists were serialized as strings, so strip the brackets and split
            label = dataFrame['Label'][i].replace('[', '').replace(']', '').split(',')
            characterData = dataFrame['Data'][i].replace('[', '').replace(']', '').split(',')
            featureVector = dataFrame['Feature Vector'][i].replace('[', '').replace(']', '').split(',')

            labelList.append(label)
            characterDataList.append(characterData)
            featureVectorList.append(featureVector)

    for i in range(len(characterDataList)):
        # Convert the strings back to numbers; the image data becomes a 28 x 28 array
        characterDataList[i] = [int(float(value)) for value in characterDataList[i]]
        characterDataList[i] = np.array(characterDataList[i]).reshape(28, 28)
        featureVectorList[i] = [float(value) for value in featureVectorList[i]]
        labelList[i] = [float(value) for value in labelList[i]]
    return labelList, characterDataList, featureVectorList


labelList, characterDataList, featureVectorList = getFeatureVectors('output/features/')

Now featureVectorList holds the list of feature vectors, and characterDataList holds a list of 28 x 28 grayscale images.

But before feeding the feature vectors to the K-Means algorithm, we first need to find an appropriate K, or the results are going to be very bad. There are multiple ways to determine K, such as the elbow method, dendrograms, and silhouette score analysis; we will see how the elbow method comes to our rescue.

The idea is very simple: for a range of plausible values of K, calculate the error, i.e. the within-cluster sum of squares (WCSS), WCSS = Σ_k Σ_{x_i in cluster k} ||x_i − m_k||², which sklearn exposes as inertia_. We will use WCSS to find the optimal K.

def findOptimalK():
    wcss = list()
    for k in range(1, 15):
        kmeans = KMeans(n_clusters=k)
        kmeans = kmeans.fit(featureVectorList)
        wcss.append(kmeans.inertia_)  # within-cluster sum of squares for this k

    # Plot WCSS against k and look for the elbow
    plt.figure(figsize=(15, 6))
    plt.plot(range(1, 15), wcss, marker="o")

Following is the resulting elbow plot:

The elbow method says that you can find K where there is an elbow in the curve: before the elbow, the error decreases steeply, and after it, the decrease isn't significant. As you can see in the graph, after K = 10 the error doesn't decrease much, so we can conclude that there are most likely 10 groups in the given data, and indeed we already know that our input data consists of 10 groups.

So we can go ahead and ask the sklearn K-Means implementation to cluster the given data into 10 clusters as follows:

k = 10
kmeans = KMeans(n_clusters=k)
kmeans = kmeans.fit(featureVectorList)
labels = kmeans.predict(featureVectorList)
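
Before visualizing anything, a quick sanity check on the cluster sizes can't hurt; since MNIST has roughly the same number of images per digit, we'd expect roughly equal clusters (a sketch):

import numpy as np

print(np.bincount(labels))  # number of images assigned to each of the 10 clusters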

Evaluating clustering

But how will we check whether the clustering happened properly? How will we know that all images of 0's ended up in one cluster? Let's dump the results by randomly selecting 100 images from each cluster and visualize how well it has been clustered.

import random

def dumpResults():
    # Group each image and its true label by the cluster it was assigned to
    clusterData = []
    for i in range(10):
        clusterData.append([])
    for i in range(len(labels)):
        clusterData[labels[i]].append([characterDataList[i], labelList[i]])

    for l in range(10):
        print("Plotting for cluster %d" % (l))
        fig, axes = plt.subplots(nrows=10, ncols=10, sharex=True)
        fig.set_figheight(15)
        fig.set_figwidth(15)
        count = 0
        # Sample 100 images from the cluster (assumes each cluster has at least 100)
        randomNoList = random.sample(range(0, len(clusterData[l])), 100)
        for i in range(10):
            for j in range(10):
                axes[i][j].imshow(clusterData[l][randomNoList[count]][0])
                count += 1
        fig.savefig(kMeansOPDir + 'kmeans_cluster' + str(l) + '.png')
    return clusterData

But just dumping 100 random images from each cluster and eyeballing whether a cluster contains the same digit images will not help us evaluate the clustering efficiency. Hence we need to evaluate by finding the mis-clustered digits, or rather by counting the maximum number of same-digit images in a cluster. Dividing this maximum count by the total number of digits in that cluster gives you the accuracy (this is essentially what the clustering literature calls purity). That's the definition of accuracy, right? I feel I am awesome right now for coming up with that 😀, but in all my coding experience, whenever I feel that way I am usually making some blunder. So if you figure out something wrong with this way of calculating the accuracy of KMeans clustering, please do keep me updated by commenting.

def evaluateKMeans(clusterData):
    print('')
    totalAccuracy = 0
    for j in range(10):
        # Count how many images of each true digit landed in cluster j
        labelsClusteredList = [0] * 10
        for i in range(len(clusterData[j])):
            # The stored label is one-hot, so the index of 1.0 is the digit
            labelsClusteredList[clusterData[j][i][1].index(1.0)] += 1
        # Cluster accuracy = share of the most common digit in the cluster
        clusterAccuracy = max(labelsClusteredList) / sum(labelsClusteredList) * 100.0
        totalAccuracy += clusterAccuracy
    # Note: this averages per-cluster accuracy, weighting every cluster equally
    print("KMeans clustering Accuracy " + str(totalAccuracy / 10))

KMeans clustering Accuracy 89.03957836396373
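
As a cross-check on my home-grown accuracy, sklearn also ships label-agnostic clustering metrics; a sketch, assuming the one-hot rows of labelList are first converted back to digit labels:

import numpy as np
from sklearn import metrics

trueLabels = [int(np.argmax(label)) for label in labelList]
print("Homogeneity: %.3f" % metrics.homogeneity_score(trueLabels, labels))
print("Adjusted Rand index: %.3f" % metrics.adjusted_rand_score(trueLabels, labels))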

I find the accuracy of clustering MNIST data using the K-Means algorithm with features extracted from LeNet to be close to 90%. So if you spot a mistake in the accuracy calculation in the above code, or know another way to improve clustering accuracy with K-Means by feeding in a different kind of feature vector, please post it in the comment section.

In my next post, we will look at how hierarchical agglomerative clustering works with varying distance metrics.
