Image search engine

Have you noticed the camera icon in the latest Amazon shopping app? If you tap that icon and capture a photo of any object, you are searching Amazon using that image. In the background, Amazon identifies what object is in the image and shows you similar products from its catalog.

This is what I mean by an image search engine!

Amazon shopping app screenshots

Amazon has millions of products across a variety of categories in its database. Have you ever wondered how, technically, it searches for similar products based on an image and gives you results instantly? Then you are in the right place. In this blog, let's understand what an image search engine is all about.

The following are the highlights/contents:

  1. What is a traditional DBMS, and why not use it for indexing images?
  2. What are neural information retrieval and approximate K-NN algorithms?
  3. MNIST digits image search
  4. Scraping images and pre-processing
  5. Indexing and searching custom images

By the end of this post, we will have implemented an image search engine for various types of images, where you will be able to

  • Search for digit images from the MNIST data set
  • Search for world leaders' images scraped from Google
  • Search for celebrity images from the MiniCelebA data set

and see similar images from those data sets. The following video shows some of the image search results.


Code for all the experiments related to this post is available at my GitHub repository, VisualSearchEngine.

What does a traditional DBMS mean in this context?

A database management system (DBMS) is a software system that helps create, access, and manage databases. The main purpose of a DBMS is to store, manage, and retrieve large amounts of information quickly. There are different types of DBMSs, such as relational, NoSQL (key-value, document, column, graph), hierarchical, etc., and each kind has various implementations. Oracle, MySQL, IBM DB2, and SQLite are implementations of relational databases; MongoDB and CouchDB are document-based databases; and Amazon's DynamoDB is a key-value store with the added ability to store JSON and act as a document store.

The point is that we have many databases, and each of them implements indexing, memory management, concurrency control, transaction management, and the ACID properties differently. The main reason databases came into existence is to retrieve information from a large chunk of data instantly, and they do it using indexing. Indexing is the process of storing additional information about the data in such a way that retrieval can be done faster.

Let's say we have a table to store product details in our database, and we have to store millions of product records in that table. We index it so that when a user queries for a particular product, its information can be retrieved instantly.
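To make the idea of an index concrete, here is a toy sketch (not a real DBMS index, and the product rows are made up for illustration): an index is extra bookkeeping that maps a key straight to a record, so a query avoids scanning every row.

```python
# Toy "product table" as a list of rows
products = [
    {"id": 1, "name": "Vicks Inhaler", "price": 90.0},
    {"id": 2, "name": "Notebook", "price": 45.0},
    {"id": 3, "name": "Water Bottle", "price": 120.0},
]

# Without an index: linear scan over all rows, O(n) per query
def find_scan(name):
    for row in products:
        if row["name"] == name:
            return row
    return None

# With an index: build once, then each lookup is O(1) on average
index = {row["name"]: row for row in products}

print(find_scan("Vicks Inhaler"))   # scans 1..n rows
print(index["Vicks Inhaler"])       # jumps straight to the row
```

Real databases use structures like B-trees rather than hash maps, but the principle is the same: pay an upfront cost to build the index, then answer queries without touching every record.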

Why can't we use a traditional DBMS for storing and indexing images?

We can use these traditional databases to store and index images as a BLOB data type, but that is not recommended! Generally, we store metadata of images, such as the image path, a unique identifier, and perhaps the width and height, in the database, but not the entire image.

For the sake of understanding indexing and searching, let's take the example of searching for a Vicks inhaler in the Amazon shopping app. If we searched the app by typing text, then in the backend the Amazon server would query the database's product table with "Vicks inhaler" as the product description. The database would go directly to the table's index, find the position of the Vicks inhaler product, and return the product information. This is how databases work with normal data (string, int, float, boolean). But what about images? How would a database store and index them?

An image is nothing but a 3-dimensional matrix of pixel values. If an RGB image is of size 512×512, then we have 512×512×3 = 786,432 values to store, with each pixel channel typically represented as a value in 0-255 (an 8-bit representation). Storing 786,432 integer values per image in the database and indexing each image by pixel value doesn't make sense, because spatial information is lost once you compare pixel positions from one image to another. And even in the cases where it might make sense, comparing millions of images with 786,432 values each is computationally expensive. Hence it is highly discouraged to store and index raw images in a database.
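The arithmetic above is worth seeing on one line; the one-million-image catalog below is just an assumed figure for illustration.

```python
# Raw pixel storage for one 512x512 RGB image, as described above
width, height, channels = 512, 512, 3
values_per_image = width * height * channels
print(values_per_image)             # 786432 values per image

# At 1 byte per value, a hypothetical million-image catalog would need
# roughly 786 GB of raw pixels, before any index overhead at all
total_bytes = values_per_image * 1_000_000
print(total_bytes / 1e9)            # ~786.4 GB
```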

For argument's sake, let's say we have the Aadhaar database, containing images of all the people in India. Now, if the Government of India wants to check whether a particular person's information is available using his or her image, how would they do it?

One has to take the person's photo and search the entire database for a matching face image. But how? One way is to compute the distance between the given face image and every image in the database and pick the closest one; the distance metric can be Euclidean, cosine, or any other relevant metric. But this will not work on raw pixels, because even if a single pixel changes, the distance in such a high-dimensional space can vary a lot. Hence we need a different way of comparing images.
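A tiny experiment illustrates the fragility of raw-pixel distances: shifting an image by a single column leaves its content unchanged to a human, yet the pixel-wise Euclidean distance becomes large.

```python
import numpy as np

# A random 8x8 grayscale image, and the same image shifted one column
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8)).astype(float)
shifted = np.roll(img, 1, axis=1)   # every pixel moves one column right

dist_same = np.linalg.norm(img - img)        # identical images: distance 0
dist_shift = np.linalg.norm(img - shifted)   # "same" picture: large distance
print(dist_same, dist_shift)
```

This is exactly why we need a representation that captures *what is in* the image, not *where each pixel sits*.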

What are neural information retrieval and approximate K-NN algorithms?

As we concluded, we cannot store and index every pixel value for millions of images; we need a better image representation, one that uniquely identifies each image, for storing and indexing in a database. This is where deep neural networks come to our aid. But first, let's understand what information retrieval is.

Information retrieval is the task of extracting information from a large chunk of data. It is usually discussed in the context of analyzing text data in natural language processing, but the technique can be applied to any kind of data: text, speech, images, videos, etc.

Image referenced from tutorialspoint

Due to recent advances in deep neural networks, and DNNs delivering state-of-the-art results, people have started using neural networks for information retrieval. Compared to classical machine learning techniques, where we must specify which features to extract, DNNs automatically learn features that uniquely represent the input data. This makes neural networks more robust for use in information retrieval.

The information retrieved from the data is a multi-dimensional vector: in layman's terms, an array of float values that uniquely describes the given data. The length of this feature vector depends on the model used to extract it; it is usually 128, 1024, 2048, etc. So it is clearly better to store 2,048 values instead of 786,432. But even if we store 2,048 values in a traditional database, will it be able to index the data well?

And since we are talking about building an image search engine, how does this information retrieval come into the picture?

Approximate K-NN algorithms

This is where approximate K-nearest-neighbor (K-NN) algorithms come to our aid. We found that instead of comparing each pixel of two images, we can compare their feature representations. So, just as indexing preprocesses the data in a database, we create an index of the feature representations of all the images; and just as we query a database, we look up the feature representation of the search image in our indexed feature-vector database. To index the feature vectors, we use approximate K-nearest-neighbor algorithms.

Let's first understand what nearest neighbors are. If we use a one-dimensional feature vector for each image, then we represent each image with a single number. For the sake of discussion, let's assume the feature vectors of 100 images are the numbers 1 to 100, i.e.

the feature representation of image_1 is [1], image_2 is [2], ..., and image_100 is [100]. If we want to find the 5 nearest neighbors of image_30, they are image_27, image_28, image_29, image_31, and image_32, because the values nearest to the feature vector 30 are 27, 28, 29, 31, and 32. This is easy to calculate and understand because we are using Euclidean distances between 1-dimensional feature vectors. The same holds for multi-dimensional feature spaces; they are just harder to visualize.
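The 1-D example above can be written out directly as an exact (brute-force) nearest-neighbor search; the approximate algorithms discussed next exist to avoid this full scan at scale.

```python
import numpy as np

# 1-D feature vectors for image_1 .. image_100, exactly as in the text
features = np.arange(1, 101).reshape(-1, 1).astype(float)

query = features[29]                 # image_30's feature vector, [30.0]
dists = np.linalg.norm(features - query, axis=1)
order = np.argsort(dists, kind='stable')   # stable sort breaks ties predictably

# order[0] is the query itself (distance 0); the next 5 are its neighbors.
# +1 converts a 0-based index back into an image number.
neighbor_ids = [int(i) + 1 for i in order[1:6]]
print(sorted(neighbor_ids))          # [27, 28, 29, 31, 32]
```

Note that 27 and 33 are tied at distance 3; the stable sort keeps the lower index, matching the list in the text.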

There are different approaches for finding approximate nearest neighbors, such as hash-based, tree-based, and graph-based, and there are multiple algorithmic implementations of each approach, such as locality-sensitive hashing, the K-dimensional tree, small-world graphs, and Hierarchical Navigable Small World (HNSW) graphs.
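To give a feel for the hash-based family, here is a minimal sketch of locality-sensitive hashing with random hyperplanes (my own toy illustration, not production code): each random plane contributes one sign bit, and vectors that point in similar directions tend to land in the same bucket.

```python
import numpy as np

# 16 random hyperplanes through the origin in a 128-dim space
rng = np.random.default_rng(42)
dim, n_planes = 128, 16
planes = rng.normal(size=(n_planes, dim))

def lsh_hash(v):
    # One bit per hyperplane: which side of the plane is v on?
    bits = (planes @ v) > 0
    return bits.tobytes()

v = rng.normal(size=dim)
noisy = v + 0.01 * rng.normal(size=dim)   # a near-duplicate vector
far = rng.normal(size=dim)                # an unrelated vector

# Near-duplicates usually share a bucket; unrelated vectors usually don't
print(lsh_hash(v) == lsh_hash(noisy))
print(lsh_hash(v) == lsh_hash(far))
```

Because the hash depends only on the sign of each projection, it ignores vector magnitude entirely, which is why this variant approximates cosine similarity. Real systems use many such hash tables to trade memory for recall.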

We don't need to reinvent the wheel nowadays, because people generously share implementations of these algorithms as libraries. Some of them are:

Non-Metric Space Library (NMSLIB), ANNOY, Facebook AI Similarity Search (FAISS), etc.

I found an interesting repository on GitHub that benchmarks approximate nearest-neighbor libraries. Note that most state-of-the-art models use high-dimensional feature vectors to uniquely describe images, so it is challenging for approximate K-NN algorithms to index millions of images when each one carries that much data.

MNIST digits image search

For experimentation, I will be using the Hierarchical Navigable Small World (HNSW) graph implementation from the NMSLIB library. The following are the search results for each digit in the MNIST data set. As you can see, the top 100 results for each digit query are the corresponding digits from the data set.


In my previous post, I discussed extracting features for the MNIST digits data set using a simple CNN, and you can find the source code and output feature vectors in the GitHub repo. Go directly to the output directory and write the following code to load the feature vectors of all 70,000 digit images.

from os import listdir
from os.path import isfile, join

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def getFeatureVectors(dir):
    featureFiles = [join(dir, file) for file in listdir(dir) if isfile(join(dir, file))]
    titleRow = ["Index", "Label", "Data", "Feature Vector"]
    labelList = list()
    characterDataList = list()
    featureVectorList = list()
    for file in featureFiles:
        print("Parsing %s" % (file))
        data = pd.read_csv(file)
        dataFrame = pd.DataFrame(data, columns = titleRow)
        for i in range(len(dataFrame)):
            # Each column holds a stringified list like "[1, 2, ...]"
            label = dataFrame['Label'][i].replace('[', '').replace(']', '').split(',')
            characterData = dataFrame['Data'][i].replace('[', '').replace(']', '').split(',')
            featureVector = dataFrame['Feature Vector'][i].replace('[', '').replace(']', '').split(',')
            labelList.append(label)                     # these appends were missing,
            characterDataList.append(characterData)     # leaving the lists empty
            featureVectorList.append(featureVector)
    # Convert the parsed strings to numeric types
    for i in range(len(characterDataList)):
        characterDataList[i] = [int(float(value)) for value in characterDataList[i]]
        characterDataList[i] = np.array(characterDataList[i]).reshape(28, 28)
        featureVectorList[i] = [float(value) for value in featureVectorList[i]]
        labelList[i] = [float(value) for value in labelList[i]]
    return labelList, characterDataList, featureVectorList

labelList, characterDataList, featureVectorList = getFeatureVectors('./features/')

Then we load these feature vectors into the NMSLIB library and use the HNSW algorithm to create an index over them using cosine distance. After that, we take one of the existing feature vectors extracted from a digit image and fetch its top 100 approximate nearest neighbors with the following code.

import nmslib

data = np.array(featureVectorList).astype(np.float32)
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)    # register the vectors (this call was missing)
index.createIndex({'post': 2}, print_progress=True)
ids, distances = index.knnQuery(data[0], k=100)

# Plot the top-100 neighbors as a 10x10 grid of digit images
fig, axs = plt.subplots(10, 10, figsize=(100, 100), facecolor='w', edgecolor='k')
axs = axs.ravel()
for i, id in enumerate(ids):
    axs[i].imshow(characterDataList[id], cmap='gray')
    axs[i].axis('off')
plt.show()

Scraping images and pre-processing

Since image search using the HNSW approximate nearest-neighbor algorithm worked perfectly fine on the MNIST digits data set, I wanted to try it on a slightly more complicated one. So I decided to create my own data set of world leaders and check how well these algorithms perform. I scraped images from Google by following the instructions from pyimagesearch.

Go to Google Images and type in your search query; as I am downloading images of world leaders, I am starting with the honorable PM Narendra Modi. Then select Settings -> Advanced search; under image size select Medium, under aspect ratio select Square, leave the rest of the fields as they are, and click Advanced search.

After getting the search results, scroll down to load more than 100 images, then press F12 to open the developer tools in your browser and paste the code from the following link into the console section.

A file with the URLs will be downloaded; then run the download script with --urls urls.txt --output images/modi


In the same way, I downloaded images for a few more world leaders. Then I wanted to make sure all the downloaded images were the same size, because when we process images for feature extraction, we expect them to share the same origin in the coordinate system. Run the following from my GitHub, which will dump the images from each world leader's folder into a single folder and resize them to 512×512:

python prepare_data.py -src data\images -dst data\processed_images


Indexing and searching custom images

The idea of our image search engine is to search a database of world leaders' images: given an image containing the face of a person (leader), we find other images in which the same person appears. We understood that neural information retrieved from DNN models gives the best results for finding approximate nearest neighbors, so we need a deep neural network trained to recognize faces in a given image. The VGGFace2 model, a convolutional-neural-network-based model developed by researchers in the Visual Geometry Group at Oxford, is an excellent choice for this task.

I will be using the Keras implementation of the VGGFace2 model provided by rcmalli on GitHub.

def get_feature_extraction_model(self):
	if self.model_type == "vgg16":
		model = VGGFace(model=self.model_type, include_top=False
						, input_shape=self.image_size, pooling='avg')
		output = model.get_layer('conv5_3').output
		output = GlobalAveragePooling2D()(output)
		feature_model = Model(inputs=model.input, outputs=output)
	elif self.model_type == "resnet50":
		feature_model = VGGFace(model='resnet50', include_top=False, input_shape=(224, 224, 3), pooling='avg')
	return feature_model

def extract_face(self, image_file, required_size=(224, 224)):
	image = plt.imread(image_file)
	start_time = time.time()
	rects = self.detector.detect_faces(image)
	end_time = time.time()
	if self.debug:
		print('Detected %d faces in %s at [%.3f] seconds' % (len(rects), image_file, (end_time - start_time)))
	if rects is None or len(rects) == 0:
		if self.debug:
			print('MTCNN did not detect face for: ' + image_file.split('\\')[-1])
		return None
	faces = []
	for rect in rects:
		x1, y1, width, height = rect['box']
		x2, y2 = x1 + width, y1 + height
		# Clamp boxes that extend beyond the image boundary
		if y1 < 0 or y2 >= image.shape[0] or x1 < 0 or x2 >= image.shape[1]:
			if self.debug:
				print(str((x1, y1, x2, y2)) + ' is beyond image of size: ' + str(image.shape)
					  + ' for file ' + image_file.split('\\')[-1])
			x1 = max(x1, 0)
			y1 = max(y1, 0)
			x2 = min(x2, image.shape[1])
			y2 = min(y2, image.shape[0])
		face = image[y1:y2, x1:x2]
		face = Image.fromarray(face)
		face = face.resize(required_size)
		face = np.asarray(face)
		faces.append(face)   # was missing: collect every resized face crop
	return faces

def get_face_embeddings(self, images_list):
	count = 0
	start_time = time.time()
	faces_list = []
	iter_start_time = time.time()
	for image_file in images_list:
		faces = self.extract_face(image_file)
		if faces is None or len(faces) <= 0:
			faces = []   # no face found; keep an empty entry so ids stay aligned
		faces_list.append(faces)   # was missing: collect faces per image
		count += 1
		if count % 100 == 0:
			iter_end_time = time.time()
			print('Extracted %d faces in [%.3f] seconds ' % (count, (iter_end_time - iter_start_time)))
			iter_start_time = time.time()
	end_time = time.time()
	print('Extracted faces in %d images at [%.3f] seconds' % (len(images_list), (end_time - start_time)))
	count = 0
	face_image_map = dict()
	all_faces = []
	for id, faces in enumerate(faces_list):
		for face in faces:
			face_image_map[count] = id   # flat face index -> source image index
			all_faces.append(face)       # was missing: flatten faces for the model
			count += 1
	print('Total %d faces detected' % len(all_faces))
	start_time = time.time()
	faces = np.asarray(all_faces, 'float32')
	faces = utils.preprocess_input(faces, version=2)
	end_time = time.time()
	print('Preprocessed faces in [%.3f] seconds: ' % (end_time - start_time))
	model = self.get_feature_extraction_model()
	start_time = time.time()
	face_embeddings = model.predict(faces)
	end_time = time.time()
	print('Extracted features in [%.3f] seconds' % (end_time - start_time))
	return face_embeddings, face_image_map

I am using Google Colab for all my experiments, so you will find the feature-extraction code in .py files, but they are run from inside Colab Jupyter notebooks. You can find all the necessary library-installation commands at the beginning of each notebook.

I am using MTCNN to detect faces in the images; it gives us the bounding box (left, top, width, height) for each detected face. We need to extract the faces because VGGFace2 was trained on images containing only a face. We also need to resize the face images to 224×224, the input size expected by the VGGFace2 models.

The VGG_Face library currently supports 3 models: VGG16, RESNET50, and SENET50. I have so far evaluated search results using only the VGG16 and RESNET50 models; maybe in the next post I will showcase the pros and cons of each.

python -model_type "vgg16" -debug False -images processed_square_images/

The above command uses the VGG16 model to extract feature vectors from all the images containing faces. We then feed these feature vectors to an approximate K-NN algorithm for indexing; as before, we use the NMSLIB library with the HNSW algorithm.

def index_save_features(self, face_embedding, face_image_map):
	feature_vectors = np.array(face_embedding).astype(np.float32)
	index = nmslib.init(method='hnsw', space='l2')
	index.addDataPointBatch(feature_vectors)   # was missing: register the vectors
	index.createIndex({'post': 2}, print_progress=True)
	output_folder = '/content/gdrive/My Drive/FaceProcessing/code/output/l2_'
	hnsw_index_path = output_folder + args["model_type"] + '_FacialFeatures.hnsw'
	print('Saving hnsw index @ : ' + hnsw_index_path)
	index.saveIndex(hnsw_index_path)           # was missing: persist the index
	face_image_path = output_folder + args["model_type"] + '_face_to_image.pkl'
	print('Saving face to image mapping dictionary @: ' + face_image_path)
	file = open(face_image_path, 'wb')
	pickle.dump(face_image_map, file)
	file.close()

The index_save_features method above indexes the feature vectors and saves them for the image-search program. We also need to maintain the mapping between feature vectors and images: when we query the database with the feature vector of a search image, the result is an array of nearest-neighbor feature vectors, and we must identify which images those feature vectors belong to. Moreover, a single image can contain multiple faces, and we extract features for each face in an image; you can see multiple faces in the images below.

So I maintain a dictionary where the key is the index into the feature-vector list and the value is the image to which that face's feature vector belongs.
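The mapping can be sketched in a few lines (the file names and face counts below are made up for illustration): feature vectors live in one flat list, and the dictionary maps each flat index back to the image the face was cut from.

```python
# Hypothetical per-image face counts, standing in for MTCNN's output
faces_per_image = {"group_photo.jpg": 3, "modi.jpg": 1, "summit.jpg": 2}

face_image_map = {}
count = 0
for image_name, n_faces in faces_per_image.items():
    for _ in range(n_faces):
        face_image_map[count] = image_name   # flat face index -> source image
        count += 1

# A k-NN query returns flat face indices; map them back to image files
ids = [0, 2, 4]                      # e.g. nearest-neighbor face indices
result_images = [face_image_map[i] for i in ids]
print(result_images)                 # ['group_photo.jpg', 'group_photo.jpg', 'summit.jpg']
```

This is the same bookkeeping the `face_image_map` dictionary performs in `get_face_embeddings` above, just with file names instead of image indices.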

Extracted faces in 839 images at [556.263] seconds
Total 1530 faces detected
Preprocessed faces in [0.887] seconds
Extracted features in [12.120] seconds

As the log above shows, a total of 1,530 faces were detected in 839 images. The majority of the time was spent extracting faces from the images, and very little on extracting feature vectors from the faces.

The following are some of the search results for a given leader’s faces.


As we can see, the sensitivity of the results is very bad; we got a lot of false positives. That is because I did not train my model on the images I downloaded, so the feature representation of each image does not exactly describe the faces within it. We need a data set of images on which our model was already trained, so it knows those faces. Since the VGG models are trained on the CelebA data set, we can use that. But due to the large volume of data, and as I don't have a GPU machine, I decided to go with the MiniCelebA data set instead. It contains the same celebrity images, but fewer of them: just enough for our experiment.

I downloaded the data set and used the same techniques as before to extract features and index them.


As you can see in the results above, the first returned image exactly matches the query image, and the other images in the top 100 are also very similar in all 4 queries. That is because the model was trained on these images, so the extracted features represent them properly, unlike the images we scraped from Google. The float value on each result image is its distance to the query image, and the top image always has distance 0 in all 4 results.

I hope you now understand how to store and index images using information retrieved from neural networks. Let me know your opinion in the comments section.
