Visual control using hand gestures (MediaPipe)

We all know how Alexa/Siri lets us control applications with our voice: the machine listens to human speech and acts accordingly. Similarly, we can control applications using visual input, because the machine not only has ears (a microphone) but eyes (a camera) too. It's all thanks to the latest advances in AI, which are changing the way we interact with computers. We never know when the mouse will go extinct :P

As far as I know, AI for computer vision has advanced a lot compared to voice processing, yet there is no application as well known to end customers as Alexa/Siri (or maybe there is and I just haven't come across it). Google has released MediaPipe, which offers multiple solutions on visual data such as face detection (both ROI and 3D landmark points), hand and body pose estimation, hair segmentation, object detection, etc., which can be used to build such amazing applications.

In this post, we will learn how to perform basic operations on a webcam feed using hand gestures, but the same idea can be used in a wide variety of applications, such as editing images and videos, controlling video playback, a drag-and-drop utility for windows, etc. I will try to keep it as brief as possible because the majority of the heavy lifting is already done by Google. The code explained in this post is available here. The following video demonstration shows all the functionalities. It flickers a bit while zooming and rotating because of the lighting conditions and the low-quality webcam; when I test the same code on a Mac it works perfectly fine without any flicker, so these factors do affect hand detection and landmark prediction. Still, at the stage it is at, I'd say it's pretty good.

MediaPipe has support (wrappers) for Android, iOS, Python, C++ and JS. I have used Python because it lets us do a quick POC. I use Anaconda as the Python distribution for package management and deployment; you can install Anaconda for your OS. Once conda is set up, you can create a new environment using the following command, which will create a separate env with the latest Python version:

conda create -n visual_control python=3

Once the env is created, you can activate it and install mediapipe:

1. conda activate visual_control
2. pip install mediapipe

This will install all the necessary dependencies. To begin with, I took the initial building block from the MediaPipe Python API and modified it further to achieve three functionalities, namely

  1. Zooming in/out 
  2. Rotation
  3. Drawing on screen

For processing images and fetching the webcam feed, OpenCV is used. When we launch the webcam, OpenCV grabs images from the connected and chosen camera and returns them as numpy arrays. We first convert each frame from BGR to RGB format and pass it to the Hands.process() method, which returns the hand landmarks and handedness (left vs. right hand) of each detected hand. By default the maximum number of hands detected in a frame is 2, and we can change it as per our needs. We are mainly interested in two variables: multi_hand_landmarks and multi_handedness. multi_hand_landmarks contains a list of detected hands, and each hand contains a list of 21 landmark points as (x, y, z) values. multi_handedness contains the classification information, namely which hand (left or right), the score or confidence of the classification, and an index. Don't get misguided by this index variable, because it doesn't indicate left or right; I don't know why Google has done it this way. So one major caveat we need to take care of is how to correlate multi_hand_landmarks and multi_handedness. Luckily, the order in which multi_hand_landmarks lists the hands is exactly the order maintained in the index variable. You will see what I mean in the code below.

results = hands.process(image)
if results.multi_hand_landmarks:
    for idx, hand_landmarks in enumerate(results.multi_hand_landmarks):
        which_hand = results.multi_handedness[idx].classification[0].label
        if which_hand == "Left":
            pass  # Process left hand landmark points
        elif which_hand == "Right":
            pass  # Process right hand landmark points
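
Putting the pieces described above together, a minimal capture loop looks roughly like this. It is a sketch rather than the repo's exact code; parameters such as max_num_hands and min_detection_confidence are illustrative.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)  # grab frames from the default webcam
with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            continue
        image = cv2.flip(image, 1)  # mirror the frame, like looking into a mirror
        rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB
        results = hands.process(rgb)
        if results.multi_hand_landmarks:
            for idx, hand_landmarks in enumerate(results.multi_hand_landmarks):
                # "Left" or "Right" -- this is where the gesture handling
                # from the snippet above would plug in.
                which_hand = results.multi_handedness[idx].classification[0].label
                mp.solutions.drawing_utils.draw_landmarks(
                    image, hand_landmarks, mp_hands.HAND_CONNECTIONS
                )
        cv2.imshow("Visual control", image)
        if cv2.waitKey(5) & 0xFF == 27:  # press Esc to quit
            break
cap.release()
cv2.destroyAllWindows()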

In order to make it interesting, I have enabled the visual controls only while the left hand is closed, and all the defined operations are performed with the right hand. This can be easily changed, after all it's just a condition. The main crux of the logic resides in detecting the hand gestures. For this I have used hardcoded rules, which work pretty well as of now; if we want to detect more complex gestures, we may have to employ LSTM-based models, for which there is already a lot of interesting work. First and foremost, we need to display what is being detected and tracked by MediaPipe, so let's render the hand landmarks onto the screen to see what exactly they are.

self.drawing_styles = mp.solutions.drawing_styles
self.mp_drawing = mp.solutions.drawing_utils
self.mp_drawing.draw_landmarks(
    image,
    hand_landmarks,
    self.mp_hands.HAND_CONNECTIONS,
    self.drawing_styles.get_default_hand_landmarks_style(),
    self.drawing_styles.get_default_hand_connections_style(),
)

# Collect the normalised (x, y) coordinates of all 21 landmarks so we can
# compute a bounding box around the detected hand.
xlist = [lm.x for lm in hand_landmarks.landmark]
ylist = [lm.y for lm in hand_landmarks.landmark]
# Landmark coordinates are normalised to [0, 1]; scale them to pixel values.
minx = min(xlist) * width
maxx = max(xlist) * width
miny = min(ylist) * height
maxy = max(ylist) * height
cv2.rectangle(
    image, (int(minx), int(miny)), (int(maxx), int(maxy)), (255, 255, 255), 5
)

In order to draw just the hand landmark points, we can use the MediaPipe-provided API, the draw_landmarks method in drawing_utils.py. In the above code I am taking the liberty of drawing a rectangle around the hand just to see its boundaries. You can find the details of the landmark points at this image. As mentioned, we are going to detect 5 gestures and build 3 functions using them:

from enum import Enum

class HandGesture(Enum):
    OPEN = 1
    CLOSE = 2
    DRAW = 3
    ZOOM = 4
    ROTATE = 5

As the images are flipped, you see the left hand on the left side of each image. So if you want to use gestures to control the webcam feed, close the left hand and make the appropriate gesture with the right hand to perform the operation. In the following code, we will see how left and right handedness is used.

for idx, hand_landmarks in enumerate(results.multi_hand_landmarks):
	which_hand = results.multi_handedness[idx].classification[0].label
	if which_hand == "Left":
		# The left hand acts as the switch: editing is enabled only while it is closed
		left_hand_gesture = self.gesture_utils.determine_gesture(hand_landmarks)
		self.is_editable = left_hand_gesture == HandGesture.CLOSE
		self.image_utils.draw_hand_landmarks(drawable_img, hand_landmarks)
		which_hand = ""
	elif which_hand == "Right":
		# The right hand performs the actual operation (zoom/rotate/draw)
		if left_hand_gesture:
			image, drawable_img = self.perform_right_hand_operation(
				self.mp_hands, hand_landmarks, image, drawable_img
			)
			self.image_utils.draw_hand_landmarks(
				drawable_img, hand_landmarks
			)

In the code, I use just 3 Python files. HandController.py is the main entry point, which initializes all the MediaPipe dependencies and makes use of the other 2 classes, gesture_util.py and ImageUtil.py, to achieve all the functionalities. As the names suggest, all gestures are detected inside the gesture_util class, and ImageUtil is used to zoom and rotate the webcam images and to draw hands and references onto them.
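
Conceptually, the wiring between the three files looks roughly like this. This is a simplified sketch; the actual class and method signatures in the repo may differ.

import mediapipe as mp

class HandController:
    """Entry point: owns the MediaPipe objects and delegates to the utilities."""

    def __init__(self):
        self.mp_hands = mp.solutions.hands
        self.gesture_utils = GestureUtil(self.mp_hands)  # gesture_util.py
        self.image_utils = ImageUtil()                   # ImageUtil.py
        self.is_editable = False

    def process_frame(self, results, image, drawable_img):
        # Classify each detected hand and route it to the right handler,
        # as in the loop shown above.
        ...

class GestureUtil:
    """Maps raw landmark positions to a HandGesture value."""

    def __init__(self, mp_hands):
        self.mp_hands = mp_hands

    def determine_gesture(self, hand_landmarks): ...

class ImageUtil:
    """Zooms/rotates frames and draws landmarks, references and strokes."""

    def zoom_image(self, image, factor): ...
    def rotate_image(self, image, degrees): ...
    def draw_hand_landmarks(self, image, hand_landmarks): ...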

def determine_gesture(self, hand_landmarks):
	"""Method to determine the position of fingers irrespective of left or right hand. Details of points is available at
		https://google.github.io/mediapipe/images/mobile/hand_landmarks.png
	Args:
		hand_landmarks ((x,y) of all 21 hand landmarks): Hand landmark points position in terms of 2d coordinates

	Returns:
		HandGesture: If hand is open, close, or any other gesture
	"""
	# The thumb folds sideways across the palm, so its x-coordinates (MCP, IP, TIP)
	# are compared instead of the y-coordinates used for the other fingers
	is_thumb_closed = self.is_finger_closed(
		hand_landmarks.landmark[self.mp_hands.HandLandmark.THUMB_MCP].x,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.THUMB_IP].x,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.THUMB_TIP].x,
	)

	is_index_finger_closed = self.is_finger_closed(
		hand_landmarks.landmark[self.mp_hands.HandLandmark.INDEX_FINGER_PIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.INDEX_FINGER_DIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.INDEX_FINGER_TIP].y,
	)

	is_middle_finger_closed = self.is_finger_closed(
		hand_landmarks.landmark[self.mp_hands.HandLandmark.MIDDLE_FINGER_PIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.MIDDLE_FINGER_DIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.MIDDLE_FINGER_TIP].y,
	)

	is_ring_finger_closed = self.is_finger_closed(
		hand_landmarks.landmark[self.mp_hands.HandLandmark.RING_FINGER_PIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.RING_FINGER_DIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.RING_FINGER_TIP].y,
	)

	is_pinky_finger_closed = self.is_finger_closed(
		hand_landmarks.landmark[self.mp_hands.HandLandmark.PINKY_PIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.PINKY_DIP].y,
		hand_landmarks.landmark[self.mp_hands.HandLandmark.PINKY_TIP].y,
	)
	if (
		is_thumb_closed
		and is_index_finger_closed
		and is_middle_finger_closed
		and is_ring_finger_closed
		and is_pinky_finger_closed
	):
		return HandGesture.CLOSE
	elif (
		not is_thumb_closed
		and not is_index_finger_closed
		and is_middle_finger_closed
		and is_ring_finger_closed
		and is_pinky_finger_closed
	):
		return HandGesture.ZOOM
	elif (
		not is_thumb_closed
		and not is_index_finger_closed
		and not is_middle_finger_closed
		and is_ring_finger_closed
		and is_pinky_finger_closed
	):
		return HandGesture.ROTATE
	elif (
		is_thumb_closed
		and is_middle_finger_closed
		and is_ring_finger_closed
		and is_pinky_finger_closed
		and not is_index_finger_closed
	):
		return HandGesture.DRAW
	else:
		return HandGesture.OPEN

def is_finger_closed(self, pip, dip, tip):
	"""Method to determine whether a finger is closed or open

	Args:
		pip, dip, tip: Coordinates of three successive joints of the finger
			(base to tip) along a single axis (y for the fingers, x for the thumb)

	Returns:
		boolean: True if the finger is closed, else False
	"""
	# The finger counts as open only when the joints are ordered tip < dip < pip
	# along the given axis, i.e. the tip extends beyond the lower joints.
	return not (tip < dip and dip < pip)

In the above code block, we determine the gesture of the detected hand, whether left or right. You can find the landmark point reference in the MediaPipe hand image. is_finger_closed is the common method used to determine whether each finger is closed or open, and based on its results we determine the gesture of the hand. Note that we are not talking about temporal gestures here, but rather the static hand gesture in a single frame of the webcam feed.
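
As a quick illustration, remember that MediaPipe landmark coordinates are normalised to [0, 1] with y growing downwards in the image. The coordinate values below are made up, and `gestures` is assumed to be an instance of the gesture utility class:

# Extended finger: the tip sits above the DIP, which sits above the PIP,
# so tip < dip < pip in y and the finger is reported as open.
gestures.is_finger_closed(pip=0.55, dip=0.45, tip=0.35)  # -> False (open)

# Curled finger: the tip folds back down past the joints.
gestures.is_finger_closed(pip=0.55, dip=0.50, tip=0.60)  # -> True (closed)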

def get_angle(self, hand_landmarks):
	"""Method to find angle of rotation w.r.t thumb and middle finger

	Args:
		hand_landmarks (Mediapipe hand landmarks): Mediapipe hand landmarks

	Returns:
		Degree of rotation: Degree
	"""
	x1 = hand_landmarks.landmark[self.mp_hands.HandLandmark.THUMB_TIP].x
	y1 = hand_landmarks.landmark[self.mp_hands.HandLandmark.THUMB_TIP].y
	x2 = hand_landmarks.landmark[self.mp_hands.HandLandmark.MIDDLE_FINGER_TIP].x
	y2 = hand_landmarks.landmark[self.mp_hands.HandLandmark.MIDDLE_FINGER_TIP].y

	radians = math.atan2(y1 - y2, x1 - x2)
	degrees = math.degrees(radians)

	return degrees

This block of code finds the angle between the tip of the thumb and the tip of the middle finger, which is used to rotate the image frame. We first store the initial angle and then, from that reference, calculate the change in rotation angle and rotate the image frame accordingly. This is how the rotation functionality is achieved.
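
The rotate_image helper in ImageUtil is not shown here; a minimal sketch of how it could be implemented with OpenCV, rotating about the image centre with the angle in degrees, looks like this:

import cv2

def rotate_image(image, degrees):
    """Rotate the frame around its centre by the given angle in degrees."""
    height, width = image.shape[:2]
    center = (width / 2, height / 2)
    # Build a 2x3 affine rotation matrix (scale kept at 1.0) and apply it.
    matrix = cv2.getRotationMatrix2D(center, degrees, 1.0)
    return cv2.warpAffine(image, matrix, (width, height))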

def perform_right_hand_operation(self, mp_hands, hand_landmarks, image, drawable_img):
	self.image_height, self.image_width, _ = image.shape
	x1 = (
		hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_TIP].x
		* self.image_width
	)
	y1 = (
		hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_TIP].y
		* self.image_height
	)
	x2 = (
		hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].x
		* self.image_width
	)
	y2 = (
		hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].y
		* self.image_height
	)
	length = math.hypot(x2 - x1, y2 - y1)
	factor = length / 5

	if self.is_editable:
		right_hand_gesture = self.gesture_utils.determine_gesture(
			hand_landmarks=hand_landmarks
		)
		if right_hand_gesture == HandGesture.ZOOM:
			# -100 is the sentinel for "no previous pinch distance recorded yet"
			if self.curr_factor != -100:
				image = self.image_utils.zoom_image(
					image, self.curr_factor - factor
				)
				self.image_utils.draw_hand_reference(
					drawable_img, (int(x1), int(y1)), (int(x2), int(y2))
				)
			self.curr_factor = factor
		elif right_hand_gesture == HandGesture.ROTATE:
			degrees = self.gesture_utils.get_angle(hand_landmarks)
			if self.rotate_factor == -100.0:
				# First ROTATE frame: store the reference angle
				self.rotate_factor = degrees
			else:
				image = self.image_utils.rotate_image(
					image, int(self.rotate_factor) - int(degrees)
				)
				self.image_utils.draw_hand_reference(
					drawable_img, (int(x1), int(y1)), (int(x2), int(y2))
				)
		elif right_hand_gesture == HandGesture.DRAW:
			x = (
				hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].x
				* self.image_width
			)
			y = (
				hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].y
				* self.image_height
			)
			self.drawable_xy.append((int(x), int(y)))
			self.image_utils.draw_on_screen(image, self.drawable_xy)

	return image, drawable_img

In the perform_right_hand_operation method, we can find the logic for zooming in and out. First, we find the distance between the tip of the thumb and the tip of the index finger, and then, as the user pinches, we track the change in that distance. This change in distance is used to update the scale of the image. I could have extracted the zooming effect into a separate method in gesture_util.py, but I realised that too late; please feel free to make it better.
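
The zoom_image helper in ImageUtil is also not shown here. One possible way to implement it is to crop a centred window and resize it back to the original frame size; the sign convention and the step scaling below are assumptions, not necessarily what the repo uses:

import cv2

def zoom_image(image, delta, step=0.01):
    """Zoom into the frame around its centre based on the change in pinch distance."""
    height, width = image.shape[:2]
    # Turn the pinch delta into a zoom factor, clamped to a sensible range.
    zoom = min(max(1.0 + delta * step, 1.0), 3.0)
    # Crop a centred window that is 1/zoom of the frame, then resize it back.
    new_w, new_h = int(width / zoom), int(height / zoom)
    x0, y0 = (width - new_w) // 2, (height - new_h) // 2
    cropped = image[y0:y0 + new_h, x0:x0 + new_w]
    return cv2.resize(cropped, (width, height), interpolation=cv2.INTER_LINEAR)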

In conclusion, I wanted to demonstrate the use of MediaPipe to control a particular application, in our case the webcam, but it could be used in a vast number of applications where visual input is available. The main drawback of visual control is that it is not yet mature enough for all real-time applications: if the lighting conditions are bad, the model fails to predict the hand landmarks. The code here is minimal and fairly self-explanatory, so please feel free to clone the git repo, and do comment with interesting use cases around visual control or improvements to the existing code.
