Data generation using Google's MediaPipe
The whole project development process can be divided into three main segments:
• Data generation
• Building, training and testing of the model
• ROS integration.
This blog post focuses on the data generation process. The selection of the signs to be identified by the model was explained in the previous post; here the focus is on developing the program that captures hand landmarks. A brief introduction to the MediaPipe library is also included for reference.
Google MediaPipe:
MediaPipe is an open-source project maintained by Google that provides customizable machine learning solutions [1]. Its hand detection implementation is used in this project to collect hand information and generate the dataset. MediaPipe Hands is a hand-tracking solution that predicts hand skeletal data from a single RGB camera. It consists of two interdependent models: a palm detector model and a hand landmark model. A single-shot detector and an oriented hand bounding box are used to identify the initial hand locations and crop out the bounding box.
The hand landmark model extracts and localizes 21 2.5D landmark coordinates from the cropped hand box. Each of the 21 landmarks has x, y, and z coordinates, where z indicates relative depth. The model is capable of identifying self-occlusions and hands that are only partially visible.
Figure 1: Hand landmarks [3]
Prerequisites:
!pip install opencv-python mediapipe
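As a quick check that the installation works, a minimal sketch (not the actual project code) that opens the webcam, runs MediaPipe Hands on each frame, and draws the detected landmarks could look like this:

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
# Track a single hand; the confidence thresholds can be tuned as needed.
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures frames in BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(frame, hand_landmarks,
                                          mp_hands.HAND_CONNECTIONS)
        cv2.imshow('MediaPipe Hands preview', frame)
        if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
            break
cap.release()
cv2.destroyAllWindows()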
Saving hand landmark details as a dataset:
In order to detect hand landmarks and save them to a file, a separate Python program was used.
MediaPipe detects the positions of hand landmarks as X, Y, and Z coordinates. The Z coordinate, which represents depth (distance from the camera), is omitted because it is not needed for this particular application. Instead, the focus is on the X and Y coordinates, i.e. the horizontal and vertical positions in the image.
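MediaPipe reports each landmark's X and Y values normalized to the image width and height, so they are first converted to pixel coordinates while the Z value is simply ignored. A small sketch of that step (the helper name is chosen here for illustration):

def calc_landmark_list(image, hand_landmarks):
    """Return the 21 landmarks as [x, y] pixel coordinates, dropping z."""
    image_height, image_width = image.shape[0], image.shape[1]
    landmark_list = []
    for landmark in hand_landmarks.landmark:
        # landmark.x / landmark.y are normalized to [0, 1]; scale them to
        # pixel positions and clamp to the image bounds.
        x = min(int(landmark.x * image_width), image_width - 1)
        y = min(int(landmark.y * image_height), image_height - 1)
        landmark_list.append([x, y])
    return landmark_list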
After capturing the landmarks, the X and Y coordinates are expressed relative to the wrist (the 0th landmark). This preprocessing step involves calculating the offsets of the other landmarks from the wrist and normalizing those values. Normalization brings the values onto a common scale, so the dataset represents the ratios of distances between joints rather than raw pixel distances.
This normalization is crucial as it mitigates variations caused by different distances between the hand and the camera. By representing the distances as ratios, the dataset becomes more consistent and meaningful, improving the robustness and accuracy of gesture recognition models trained on this data.
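A minimal sketch of this preprocessing, assuming (as in the repository referenced below) that the coordinates are normalized by the largest absolute value after subtracting the wrist position:

import copy
import itertools

def pre_process_landmarks(landmark_list):
    """Convert pixel landmarks to wrist-relative, normalized values."""
    temp = copy.deepcopy(landmark_list)

    # Use the wrist (0th landmark) as the origin.
    base_x, base_y = temp[0][0], temp[0][1]
    for point in temp:
        point[0] -= base_x
        point[1] -= base_y

    # Flatten [[x0, y0], [x1, y1], ...] into [x0, y0, x1, y1, ...].
    flat = list(itertools.chain.from_iterable(temp))

    # Scale everything into [-1, 1] so the features no longer depend on
    # how far the hand is from the camera.
    max_value = max(map(abs, flat)) or 1.0
    return [v / max_value for v in flat]

Combined with the earlier helper, a dataset row is then obtained as pre_process_landmarks(calc_landmark_list(frame, hand)).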
To convert the native landmark list into a Python list, code from the following GitHub project was referenced [4]. The Python program written to collect data can be used as follows: during execution, a label for the sign being recorded must be provided as a command line argument.
Figure 2: Command to execute the program
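Assuming the label is passed as a positional argument, the invocation shown in the figure would look roughly like:

python HandSignLogging.py signA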
In the above example, the label given to the sign being collected is 'signA'. This argument is used as the label in each row of the dataset, alongside the hand landmark coordinates. Once the program is executed, a preview of the webcam feed is displayed with MediaPipe's hand landmark detections overlaid.
The program does not start recording until a hand is detected and recording is started manually: when the user is ready, pressing and holding the '0' key starts writing the landmark details, together with the label provided at launch, to the CSV file. At the same time, a count of the data rows collected in that session is shown on the preview pane. Releasing the '0' key stops recording, and pressing the Escape key exits the program. All collected data is saved to a CSV file named "keypoints.csv" inside a folder named dataset, which will later be used to train the MLP model described in an upcoming post. The same process has to be repeated for each hand sign; in every run, the new data is appended to the bottom of "keypoints.csv".
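Putting the pieces together, a simplified sketch of the recording loop described above (the key handling and file layout follow the description; the helper functions are the hypothetical ones sketched earlier, and writing the label as the first column is an assumption):

import csv
import os
import sys

import cv2
import mediapipe as mp

label = sys.argv[1]                     # e.g. 'signA'
os.makedirs('dataset', exist_ok=True)

cap = cv2.VideoCapture(0)
hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
count = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        mp.solutions.drawing_utils.draw_landmarks(
            frame, results.multi_hand_landmarks[0],
            mp.solutions.hands.HAND_CONNECTIONS)
    cv2.putText(frame, f'rows collected: {count}', (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow('Data collection', frame)

    key = cv2.waitKey(1) & 0xFF
    if key == 27:                       # Esc: exit the program
        break
    # Holding '0' appends one row per frame while a hand is detected.
    if key == ord('0') and results.multi_hand_landmarks:
        row = pre_process_landmarks(
            calc_landmark_list(frame, results.multi_hand_landmarks[0]))
        with open('dataset/keypoints.csv', 'a', newline='') as f:
            csv.writer(f).writerow([label, *row])
        count += 1

cap.release()
cv2.destroyAllWindows()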
You can view the complete code at: HandSignLogging.py
In the next blog post, I will share a video showcasing the data collection process, discuss the data preprocessing steps in detail, and reveal the initial size of the dataset.
References
[1] “MediaPipe.” https://mediapipe.dev/ (accessed Jul. 05, 2024).
[2] “hand-landmarks.png (2146×744).” https://developers.google.com/static/mediapipe/images/solutions/hand-landmarks.png (accessed Jul. 05, 2024).
[3] M. Marais, D. Brown, J. Connan, and A. Boby, “An Evaluation of Hand-Based Algorithms for Sign Language Recognition,” 5th Int. Conf. Artif. Intell. Big Data, Comput. Data Commun. Syst. icABCD 2022 - Proc., 2022, doi: 10.1109/ICABCD54961.2022.9856310.
[4] Kazuhito00, "hand-gesture-recognition-using-mediapipe," GitHub. https://github.com/Kazuhito00/hand-gesture-recognition-using-mediapipe/blob/main/README_EN.md (accessed Apr. 21, 2023).

