Audio Integration (Google's Text-to-Speech API) and Final Implementation
In this blog, I will walk you through the audio integration and final implementation of a hand gesture recognition system that uses the Mediapipe framework along with machine learning models.
Audio Integration:
After spending several weeks on hand gesture detection, I finally have a model capable of detecting and recognizing all the hand gestures mentioned in previous posts. Now it’s time to go beyond just detecting and printing the detected sign on the screen: the next step is to have the program speak the detected sign out loud. The main task now is to explore the options available to implement this feature.
During my online research, I identified two main options:
- Using Python libraries to convert text to speech
- Using an online service to convert text to speech
To give you an idea of the pros and cons of each option, I will describe the information I found under both topics.
Built-In Option
Since the main program is built using Python, it is logical to start by looking for options that are compatible with Python. One library I found was called ‘pyttsx3’. You can install it simply by using the following command:
pip install pyttsx3
The main advantage of this option is the simplicity of the approach. It requires only four lines of code:
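A minimal sketch of those four lines, assuming the default system voice:

```python
import pyttsx3

engine = pyttsx3.init()              # initialize the offline TTS engine
engine.say("Detected sign: hello")   # queue the text to be spoken
engine.runAndWait()                  # block until speech has finished
```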
Simply send the text as a string to the function, and it will speak the text out loud with good audio quality and clarity.
However, the downsides of this option are significant. The library does not handle languages other than English well. Since my project involves Arabic, its pronunciation falls far short of a native Arabic speaker’s. Thus, for my project, this is not a suitable option.
Online Option
When selecting an online option, there are two main factors to consider:
- The quality of the audio
- The time delay to process the audio
The very first online option I found was Google Text-to-Speech. Along with web links in the search results, there were YouTube tutorials that explained and showed how to use the web service through an API. Instead of reading through the documentation at Google Text-to-Speech, I decided to watch the tutorials first (just like any normal person would). Here is the first tutorial I watched, which was perfect: Google Text-to-Speech Tutorial.
The tutorial guided me through accessing the service. I needed to create an account for Google Cloud Services and connect a credit or debit card. Fortunately, Google offers $300 in credits to kickstart the project. However, the service options I plan to use do not consume these credits.
I selected an Arabic female voice as my base voice. When I sent text to convert to speech, I was amazed by the quality. The issue, however, was that the model predicts sign labels in English, so the English text was recited with an Arabic accent. To resolve this, I created a dictionary mapping the English labels to their Arabic counterparts. Sending the Arabic text to the API produced excellent results (at least according to my ears). I plan to get feedback from an Arabic-speaking friend to confirm the quality.
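As a sketch, the mapping can be a plain Python dictionary. The labels and translations below are placeholders, not the project’s actual vocabulary:

```python
# Hypothetical English-label -> Arabic-text mapping; the real keys
# come from the gesture models' label set.
EN_TO_AR = {
    "hello": "مرحبا",
    "thank you": "شكرا",
    "yes": "نعم",
    "no": "لا",
}

def to_arabic(label: str) -> str:
    # Fall back to the English label if no translation is defined.
    return EN_TO_AR.get(label, label)
```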
Even though the API provides a response, it’s still necessary to play the audio out loud. For this, I used another Python library called ‘pydub’. It allows me to directly extract the audio from the API response and play it.
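Roughly, playing the MP3 audio from the API response with pydub looks like this, assuming the response was requested with MP3 encoding and is available as raw bytes:

```python
import io

from pydub import AudioSegment
from pydub.playback import play

def play_mp3_bytes(audio_bytes: bytes) -> None:
    # Wrap the raw MP3 bytes from the TTS response in a file-like object,
    # decode them with pydub (which relies on ffmpeg), and play the result.
    segment = AudioSegment.from_file(io.BytesIO(audio_bytes), format="mp3")
    play(segment)
```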
I faced an error regarding ffmpeg; in case you run into it too, you can follow this guide: ffmpeg installation
Final Implementation
The final system brings together two pre-trained models:
- Sign Language Model: This model is designed to recognize single-hand gestures that correspond to letters or symbols in sign language.
- Number Recognition Model: This model detects gestures made with both hands to represent numbers.
These models were trained using multilayer perceptrons on a large dataset, with hyperparameter tuning performed to optimize their performance.
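As a rough illustration of that training setup only, here is a sketch using scikit-learn’s MLPClassifier with a small grid search over hyperparameters. The placeholder data, layer sizes, and grid values are assumptions; the actual dataset, framework, and tuned parameters are not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for normalized landmark features (21 points x 2 coords)
# and gesture labels.
X, y = make_classification(n_samples=500, n_features=42, n_informative=20,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small hyperparameter grid; a real search would cover more options.
param_grid = {
    "hidden_layer_sizes": [(64,), (128, 64)],
    "alpha": [1e-4, 1e-3],
}
search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```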
Code Overview
Initialization
The HandGestureRecognition class serves as the backbone of the system. During initialization, I set up Mediapipe for hand tracking and load the pre-trained models for sign and number recognition. I also configure the Google Text-to-Speech (TTS) client to vocalize recognized gestures.
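A simplified sketch of that initialization; the attribute names and the way the models are passed in are illustrative, not the exact code:

```python
import mediapipe as mp
from google.cloud import texttospeech

class HandGestureRecognition:
    def __init__(self, sign_model, number_model):
        # Mediapipe Hands for landmark detection (up to two hands).
        self.hands = mp.solutions.hands.Hands(
            max_num_hands=2, min_detection_confidence=0.7)
        # Pre-trained classifiers, loaded elsewhere and passed in here.
        self.sign_model = sign_model      # single-hand sign language model
        self.number_model = number_model  # two-hand number model
        # Google Cloud TTS client for speaking recognized gestures.
        self.tts_client = texttospeech.TextToSpeechClient()
```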
Gesture Detection
The core functionality is implemented in the detect_hands method, which captures video input from the webcam. Each frame is processed to detect hand landmarks using Mediapipe's hand tracking capabilities. The method then checks how many hands are detected (a condensed sketch follows this list):
- Single Hand Detection: When one hand is detected, the model for sign language is used to predict the gesture. It processes the detected landmarks, normalizes their coordinates, and inputs them into the model to obtain a prediction.
- Two-Hand Detection: When two hands are present, the number recognition model is employed. The landmarks from both hands are combined and processed similarly, enabling the system to interpret numerical gestures.
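The sketch below condenses that flow; the feature layout, model interface, and window handling are assumptions, and the timing logic for stable predictions is omitted:

```python
import cv2
import numpy as np

# Intended as a method of HandGestureRecognition.
def detect_hands(self):
    cap = cv2.VideoCapture(0)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB input.
        results = self.hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            hands = results.multi_hand_landmarks
            if len(hands) == 1:
                # Single hand: flatten normalized landmarks for the sign model.
                features = np.array([[p.x, p.y] for p in hands[0].landmark]).flatten()
                prediction = self.sign_model.predict([features])
            else:
                # Two hands: concatenate both hands' landmarks for the number model.
                features = np.concatenate(
                    [np.array([[p.x, p.y] for p in h.landmark]).flatten() for h in hands])
                prediction = self.number_model.predict([features])
        cv2.imshow("Hand Gesture Recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```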
Text-to-Speech Integration
Once a gesture is recognized consistently for a specified duration (e.g., five seconds), the system invokes the text-to-speech function. This function converts the recognized gesture into speech, allowing for auditory feedback. The synthesized speech is saved to an audio file and played back to the user.
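The synthesis call itself roughly follows the standard Google Cloud TTS pattern; the specific voice settings and output file name below are assumptions, and playback reuses the pydub approach shown earlier:

```python
from google.cloud import texttospeech

# Intended as a method of HandGestureRecognition.
def speak(self, text: str) -> None:
    # Request Arabic speech for the recognized gesture text.
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="ar-XA",  # Arabic voices use the ar-XA locale
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3)
    response = self.tts_client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config)

    # Save the MP3 to disk, then play it back (e.g., with play_mp3_bytes above).
    with open("output.mp3", "wb") as f:
        f.write(response.audio_content)
    play_mp3_bytes(response.audio_content)
```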
Testing and Results
The system was tested in real-time, capturing video input and evaluating gesture recognition performance. The models successfully identified gestures with high accuracy, although some challenges arose with specific hand positions or overlapping gestures. Overall, the system demonstrated a robust ability to recognize both sign language and numerical gestures.
Future Improvements
- Expanded Gesture Support: Adding more gestures to the vocabulary recognized by the models.
- Improved Robustness: Enhancing the models to handle noisy environments or occlusions better.
- User Interface Development: Creating a more user-friendly interface to improve user experience.
Conclusion
This hand gesture recognition system highlights the synergy between computer vision and machine learning technologies. With applications ranging from assisting the hearing-impaired to enabling more intuitive human-computer interactions, the potential is vast. I hope this blog inspires you to delve deeper into the world of hand gesture recognition!
