How to perform 3D human pose estimation in AI fitness coach apps

Using Human Pose Estimation in Fitness & Rehab Therapy Apps

Human pose estimation, or HPE, is a computer vision task that recognizes and tracks specific points on the human body. These points can then be used to identify motion patterns, the position of specific joints, or the pose itself. These capabilities can be applied to numerous downstream tasks such as body motion detection, posture correction, exercise supervision, or AI fitness coaching.

However, whether a particular machine learning technique can be applied to a specific task always depends on the available data and conditions. In this article, we’ll walk through the main aspects of developing fitness applications with human pose estimation, and cover related cases that apply similar principles and business logic.

What is Human Pose Estimation?

Human pose estimation is a computer vision task in which a model identifies key points on the human body, such as limbs and joints, to determine the pose a person is currently in. With HPE models we can track those points dynamically through motion in real time, which means we can analyze motion patterns and make further decisions based on this input.

Most HPE methods are based on recording an RGB image with an optical sensor, like a smartphone camera or surveillance lens, to detect body parts and track them through motion in 3D space. In conjunction with other computer vision techniques, this allows for automating routine tasks in fitness, coaching, rehabilitation, surveillance, or even some AR applications like virtual fitting rooms. 

The most basic form of human pose estimation is 2D point extraction. Such a model only takes 2D space into account and is generally less accurate because it has no depth perception. A more widespread approach is 3D pose estimation, which is generally quite accurate at tracking as long as the lighting conditions are decent. The two methods are often used together, because the 2D approach is faster at detecting the actual key points, while the 3D approach adds accuracy and a correct perception of space.

2D representation of an Albert Einstein body pose
Image source

There are three common types of human models: skeleton-based model, contour-based, and volume-based. The skeleton-based model is currently the most commonly used one in human pose estimation because of its flexibility.

Body models in human pose estimation
Image source

The output of the human pose estimation model is a map of key points resembling a skeleton we can manipulate in various ways. Depending on the task, there will be additional parameters like height of a person, or the length of an object that are extracted via pixel value calculation. All of this data can be used for comparative analysis of exercise execution, coaching, or performance analytics visualized as graphs.
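As a minimal sketch of how such key points can be turned into a metric, the snippet below computes the angle at a joint from three key points. The coordinates and the `joint_angle` helper are hypothetical and only illustrate the geometry; a real application would take the points from the output of the HPE model.

```python
import numpy as np

def joint_angle(a, b, c):
    """Return the angle at point b (in degrees) formed by the segments b-a and b-c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against tiny floating-point overshoots outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical hip, knee, and ankle key points (x, y, z), arbitrary units.
hip, knee, ankle = (0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)
knee_angle = joint_angle(hip, knee, ankle)
print(f"knee angle: {knee_angle:.1f} degrees")
```

The same function works for 2D key points, but angles measured in the image plane are distorted whenever the limb is not parallel to the camera sensor, which is one reason the depth dimension matters.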

How 3D Human Pose Estimation Works

The overall flow of a body pose estimation system starts with capturing the initial data and uploading it for a system to process. There might be multiple models that will process the sequence of frames while also considering past data to provide more accurate estimation of a pose. 

In the first phase, the HPE model will analyze each frame and detect key points on the human body. Some models use separate modules to work with 2D coordinates as they are much faster to extract and then interpret them into 3D space.

The difference between 2D and 3D pose estimation reconstructions

Video source

This enables accurate key point tracking through each movement and deals with motion blur that can occur during fast motion in bad lighting conditions. 

So for the majority of human pose estimation tasks, the flow will be broken into two parts:

  • Detecting and extracting 2D key points from the sequence of images. This entails using horizontal and vertical coordinates that build up a skeleton structure.
  • Converting 2D key points into 3D, and adding the depth dimension. 
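To make the second step more concrete, here is a simplified geometric sketch of adding the depth dimension. It assumes that a depth value is already known for every key point (for example, from a depth sensor) and that the camera follows a pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. Learned lifting models such as 3D-pose-baseline predict the 3D coordinates with a neural network instead, so this is an illustration of the idea rather than of those models.

```python
import numpy as np

def backproject(keypoints_2d, depths, fx, fy, cx, cy):
    """Convert pixel key points plus per-point depth into 3D camera coordinates."""
    kp = np.asarray(keypoints_2d, dtype=float)  # shape (J, 2): pixel u, v
    z = np.asarray(depths, dtype=float)         # shape (J,): depth along the optical axis
    x = (kp[:, 0] - cx) * z / fx
    y = (kp[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)          # shape (J, 3)

# Hypothetical values: two key points seen 2 m away by a 640x480 camera.
keypoints_2d = [(320.0, 240.0), (420.0, 240.0)]
depths = [2.0, 2.0]
points_3d = backproject(keypoints_2d, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(points_3d)
```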

However, there is often a whole spectrum of problems to deal with when developing such solutions.

Human Pose Estimation Fitness Product Development Challenges 

The most common challenges we face in developing fitness products based on human pose estimation relate either to poor data quality or to the complexity of the task itself.

Variability in Poses: Even in the context of a single sports discipline, there are thousands of different poses a model needs to recognize. The number increases once we take into account body shapes, difference in technique, and clothing. To overcome this challenge, we need to collect lots of quality data and use post-processing to increase the accuracy of key point tracking.

Occlusion Handling: In real-world scenarios, body parts can be partially or completely occluded by objects or other body parts. The application should be able to handle occlusions and still provide accurate pose estimates.

Multi-person Pose Estimation: In scenarios where multiple people are present, accurately estimating poses for all individuals in the frame can be challenging, especially when they interact or occlude each other.

Model Complexity and Size: More advanced models with high accuracy may be computationally expensive and have a large memory footprint. Balancing model complexity and performance is a challenge for deployment on various devices.

Limited Data and Annotated Datasets: Training pose estimation models requires large and diverse annotated datasets. For highly specific movements that may occur in the rehabilitation industry and sports, custom data is required to train the model.
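The post-processing mentioned under Variability in Poses can take many forms. One simple option, shown here as a sketch and not as the specific method used in any of the projects described, is to smooth the key point trajectories over time with an exponential moving average. The `alpha` parameter and the sample data are assumptions chosen for illustration.

```python
import numpy as np

def smooth_keypoints(frames, alpha=0.5):
    """Exponentially smooth key points over time.

    frames: array-like of shape (T, J, D) with T frames, J joints, D coordinates.
    alpha:  weight of the current frame; lower values smooth more but add lag.
    """
    frames = np.asarray(frames, dtype=float)
    smoothed = np.empty_like(frames)
    smoothed[0] = frames[0]
    for t in range(1, len(frames)):
        smoothed[t] = alpha * frames[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# Hypothetical single joint whose x coordinate jitters for one frame.
noisy = [[[0.0, 0.0]], [[10.0, 0.0]], [[0.0, 0.0]]]
smoothed = smooth_keypoints(noisy, alpha=0.5)
print(smoothed[:, 0, 0])
```

The trade-off is latency: stronger smoothing hides detection jitter but makes the skeleton lag behind fast movements.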

With all that, there is a list of things we need to manage concerning the data:

  • Difference in frame rate from sample to sample
  • Poor lighting conditions
  • Inappropriate camera angles
  • Artifacts like rolling-shutter effect, color bending, exposure changes, etc.
  • Low resolution videos
  • A user wearing clothes that interfere with detection of the keypoints (like dresses, robes, oversized clothes)
  • A user wearing clothes that makes them blend with the background
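Some of these issues can be partly handled in code. For the first item, differing frame rates, one possible approach is to resample every key point sequence to a common frame rate before comparing recordings. The sketch below uses linear interpolation and assumes the key points are already extracted; the function name and sample values are hypothetical.

```python
import numpy as np

def resample_keypoints(frames, src_fps, dst_fps):
    """Linearly interpolate a key point sequence from src_fps to dst_fps.

    frames: array-like of shape (T, J, D).
    """
    frames = np.asarray(frames, dtype=float)
    t_src = np.arange(len(frames)) / src_fps
    # Number of target frames that fit into the source duration (small tolerance for rounding).
    n_dst = int(np.floor(t_src[-1] * dst_fps + 1e-9)) + 1
    t_dst = np.arange(n_dst) / dst_fps
    flat = frames.reshape(len(frames), -1)
    out = np.stack(
        [np.interp(t_dst, t_src, flat[:, i]) for i in range(flat.shape[1])], axis=1
    )
    return out.reshape(n_dst, *frames.shape[1:])

# Hypothetical joint moving along x, recorded at 30 FPS and resampled to 60 FPS.
recorded = [[[0.0, 0.0]], [[2.0, 0.0]], [[4.0, 0.0]]]
resampled = resample_keypoints(recorded, src_fps=30, dst_fps=60)
print(resampled[:, 0, 0])
```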

Unlike in many other computer vision tasks, privacy concerns are less pressing in human pose estimation projects, because we can transfer and store only the keypoint information. Even when the model detects a user’s face and head movement, this data is anonymous: we extract coordinates from the image and manipulate them to achieve pose estimation results, rather than storing personal data as in face recognition tasks. These details still depend on the application case and project requirements.

TOP 5 Human Pose Estimation-based Use Cases

HPE can be considered a fairly mature technology, since groundwork already exists in fitness, rehabilitation, augmented reality, animation, gaming, robotics, and even surveillance. So now let’s talk about the existing use cases.

1. AI fitness and training applications

Human pose estimation got the most attention in the context of AI fitness applications, as it can be applied to analyze movements of athletes in different scenarios using just a smartphone camera. HPE-based fitness apps can be generally split into two categories:

  1. Sports performance analytics. These applications give athletes insights into how they perform a certain movement over a period of time, and can show accurate metrics for exercises, such as the height of the hip in a jump, lever angle in power movements, or changes in technique between repetitions.
  2. AI coaching and corrective feedback. This category is meant to show whether a user is performing the exercise with correct technique. Such hints might include posture correction, biomechanical tips, and overall mentoring through comparative training.

BeONE Sports is an example of a successful implementation of human pose estimation for fitness purposes. Created in collaboration with MobiDev, BeONE Sports is a platform for athletes that uses video recordings to analyze movement patterns and provide performance analytics. The app estimates performance, shows accuracy scoring, and gives tips for technique improvements. In addition, users can compare their performance with the listed experts or with their own workout records, which makes it a useful tool for self-coaching.

The technology behind BeONE Sports is based on a fast-performing MediaPipe model running on iOS devices (with Android and Web releases upcoming) that delivers results to the user within a few seconds. This is a strong result for a human pose estimation solution. MediaPipe optimizes the consumption of CPU, GPU, and memory when processing high-dimensional data, which preserves battery life, minimizes power consumption, and keeps performance smooth on resource-constrained devices.

If you are interested in learning more about the BeONE Sports development process, check out our case study.

By tracking the movement of the human body, we can split an exercise into eccentric and concentric phases and analyze the angles of flexion and overall posture in each of them. This is done by tracking the keypoints and presenting the analytics as hints or graphs. Note, however, that processing time and overall accuracy depend heavily on the movement type and on the quality of the data.
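As a rough sketch of the phase split, the snippet below labels each frame-to-frame step of a joint angle series. It assumes a squat-like movement in which a decreasing knee angle corresponds to the lowering (eccentric) phase and an increasing angle to the lifting (concentric) phase; this mapping is exercise-specific, and the threshold `eps` and the sample angles are hypothetical.

```python
import numpy as np

def label_phases(angles, eps=1.0):
    """Label each step between consecutive frames by the direction of the angle change.

    angles: joint angle per frame, in degrees.
    eps:    changes smaller than this (in degrees) are treated as a hold.
    """
    labels = []
    for delta in np.diff(np.asarray(angles, dtype=float)):
        if delta < -eps:
            labels.append("eccentric")   # angle closing: lowering phase (squat assumption)
        elif delta > eps:
            labels.append("concentric")  # angle opening: lifting phase (squat assumption)
        else:
            labels.append("hold")
    return labels

# Hypothetical knee angles over six frames of one squat repetition.
knee_angles = [170.0, 140.0, 100.0, 100.5, 130.0, 165.0]
phases = label_phases(knee_angles)
print(phases)
```

In practice the angle series would usually be smoothed first, since detection jitter can otherwise flip the label from frame to frame.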

2. Rehabilitation and physiotherapy applications

Another HPE use case is tracking human activity during rehabilitation exercises. The major difference between fitness and rehab cases is that rehab requires much higher accuracy in detecting keypoints and how they change during the movement, because this is a critical aspect of treating injuries.

This category of applications relates to the telemedicine industry, which is why healthcare regulations will apply to the project requirements. However, as we mentioned earlier, human pose estimation is a good fit for this type of project, because we can track and store only key point information without preserving any user data. 

3. Virtual shopping applications

Augmented reality-based applications like virtual fitting rooms can benefit from human pose estimation as one of the most advanced methods of detecting and recognizing the position of a human body in space. This can be used in ecommerce, where shoppers struggle to test the fit of clothes before buying.

Human pose estimation can be applied to track key points on the human body and pass this data to the augmented reality engine that will fit clothes on the user. This can be applied to any body part and type of clothes, or even face masks. We’ve described our experience of using human pose estimation for virtual fitting rooms in a dedicated article.

4. Animation and gaming applications

Game development is a tough industry with a lot of complex tasks that require knowledge of human body mechanics. Body pose estimation is widely used in animation of game characters to simplify this process by transferring tracked key points in a certain position to the animated model. 

The process of this work resembles motion tracking technology used in video production, but doesn’t require a large number of sensors placed on the model. Instead, we can use multiple cameras to detect the motion pattern and recognize it automatically. The data fetched then can be transformed and transferred to the actual 3D model in the game engine. 

5. Surveillance and human activity tracking apps

Some surveillance cases don’t require spotting a crime in a crowd of people. Instead, cameras can be used to automate everyday processes like shopping at a grocery store. 

Cashierless store systems like Amazon Go, for example, apply human pose estimation to understand whether a person took an item from a shelf. HPE is used in combination with other computer vision technologies, which allows Amazon to automate the checkout process in their stores using, among other things, a network of camera sensors and IoT devices.

Human pose estimation is responsible for the part of the process where the actual area of contact with the product is not visible to the camera. So here, the HPE model analyzes the position of customers’ hands and heads to understand if they took the product from the shelf, or left it in place.

Now that we’ve discussed the basics of HPE and its broad range of applications, let’s look deeper into specific aspects such as the difference between 2D and 3D key point detection, the accuracy of movement tracking, the training procedure, and additional features.

Our Approach to Building Real-time 3D Human Pose Estimation-based Applications

Whether we deal with a fitness app, an app for rehabilitation, face masks, or surveillance, real-time processing is essential. Of course, the performance of the model will depend on the chosen algorithm and hardware, but the majority of existing open-source models have quite a long response time, and those that respond quickly tend to sacrifice accuracy. So is it possible to improve existing 3D human pose estimation models to achieve acceptable accuracy with real-time processing?

While models like BlazePose can provide real-time processing, the accuracy of their tracking is not suitable for commercial use or complex tasks. In our experiment, we tested the 2D component of BlazePose together with a modified 3D-pose-baseline model, implemented in Python.

In terms of speed, our model achieves about 46 FPS on our test hardware without video rendering, while the 2D pose detection model alone produces keypoints at about 50 FPS. By comparison, the modified 3D baseline model can produce keypoints at about 780 FPS. A detailed breakdown of the processing time of our approach is presented below.

BlazePose 2D + 3D-pose-baseline performance in percent

While this approach doesn’t guarantee reliability in complex scenarios with dim lighting or unusual poses, standard videos can be processed in real time. Generally, the accuracy of model predictions will depend on the training and the chosen architecture.
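The throughput figures reported above are consistent with a simple back-of-the-envelope model: if the 2D and 3D stages run one after another for every frame, their per-frame times add up. The sketch below applies this assumption to the reported 50 FPS and 780 FPS; the helper names are hypothetical, and the sequential-execution assumption is ours and is not confirmed by the measurements.

```python
def combined_fps(*stage_fps):
    """Throughput of stages executed sequentially per frame: per-frame times add up."""
    return 1.0 / sum(1.0 / fps for fps in stage_fps)

def time_shares(*stage_fps):
    """Fraction of the per-frame time spent in each stage."""
    times = [1.0 / fps for fps in stage_fps]
    total = sum(times)
    return [t / total for t in times]

fps = combined_fps(50.0, 780.0)
shares = time_shares(50.0, 780.0)
print(f"combined: {fps:.1f} FPS")
print(f"2D stage: {shares[0]:.0%} of frame time, 3D stage: {shares[1]:.0%}")
```

Under this model the pipeline would reach roughly 47 FPS, close to the measured value of about 46 FPS, with the 2D detector accounting for the vast majority of the per-frame time.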

3D Pose estimation Performance and Accuracy

Accuracy of key point tracking is basically the main parameter that determines how successful our model is at estimating the position of a human body. For the sake of explanation, let’s take VideoPose3D and BlazePose for comparison, and see how they correlate in terms of performance and accuracy. 

We tested the BlazePose and VideoPose3D models on the same hardware using a 5-second video with a resolution of 2160×3840 at 60 frames per second, which is 300 frames in total. VideoPose3D took 8 minutes to process the video (well under one frame per second) and produced good accuracy. BlazePose, in contrast, processed 3-4 frames per second, which makes it a far more realistic candidate for real-time applications. However, its accuracy results shown below don’t meet the objectives of any HPE task.

VideoPose3D and BlazePose processing results

Video source

The processing time depends on the movement complexity, video and lighting quality, and the 2D pose detector module. Given the fact that BlazePose and VideoPose3D have different 2D detectors, this stage appears to be a performance bottleneck in both cases.

One possible way to optimize HPE performance is to accelerate 2D keypoint detection. Existing 2D detectors can be modified, or complemented with post-processing stages, to improve overall accuracy.

How to Train a Human Pose Estimation Model?

Human pose estimation is a machine learning technology, which means you’ll need data to train it. Since human pose estimation handles the difficult task of detecting and recognizing multiple objects on the screen, neural networks are used as its engine. Training a neural network requires enormous amounts of data, so the most practical way is to use existing public datasets.

The majority of these datasets are suitable for fitness and rehab applications with human pose estimation. But this doesn’t guarantee high accuracy in terms of more unusual movements or specific tasks like surveillance or multi-person pose estimation.

How to Avoid Training Human Pose Estimation from Scratch?

New human pose estimation models appear rapidly, as the technology is actively evolving. This gives us options in terms of pretrained models tailored for different tasks. To analyze existing approaches and models, we used Human3.6M as an evaluation dataset.

Evaluation of open source HPE model performance using Human3.6M dataset

The evaluation metric is MPJPE (Mean Per Joint Position Error), which is the distance between predicted and ground-truth joint positions, averaged over all joints and measured in millimeters. In other words, this metric shows how accurately each specific model detects joints over time. The graph represents the analysis of several open-source models trained for human pose estimation tasks.
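A minimal sketch of how MPJPE can be computed is shown below. It takes predicted and ground-truth joint positions and averages the Euclidean distance over all joints (and frames, if present). Published evaluation protocols typically also align the poses at a root joint first; that step is left out here, and the sample coordinates are hypothetical.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean Per Joint Position Error.

    pred, target: arrays of shape (..., J, 3) in the same units (e.g. millimeters).
    Returns the Euclidean distance averaged over all joints and leading dimensions.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.linalg.norm(pred - target, axis=-1).mean())

# Hypothetical two-joint pose in millimeters: one joint is off by 5 mm, the other is exact.
ground_truth = [[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]
prediction = [[3.0, 4.0, 0.0], [100.0, 0.0, 0.0]]
error_mm = mpjpe(prediction, ground_truth)
print(f"MPJPE: {error_mm:.1f} mm")
```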

Based on our experiments with such models, we can conclude that some of them can be modified to implement real-time processing with comparably high FPS. The performance of a model depends for the most part on its 2D detector module, which enables us to implement a high-performance model for most business cases, including mobile applications.

3D Mesh Reconstruction with Human Pose Estimation

A separate task in human pose estimation is the reconstruction of a human body from a photograph or video frame. While traditional 2D human pose estimation predicts the 2D coordinates of each body joint, with a 3D mesh we can estimate the depth of each joint from the camera and build up the human body considering its volume.

3D mesh reconstruction of human body

3D mesh representation

Source: EasyMocap 

There are a number of different methods for extracting the depth information that is required to recreate the volume of objects:

Depth sensors like lidars can provide the system with additional data on depth which can be used to build up the volume of a 2D body.

Multi-View Reconstruction entails the use of multiple cameras to shoot the object from different angles. The combined images or video frames are then used to estimate the pose information and reconstruct 3D mesh more accurately. Frameworks like EasyMocap are available for working with multi-view reconstruction on top of the pose estimation.

Monocular Depth Estimation relies on single-camera images to estimate 3D depth through techniques like structure from motion, depth from focus, or deep learning-based estimation models.
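For the multi-view case, the core geometric step is triangulation: recovering the 3D position of a key point from its pixel coordinates in two or more calibrated cameras. The sketch below uses the standard linear (DLT) triangulation for two views. The camera matrices and the joint position are hypothetical, and frameworks like EasyMocap implement far more complete pipelines than this.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2:   3x4 camera projection matrices.
    uv1, uv2: pixel coordinates of the same key point in each view.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point into pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: identical intrinsics, second camera shifted 0.5 m along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

joint = np.array([0.2, -0.1, 3.0])  # ground-truth 3D position in meters
recovered = triangulate(P1, P2, project(P1, joint), project(P2, joint))
print(recovered)
```

With noise-free projections the original point is recovered up to floating-point error; with real detections, more views and a robust solver help absorb key point noise.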

3D mesh is the basis of technologies like virtual fitting rooms, virtual avatars, try-on applications, and 3D modeling. While there are ways to fetch depth data, 3D mesh reconstruction is a tedious task that requires gathering a lot of data from different sensors. This is why it is less common in real-time use and more often applied in stationary setups like the motion capture technology used in cinema production.

Human Pose Estimation to Enhance Fitness App Development

The fitness industry benefits the most from the development of motion tracking technologies, because algorithms like human pose estimation can replace smart wearables and don’t require much gear from the user. While it is too early to speak about AI coaching as a standalone discipline, there are other areas where HPE comes in handy, like technique correction or performance comparison with other trainers. However, human pose estimation projects can get quite complex and require expertise in a number of domains. If you’re planning such a project, or need machine learning consulting, contact us to discuss details and get rough estimates.
