AR & AI Technologies For a Virtual Fitting Room Development

Virtual Fitting Room Development Using AR & AI Technologies

I hate shopping in brick and mortar stores. However, my interest in virtual shopping is not limited to the buyer experience only. With the MobiDev DataScience department, I’ve gained experience in working on AI technologies for virtual fitting. The goal of this article is to describe how these systems work from the inside.

How Virtual Fitting Technology Works

A few years ago, the “Try before you buy” strategy was an efficient customer engagement method in clothing stores. Now, this strategy exists in the form of virtual fitting rooms. Fortune Business Insights predicted that the virtual fitting room market will reach USD 10.00 billion by 2027.

To better understand the logic of virtual fitting room technology, let’s review the following example. Some time ago, we worked on an Augmented Reality (AR) footwear fitting room project. The fitting room works in the following way:

  1. The input video is split into frames and processed with a deep learning model which estimates the position of a set of specific leg and feet keypoints.
  2. A 3D model of footwear is placed according to the detected keypoints so that its orientation looks natural to the user.
  3. A 3D footwear model is rendered so that each frame displays realistic textures and lighting.
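
Step 2 can be illustrated with a minimal sketch. It assumes the pose estimation model returns 3D heel and toe keypoints in a y-up coordinate system; the function name, the choice of keypoints, and the axis convention are illustrative assumptions rather than the project's actual implementation.

```python
import math

def place_shoe(heel, toe, model_length=1.0):
    """Return (position, yaw, scale) for a shoe model from two 3D keypoints.

    heel, toe    -- (x, y, z) keypoints; hypothetical output of a pose model
    model_length -- heel-to-toe length of the unscaled 3D shoe model
    Assumes a y-up coordinate system (an assumption, not from the article).
    """
    dx, dy, dz = (t - h for t, h in zip(toe, heel))
    foot_length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if foot_length == 0:
        raise ValueError("heel and toe keypoints coincide")
    # Anchor the model at the midpoint between heel and toe.
    position = tuple((h + t) / 2 for h, t in zip(heel, toe))
    # Heading around the vertical axis; 0 means the toe points along +z.
    yaw = math.atan2(dx, dz)
    # Uniform scale so the model's length matches the detected foot length.
    scale = foot_length / model_length
    return position, yaw, scale
```

A production system would likely use more keypoints to recover the full 3D rotation (pitch and roll as well as yaw), but the idea of deriving a transform from keypoint geometry stays the same.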

Utilization of ARKit for 3D human body pose estimation and 3D model rendering


When working with ARKit (the Augmented Reality framework for Apple’s devices), we discovered that it has limitations. As you can see from the video above, the tracking accuracy is too low to use it for footwear positioning. The likely cause is that the framework maintains inference speed at the expense of tracking accuracy, a trade-off that matters for apps working in real time.

One more issue was the poor identification of body parts by the ARKit algorithm. Since this algorithm is designed to identify the whole body, it doesn’t detect any keypoints if the processed image contains only a part of the body. This is exactly the case for a footwear fitting room, where the algorithm is supposed to process only a person’s legs.

The conclusion was that virtual fitting room apps might require additional functionality on top of the standard AR libraries. Thus, it’s recommended to involve data scientists in developing a custom pose estimation model that detects keypoints when only one or two feet are in the frame and operates in real time.

Virtual Try-on Solutions Overview

The virtual fitting room technology market provides offerings for accessories, watches, glasses, hats, clothes, and others. Let’s review how some of these solutions work under the hood.

Virtual Try-on Watches

A good example of virtual watch try-on is the AR-Watches app, which allows users to try on various watches. The solution is based on the ARTag technology, which uses specific markers printed on a band. The user wears the band on the wrist in place of a watch to start the virtual try-on. The computer vision algorithm processes only those markers visible in the frame and identifies the camera’s position in relation to them. After that, to render a 3D object correctly, the virtual camera should be placed at the same location.
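
The relation between the marker and the virtual camera can be sketched as follows. The sketch assumes a marker detector (not named in the article) that returns a rotation matrix R and a translation vector t mapping marker coordinates to camera coordinates; the function name is hypothetical.

```python
def camera_pose_in_marker_frame(R, t):
    """Invert the rigid transform X_cam = R * X_marker + t.

    R -- 3x3 rotation matrix as nested lists (assumed detector output)
    t -- translation as a 3-element sequence (assumed detector output)
    Returns (R_inv, center): the camera orientation and the camera centre,
    both expressed in the marker's coordinate system.
    """
    # The inverse of a rotation matrix is its transpose.
    R_inv = [[R[row][col] for row in range(3)] for col in range(3)]
    # The camera centre C satisfies R * C + t = 0, so C = -R^T * t.
    center = tuple(-sum(R_inv[i][j] * t[j] for j in range(3)) for i in range(3))
    return R_inv, center
```

Placing the renderer's virtual camera at this pose, with the 3D watch model at the marker origin, makes the rendered object line up with the band in the frame.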

Overall, this technology has its limits (for instance, not everybody has a printer at hand to print out the ARTag band). But if it matches the business use case, it wouldn’t be that difficult to create a product of production-ready quality. Probably the most important part would be creating proper 3D objects to use.

3D model rendering of a watch using the ARTag technology (source)

Virtual Try-on Shoes

The Wanna Kicks and SneakerKit apps are a good demonstration of how AR and deep learning technologies can be applied to footwear.

Virtual shoes try-on, Wanna Kicks app (source)

Technically, such a solution utilizes a foot pose estimation model based on deep learning. This technology can be considered a particular case of the widespread full-body 3D pose estimation models, which estimate the positions of selected keypoints in 3D space either directly or by lifting detected 2D keypoint positions into 3D coordinates.

3D foot pose estimation (source)

Once the 3D keypoints of the feet are detected, they can be used to create a parametric 3D model of a human foot. The 3D footwear model is then positioned and scaled according to the geometric properties of that parametric model.

Positioning of a 3D model of footwear on top of a detected parametric foot model (source)

Compared to full-body or face pose estimation, foot pose estimation still poses certain challenges. The main issue is the lack of the 3D annotation data required for model training.

However, there are two practical ways around this problem. The first is synthetic data: rendering photorealistic 3D models of human feet with known keypoints and training a model on that data. The second is photogrammetry: reconstructing a 3D scene from multiple 2D views, which reduces the amount of manual labeling needed.
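
With synthetic data, the appeal is that labels come for free: the 3D keypoints of the rendered foot model are known, so their 2D positions in each rendered image follow from the camera model. Below is a minimal sketch using a pinhole camera; the function name and the intrinsic parameters (fx, fy, cx, cy) are illustrative assumptions.

```python
def project_keypoints(points_3d, fx, fy, cx, cy):
    """Project camera-space 3D keypoints to pixel coordinates (pinhole model).

    points_3d -- iterable of (x, y, z) with z > 0 (in front of the camera)
    fx, fy    -- focal lengths in pixels (assumed known for the renderer)
    cx, cy    -- principal point in pixels (assumed known for the renderer)
    """
    labels = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("point is behind the camera")
        labels.append((fx * x / z + cx, fy * y / z + cy))
    return labels
```

Each rendered image can then be stored together with its 3D keypoints and the projected 2D labels, without any manual annotation.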

This kind of solution is considerably more complicated. To enter the market with a ready-to-use product, you need to collect a large enough foot keypoint dataset (using synthetic data, photogrammetry, or a combination of both), train a customized pose estimation model (one that combines sufficient accuracy with sufficient inference speed), test its robustness in various conditions, and create a foot model. We consider it a medium-complexity project in terms of technologies.

Virtual Try-on Glasses

FittingBox and Ditto use AR technology for virtual glasses try-on. The user chooses a pair of glasses from a virtual catalog, and it is rendered over their eyes.

Virtual glasses try-on and lenses simulation (source)

This solution is based on the deep learning-powered pose estimation approach utilized for facial landmarks detection, where the common annotation format includes 68 2D/3D facial landmarks.

68 facial landmarks for face pose estimation

Example of facial landmark detection in video. Note that the model in the video detects more than 68 landmarks (source)

Such an annotation format makes it possible to differentiate the face contour, nose, eyes, eyebrows, and lips with sufficient accuracy. The face landmark estimation model does not have to be built from scratch: open-source libraries such as Face Alignment provide face pose estimation functionality out of the box.
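
To show how such landmarks can drive glasses rendering, here is a minimal sketch. It assumes the detector follows the widely used 68-point index convention (0-based, as in the iBUG 300-W annotations) and returns 2D pixel coordinates; the function names are hypothetical and not taken from any of the products mentioned.

```python
import math

# Index ranges of the common 68-point annotation (assumed convention).
LANDMARK_GROUPS = {
    "jaw": range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow": range(22, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}

def centroid(points):
    """Average of a list of (x, y) points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def glasses_anchor(landmarks):
    """Return (center, eye_distance, roll) for placing a glasses overlay.

    landmarks -- list of 68 (x, y) pixel coordinates
    roll      -- in-plane head tilt in radians, 0 when the eyes are level
    """
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    right = centroid([landmarks[i] for i in LANDMARK_GROUPS["right_eye"]])
    left = centroid([landmarks[i] for i in LANDMARK_GROUPS["left_eye"]])
    center = ((right[0] + left[0]) / 2, (right[1] + left[1]) / 2)
    dx, dy = left[0] - right[0], left[1] - right[1]
    return center, math.hypot(dx, dy), math.atan2(dy, dx)
```

The eye distance gives the scale of the frames and the roll angle gives their in-plane rotation; a full 3D try-on would additionally need head yaw and pitch, which 3D landmarks can provide.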

In terms of technologies, this kind of solution is not that complicated, especially if a pre-trained model is used as the basis for the facial landmark detection task. But it’s important to consider that low-quality cameras and poor lighting conditions could be limiting factors.

Virtual Try-on Surgical Masks

Amidst the COVID-19 pandemic, ZapWorks launched an AR-based educational app that instructs users on how to wear surgical masks properly. Technically, this app is also based on a 3D facial landmark detection method. As in the glasses try-on app, this method provides the information about facial features that is needed to render the mask.

AR for mask wear guide (source)

Virtual Try-on Hats

Given that facial landmark detection models work well, another frequently simulated AR item is hats. All that is required to render a hat correctly on a person’s head is the 3D coordinates of several keypoints marking the temples and the center of the forehead. Virtual hat try-on apps have already been launched by QUYTECH, Banuba, and Vertebrae.

Baseball cap try-on (source)

Virtual Try-on Clothes

Compared to shoes, masks, glasses, and watches, virtual try-on of 3D clothing still remains a challenge. The reason is that clothes deform as they take the shape of a person’s body. Thus, for a proper AR experience, a deep learning model should identify not only the basic keypoints on the human body’s joints but also the body shape in 3D.

Consider DensePose, one of the most recent deep learning models, which maps the pixels of an RGB image of a person to the 3D surface of the human body. It is still not quite suitable for augmented reality: its inference speed is not appropriate for real-time apps, and its body mesh detections are not accurate enough for fitting 3D clothing items. Improving the results requires collecting more annotated data, which is a time- and resource-consuming task.

The alternative is to use 2D clothing items and 2D silhouettes of people. That’s what Zeekit does, letting users apply a number of clothing types (dresses, pants, shirts, etc.) to their photo.

2D clothing try-on, Zeekit (source)

Strictly speaking, transferring 2D clothing images cannot be considered Augmented Reality, since the “Reality” aspect implies real-time operation. However, it can still provide an unusual and immersive user experience. The underlying technologies include Generative Adversarial Networks, Human Pose Estimation, and Human Parsing models. The 2D clothes transferring algorithm may look as follows:

  1. Identification of areas in the image corresponding to the individual body parts
  2. Detection of the position for identified body parts
  3. Production of a warped image of the transferred clothing
  4. Application of the warped image to the image of the person with minimal artifacts
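
The last step can be shown in its simplest form as mask-based blending. This is only a sketch of the compositing idea, assuming steps 1 to 3 have already produced a warped clothing image and a clothing mask; models such as GANs replace this plain blend with a learned refinement. The function name and the data layout are illustrative assumptions.

```python
def composite(person, clothing, mask):
    """Blend a warped clothing image over a person image, pixel by pixel.

    person, clothing -- H x W nested lists of (r, g, b) tuples, same size
    mask             -- H x W nested list of floats in [0, 1];
                        1 means clothing pixel, 0 means keep the person pixel
    """
    result = []
    for person_row, clothing_row, mask_row in zip(person, clothing, mask):
        row = []
        for p_px, c_px, m in zip(person_row, clothing_row, mask_row):
            # Linear interpolation per colour channel.
            row.append(tuple(round(m * c + (1 - m) * p)
                             for p, c in zip(p_px, c_px)))
        result.append(row)
    return result
```

Soft mask values between 0 and 1 along the clothing boundary smooth the transition, which is one simple way to reduce visible edge artifacts.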

Our Experiments with 2D Clothes Transferring

Since there are no ready-made pre-trained models for the virtual dressing room, we researched this field by experimenting with the ACGPN model. The idea was to explore the outputs of this model in practice for 2D clothes transferring using various approaches.

The model was applied to people’s images in constrained (samples from the training dataset, VITON) and unconstrained (any environment) conditions. In addition, we tested the limits of the model’s capabilities by not only running it on custom persons’ images but also using custom clothing images that were quite different from the training data.

Here are examples of results we received during the research:

1. Replication of results described in the “Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content” research paper, with the original data and our preprocessing models:

Using deep learning for virtual clothing replacement

Successful (A1-A3) and unsuccessful (B1-B3) replacement of clothing


  • B1 – poor inpainting
  • B2 – new clothes overlapping
  • B3 – edge defects

2. Application of custom clothes to default person images:

Clothing replacement using custom clothes


  • Row A – no defects 
  • Row B – some defects to be moderated 
  • Row C – critical defects 

3. Application of default clothes to the custom person images:

Outputs of clothing replacement on images with an unconstrained environment


  • Row A – edge defects (minor)
  • Row B – masking errors (moderate)
  • Row C – inpainting and masking errors (critical)

4. Application of custom clothes to the custom person images:

Clothing replacement with the unconstrained environment and custom clothing images


  • Row A – best results obtained from the model
  • Row B – many defects to be moderated
  • Row C – most distorted results

When analyzing the outputs, we discovered that virtual clothes try-on still has certain limitations. The main one is that the training data must contain paired images of the target clothing item and of people wearing it. In a real-world business scenario, such data may be challenging to collect. The other takeaways from the research are:

  • The ACGPN model outputs rather good results on the images of people from the training dataset. It is also true if custom clothing items are applied.
  • The model is unstable when it comes to processing the images of people captured in varying lighting, other environmental conditions, and unusual poses.
  • The technology for creating virtual dressing room systems for transferring 2D clothing images onto the image of the target person in the wild is not yet ready for commercial applications. However, if the conditions are static, the expected results can be much better.
  • The main limiting factor that holds back the development of better models is the lack of diverse datasets with people captured in outdoor conditions.

In conclusion, I’d say that current virtual fitting rooms work well for items related to separate body parts like the head, face, feet, and arms. But for items that require the human body to be fully detected, estimated, and modified, virtual dressing is still in its infancy. However, AI evolves in leaps and bounds, and the best strategy is to stay tuned and keep trying.
