
How to Build an AI Assistant: Virtual Assistant Technology Guide 2024

20 min read



Smart virtual assistants are penetrating every business area, improving brand image and reducing the burden on customer support employees. In recent years, thanks to Artificial Intelligence, these solutions have reached a new cutting-edge stage of development, from GPT to realistic digital humans.

If you are planning to create an innovative product inspired by these examples, this article will walk you through the approaches and challenges of virtual assistant app development based on our experience. With good planning and execution, your own AI assistant can help you stand out in the market, so let’s get started!

Types of AI Virtual Assistant Apps

There are several different types of AI virtual assistants: chatbots, voice assistants, AI avatars, and domain-specific virtual assistants.

  • Chatbots have been mainstream in the eCommerce sector since their inception. Still, modern implementations of chatbots are powered by Artificial Intelligence, which gives them the ability to think through customer queries rather than push the customer through a chain of static events.
  • Voice assistants use automatic speech recognition and Natural Language Processing to give vocal responses to queries. Well-known examples of such voice assistants are Siri and Google Assistant products.
  • AI avatars are 3D models designed to look like humans. They are used in entertainment applications or to give a human touch to virtual customer support interactions. Cutting-edge technology from companies like NVIDIA can produce nearly true-to-life human avatars in real-time.
  • Domain-specific virtual assistants are highly specialized implementations of AI virtual assistants designed for very specific industries, and are optimized for high performance in travel, finance, engineering, cybersecurity, and other demanding sectors. 

AI assistants can also be divided into three main types: voice-activated, task-oriented, and predictive.

  1. Voice-activated assistants, for example, Siri or Alexa, are triggered by voice commands and are designed for simple tasks like searching for information, setting alarms, or playing music.
  2. Task-oriented assistants are built for specific purposes, for example, to schedule appointments, and send or organize e-mails.
  3. Predictive assistants, like Google Now or Cortana, utilize ML algorithms to predict a user’s needs and offer relevant information and services before they even ask.

While all these assistants solve the same business needs, from a technical implementation perspective, each project has its own features and development challenges. We’ll talk about them later in this article.

The Technologies Behind AI Assistants

Before you go ahead and build an AI assistant, you have to understand the basics of how one works. Here are some of the key technologies powering AI virtual assistants and delivering productivity, convenience, and cost-saving benefits. Let’s dive into the details.

1. Speech-to-text (STT) and text-to-speech (TTS)

Speech-to-text technology transforms human speech into digital signals. Here is a simple explanation of how it works. When a person speaks, they create a series of vibrations. Using an analog-to-digital converter (ADC), the system converts these vibrations into digital signals, extracts sounds, then segments them and compares them to existing phonemes. Using complex mathematical models, the software matches these phonemes with individual words and phrases and creates a text interpretation of what the person articulated.

Text-to-speech technology is based on the opposite algorithm: it converts text into voice output. TTS simulates human speech from text using machine learning. Transforming text to voice involves three steps: the system converts text to words, performs phonetic transcription, and then converts the transcription to speech.
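The three TTS steps above can be sketched as a toy pipeline. The phoneme dictionary and the final "synthesis" step are simplified stand-ins for a real pronunciation lexicon and acoustic model, which would emit audio rather than phoneme labels.

```python
# A toy sketch of the three TTS steps: text normalization,
# phonetic transcription, and (simulated) synthesis.
# PHONEME_DICT is a tiny illustrative stand-in for a real
# pronunciation lexicon or grapheme-to-phoneme model.

PHONEME_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def normalize(text: str) -> list[str]:
    """Step 1: convert raw text into a list of words."""
    return [w.strip(".,!?").lower() for w in text.split()]

def transcribe(words: list[str]) -> list[str]:
    """Step 2: map each word to its phoneme sequence."""
    phonemes: list[str] = []
    for word in words:
        phonemes.extend(PHONEME_DICT.get(word, ["<UNK>"]))
    return phonemes

def synthesize(phonemes: list[str]) -> str:
    """Step 3: a real system would emit audio; here we just
    join phoneme labels to show the pipeline's output shape."""
    return "-".join(phonemes)

audio = synthesize(transcribe(normalize("Hello, world!")))
print(audio)  # HH-AH-L-OW-W-ER-L-D
```

In a production system each of these functions would be a trained model, but the data flow between stages stays the same.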

Speech-to-text (STT) and text-to-speech (TTS) are employed in virtual assistant software to provide smooth and efficient communication between users and applications. To transform a simple voice assistant with static commands into a more sophisticated AI assistant, the application also needs the ability to interpret user requests with intelligent tagging and heuristics.

2. Computer Vision (CV)

Computer vision is an area of AI that trains machines to interpret and understand visual signals. With digital images from cameras and videos and deep learning models, computers can identify and classify objects accurately, and then respond to that input. CV is an essential part of creating visual virtual assistants. These assistants can respond with generated videos in addition to sounds, which significantly enriches the user experience. 

Computer vision helps to recognize and interpret body language, which is a crucial aspect of communication. A visual virtual assistant with CV uses a camera and real-time face detection to notice when a person is looking at the screen; this transmits a signal to the system, which then transforms the user’s speech into text. CV can also considerably improve the accuracy of speech recognition by comparing what the person has said verbally to the movement of their face and mouth.
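The "only listen while the user looks at the screen" behavior described above amounts to gating the speech pipeline on the detector's output. A minimal control-flow sketch, with the detector and recognizer stubbed out (a real system would use, for example, an OpenCV face detector and an STT model):

```python
# Gate speech recognition on face detection: audio is only
# forwarded to STT while a face is visible. The detector and
# recognizer below are stubs standing in for real CV/STT models.

def make_assistant(detector, recognizer):
    """Build a per-frame handler that forwards audio to speech
    recognition only while the detector reports an attending user."""
    def process_frame(frame, audio_chunk):
        if detector(frame):
            return recognizer(audio_chunk)  # speech -> text
        return None  # user not attending; drop the audio
    return process_frame

# Illustrative stubs:
detector = lambda frame: frame.get("face_visible", False)
recognizer = lambda audio: f"transcript:{audio}"

process = make_assistant(detector, recognizer)
print(process({"face_visible": True}, "hi"))   # transcript:hi
print(process({"face_visible": False}, "hi"))  # None
```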

3. Natural Language Processing (NLP) 

To process and interpret the data further, we need Natural Language Processing (NLP). NLP simplifies the speech recognition process. While there are AI models pre-trained on numerous voice samples, it may be necessary to add unique data from your customers to increase accuracy. If the AI assistant is intended to respond verbally, a speech synthesis solution, such as Google Cloud Text-to-Speech, will be required.

At the same time, speech processing alone is not enough to capture a person’s intention and maintain effective communication. The request needs to be interpreted correctly, and that is where Natural Language Understanding (NLU) comes in: it analyzes natural language without standardizing it and derives meaning from questions by identifying the context. In other words, NLP processes grammar and structure, while NLU tries to analyze the actual intention behind the query.
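The NLP-versus-NLU split above can be illustrated with a toy example: one function does surface-level processing (tokens), another maps those tokens to the user's goal. The keyword rules and intent names are illustrative stand-ins for a trained intent classifier.

```python
# A toy contrast between NLP-style processing (tokens, structure)
# and NLU-style intent detection. The keyword table is a stand-in
# for a trained intent classifier.

def nlp_tokenize(text: str) -> list[str]:
    """Surface-level processing: split the query into tokens."""
    return text.lower().strip("?!.").split()

INTENT_KEYWORDS = {
    "check_balance": {"balance", "account"},
    "transfer_money": {"transfer", "send"},
}

def nlu_intent(tokens: list[str]) -> str:
    """Intent-level understanding: map tokens to the user's goal."""
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords & set(tokens):
            return intent
    return "unknown"

tokens = nlp_tokenize("Can you check my balance?")
print(tokens)
print(nlu_intent(tokens))  # check_balance
```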

Natural Language Generation (NLG) delivers natural language output. With NLG, customers get human-like responses from virtual assistants and chatbots. The approaches and techniques used for NLG vary, and the choice of model depends on the purposes of the project and the available development resources.

4. Deep Learning 

Chatbots that operate with text-based responses only are obviously less complicated than voice assistants. However, cutting-edge text generation systems such as GPT-4 are capable of producing not only responses to basic questions but also original stories from the information they have. This happens thanks to deep learning technology.

AI assistants with deep learning algorithms continuously learn from their data and from human-to-human dialogue. They analyze existing interactions between customers and support staff, create messages and responses, and “correct” possible typos and grammatical errors.

5. Augmented Reality (AR)

Augmented reality is an amazing technology that allows us to place 3D objects in the real world for an immersive experience. AR-based mobile chatbots and AR avatars are excellent examples of this technology in use. Combined with artificial intelligence, AR virtual assistants become more convenient and more impressive at the same time.

6. Generative Adversarial Networks (GANs)

A generative adversarial network (GAN) is a machine learning (ML) model in which two neural networks compete with each other, using deep learning methods to become more accurate in their predictions. A GAN pairs a generator with a discriminator trained on real image samples, which makes it possible to generate realistic 3D faces for AI avatars and 3D assistants.

The technology has been used in video games and other products to produce true-to-life human figures. A great example of this technology is Nvidia’s Omniverse Avatar Project Maxine, which creates a photorealistic real-time animation of a human face speaking a text-to-speech sample.

7. Emotional intelligence (EI)

When we are talking about virtual assistants, body language and human emotions can play a huge role in addition to voice and visual effects. Emotional Intelligence powered by AI helps track the user’s non-verbal behavior in real-time during communication and react with that information in mind. Thanks to Emotion AI, AI virtual assistants can monitor human emotions by analyzing facial expressions, body language, or speech.

Emotion AI is based on computer vision and machine learning algorithms as well. Facial recognition technology analyzes facial expressions using the device’s camera. Computer vision algorithms detect key facial points and track their movement to interpret emotions. Then, the system interprets the feelings based on a combination of facial expressions by matching the collected data with a library of images. Modern solutions like Affectiva or Kairos can recognize emotional reactions such as joy, sadness, anger, contempt, disgust, fear, and surprise.
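The last stage of this process, matching a combination of facial cues to an emotion label, can be sketched as a small rule table. The cue names and rules below are illustrative assumptions; real systems derive such cues from tracked facial key points with trained models.

```python
# A simplified sketch of the final Emotion AI stage: mapping
# already-detected facial cues to emotion labels. The cue names
# and rules are illustrative, not a real product's taxonomy.

EMOTION_RULES = {
    frozenset({"smile", "raised_cheeks"}): "joy",
    frozenset({"frown", "lowered_brows"}): "anger",
    frozenset({"raised_brows", "open_mouth"}): "surprise",
}

def classify_emotion(cues: set[str]) -> str:
    """Return the first emotion whose full cue pattern is present."""
    for pattern, emotion in EMOTION_RULES.items():
        if pattern <= cues:  # all cues of the pattern detected
            return emotion
    return "neutral"

print(classify_emotion({"smile", "raised_cheeks", "blink"}))  # joy
print(classify_emotion({"blink"}))                            # neutral
```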

AI Virtual Assistant for Your Business: Develop from Scratch or Use a Ready-to-Use Model? 

The technical implementation of virtual assistants for business depends on the requirements of the project and the functionality of your future application. 

The fact that you want to create a smart virtual assistant with AI technology does not always mean that you will need to develop custom models and involve data science experts. The AI market is on the rise, and as in other fields of development, there are ready-made tools on the market that may be enough to solve your problems. So how do you decide?

When to Use A Ready-Made API

In our practice, the use of ready-made solutions is often justified if the client does not see AI as a core feature of their product. For example, if you are building a financial assistant app, among other things, it has to extract data from checks and enter this information into the application. In this case, an OCR module created on the basis of existing solutions will probably suit you, because it will be faster and more cost-effective.

The development of reusable AI models, called foundation models (a paradigm for building AI systems in which a model trained on a large amount of unlabeled data can be adapted to many applications), allows easy customization on top of these existing solutions. Early examples of such models, like GPT-3, BERT, or DALL-E 2, have shown what’s possible. Around the same time ChatGPT debuted, another class of neural networks, called diffusion models, made a splash. Their ability to turn text descriptions into artistic images attracted casual users, who created amazing images that went viral on social media.

Foundation models usually offer APIs and do not require a lot of data for fine-tuning, which makes them a good solution for simple AI tasks like chatbots.
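Building a chatbot on top of such an API usually boils down to wrapping the vendor call behind a small interface. A minimal sketch, with the transport injected so it can be swapped between a real HTTP client and the stub used here; the payload fields are illustrative and do not reproduce any specific vendor's API.

```python
# A sketch of wrapping a foundation-model API behind a small
# chatbot interface. call_api is injected, so the same code works
# with a real HTTP client or, as here, a stub. The payload shape
# is an assumption for illustration, not a specific vendor API.

def make_chatbot(call_api, system_prompt: str):
    def reply(user_message: str) -> str:
        payload = {
            "prompt": f"{system_prompt}\nUser: {user_message}\nAssistant:",
            "max_tokens": 128,
        }
        return call_api(payload)
    return reply

# Stub standing in for a hosted text-generation endpoint:
def fake_api(payload):
    user_line = payload["prompt"].splitlines()[1]
    return f"(model answer to: {user_line})"

bot = make_chatbot(fake_api, "You are a helpful support assistant.")
print(bot("Where is my order?"))
```

Keeping the transport injectable also makes the chatbot logic testable without network access.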

When to Use Custom AI Model Development

What you should remember is that ready-made services solve common tasks at an average level of quality, so they are not a solution if AI is the main feature of your product. The more complex the tasks you are going to entrust to AI and the more innovative your idea, the more likely it is that the capabilities of existing models will not meet your needs. This is where AI engineers who can build and train models specifically for your case will come to your aid.

It should also not be forgotten that an AI feature cannot exist separately from the IT infrastructure. When developing an AI assistant, you should understand exactly how it will interact with your users (mobile application, web). Tech experts will help you choose the best technical stack, be it Node.js, PHP, Python, or another technology, and will also provide all the necessary pieces for scaling and moving your AI feature to the server-side infrastructure, taking into account the load, number of users, etc.

Challenges of AI Virtual Assistant Development

Let’s briefly consider how challenging it can be to create a particular AI virtual assistant. This will help you understand what to prepare for while creating a particular solution for your business. 

1. Chatbots

A chatbot is the simplest type of software that can provide customers with virtual assistance services. While many consider it the easiest option, a chatbot can still really make a difference. Chatbots use natural language processing (NLP) to understand customer questions and automate responses to them based on a predefined flow. Today’s AI chatbots also utilize natural language understanding (NLU) to determine the user’s needs more accurately. Then, they use advanced AI technologies to analyze what the user is trying to accomplish.

As we have already mentioned, chatbots are based on a predefined workflow. The NLP part of the chatbot defines what kind of query a user has and then switches to the part of the flow that is relevant to the request.

At MobiDev, we work with engines like Dialogflow, Rasa, and others to build chatbots for different business domains.

For example, Rasa is an open source conversational AI platform that helps build AI assistants. However, it still requires the involvement of developers to deliver a solution that meets unique business needs. The system must be correctly configured at the initial stage, and all dialogues and transitions between the elements of the system must be thought out.

Our team also has experience in developing chatbot builders: no-code solutions for creating chatbot flows for people without coding skills. These solutions can be utilized by customer support departments to establish workflows where the chatbot either assists customers directly or seamlessly transfers them to a customer support specialist.

2. Voice assistants

Any chatbot can be transformed into a voice assistant with the help of speech-to-text and text-to-speech models.

The biggest challenge in developing voice assistant solutions lies in the fact that many regions have security regulations that prohibit browsers from tracking and processing a user’s voice without their consent. Therefore, it’s better to deliver these solutions in the form of applications. If the solution must run in the browser, it’s necessary to implement a mechanism that asks the user for consent before using their voice (a consent button).

It’s also important to consider the price of the voice assistant. If you utilize ready-to-use text-to-speech services from Google or Amazon, you will be billed based on the number of characters sent to the service to be synthesized into audio each month. You must enable billing to use text-to-speech and will be automatically charged if your usage exceeds the number of free characters allowed per month. This can be very expensive, especially if it is difficult to predict how many characters you will use per month and how often and how long your users will interact with the assistant.
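A back-of-the-envelope estimate makes the character-based pricing concrete. The per-million-character rate and the free tier below are assumptions for illustration only; always check your provider's current pricing page.

```python
# Rough monthly TTS cost estimate under character-based billing.
# The rate and free tier are assumed values for illustration,
# not a quote of any provider's actual pricing.

PRICE_PER_MILLION_CHARS = 16.00   # USD, assumed rate
FREE_CHARS_PER_MONTH = 1_000_000  # assumed free tier

def monthly_tts_cost(users: int, sessions_per_user: int,
                     chars_per_session: int) -> float:
    total_chars = users * sessions_per_user * chars_per_session
    billable = max(0, total_chars - FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * PRICE_PER_MILLION_CHARS

# 5,000 users, 10 sessions each, ~400 synthesized characters per session:
print(f"${monthly_tts_cost(5_000, 10, 400):.2f}")  # $304.00
```

Even modest per-session usage multiplies quickly across a user base, which is why usage forecasting matters before committing to a hosted service.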

Of course, you can build your own text-to-speech models, but this requires powerful hardware to run them.

Another possible challenge appears if you consider multi-lingual support, as each model supports a particular set of languages. It’s essential to keep that aspect in mind during the planning stage if you want to scale up and add new languages in the future.

3. AI avatars

AI avatars are the most exciting, but also the most difficult, type of AI virtual assistant yet. Human-like virtual companions like NEON Artificial Humans look breathtaking, but the development of such solutions is incredibly difficult and requires a lot of investment.

Let’s take a closer look at the challenges of developing AI avatars and their alternatives.

1. Design and Animation

When creating AI avatars, there are significant investments in design. After all, for the avatar to look realistic, it is necessary to create a 3D character and fully animate every position (facial expressions, head turns, body movements, etc.).

The functionality of your avatar imposes additional design requirements. You have to create all the motion sequences and draw each scenario, which is a huge task for designers, especially when it comes to a human-like avatar.

2. Lip-Sync with Avatar Animations

If you want your avatar to communicate with your users in real-time using voice, you should understand what is behind the technical implementation of this function as well as its design. Available open-source solutions are intended for generating video asynchronously after the text (phrase) is loaded, and it can take up to several minutes to receive visual feedback.

You can use partial animation of a person, for example, utilizing Microsoft Azure Neural Text-to-Speech to animate the lips of an avatar. But you should remember the uncanny valley effect, which creates a creepy feeling in users when they see a super-realistic character that still does not mimic a person accurately enough. It’s necessary to gather datasets and train models for better animation and synchronization.

Another option is to completely abandon the idea of lip-synced avatar animations when audio is played. We can create a graphically attractive application interface, for example, design avatars that appear to type or voice something. This will significantly reduce the costs of design development and software implementation. This step can be a starting point, and as the project develops, it will be possible to move on to more advanced animation.

3. Rendering

One of the key points that significantly affects the price of project implementation is the functionality of the avatar. This functionality doesn’t add direct business value, yet it is the most expensive part from the development and support perspective (the cost of servers that render avatars). Therefore, we recommend looking at options where the avatar carries only the design load or is rendered on the client side in the most simplified version.

4. Development

There are several ready-made services that can help you create an AI avatar for your business, like UneeQ or the D-ID Live Streaming API, but the licenses they provide are very expensive for startups. NVIDIA plans to provide access to its avatar creation platform in the near future as well, but we can’t yet predict the pricing of this solution.

Of course, you can create a custom avatar from scratch using existing technologies. For example, Google created a series of Generative AI templates to showcase how combining Large Language Models with existing Google APIs and technologies can help in creating Talking Character 3D avatars. However, you still need to create your own ML model, which takes time to develop and train. It should also be taken into account that, in the case of creating a custom ML model for your avatar, an additional phase of data collection will be needed to train it.

Let’s imagine that we have decided on the solution and can move to the next step: suppose you chose a character with lip-synced animation. What’s next? To build the interaction between human users and an avatar, we need to implement a backend to control it. Here, several challenges must be solved:

  • User speech to text. We need to record what the user said, clean the recording of noise, and convert the speech to text.
  • Chatbot. We need a smart chatbot that generates text answers to the user’s query and remembers the context of previous exchanges. The chatbot should also present a character with configurable features.
  • Chatbot output must be converted to speech audio with the appropriate voice. Ideally, the voice tone should be configurable and correspond to the emotions of the phrase.
  • Voice audio files must be lip-synced with avatar animations exactly when audio is played back.
  • The avatar must have some default animations with natural movements when it is not speaking. 
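The backend described in the list above is essentially a four-stage pipeline: STT, chatbot, TTS, lip-sync. A sketch of the orchestration, with every stage stubbed so the data flow is visible end to end; in a real system each stub would wrap the corresponding service.

```python
# A sketch of the avatar backend pipeline, with each stage
# stubbed. In a real system the stubs would wrap an STT service,
# a chatbot engine, a TTS service, and a lip-sync/animation
# service respectively.

def speech_to_text(audio: bytes) -> str:
    return audio.decode()  # stub: pretend the audio is its transcript

def chatbot_reply(text: str, history: list[str]) -> str:
    history.append(text)  # remember the context of previous exchanges
    return f"Answer to '{text}' (turn {len(history)})"

def text_to_speech(text: str) -> bytes:
    return text.encode()  # stub: pretend the reply text is the audio

def lip_sync(audio: bytes) -> dict:
    # stub: a real service would return viseme timings so the
    # avatar's mouth moves exactly when the audio plays back
    return {"audio": audio, "visemes": ["rest"] * len(audio)}

def handle_user_turn(audio: bytes, history: list[str]) -> dict:
    text = speech_to_text(audio)
    reply = chatbot_reply(text, history)
    speech = text_to_speech(reply)
    return lip_sync(speech)

history: list[str] = []
frame = handle_user_turn(b"hello avatar", history)
print(frame["audio"].decode())
```

Keeping the stages behind plain function boundaries like this makes it easy to swap any one of them for a managed service later without touching the rest of the pipeline.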

Therefore, with a limited budget for 3D design and development, it is better to set realistic requirements for this type of task and start with small steps, gradually improving your avatar. 

Here are our tips:

  1. Use characters that are not human-like, for example, animals, robots, etc., or at least talking heads, which take less effort to design and animate. To diversify the characters, we can add micro-animations (shoulder movement, micro-turning of the head, etc.).
  2. If you aren’t ready to invest in gathering datasets, you can create an avatar using ready-to-use models and, instead of lip-sync, implement synchronization of the animations with the emotional tone of the conversation. For example, your avatar can change color or brightness depending on the emotional background, like a fire animation that becomes brighter when something good is happening and fades in response to sad messages.
  3. It can be a wise decision to give up the idea of synchronizing the avatar with the dialogue. For example, you can use a particular gaming engine to create a 2D or 3D avatar that is visible to the user, while the interaction with the user is handled by a chatbot. This can serve as a starting point for your project, with a plan for further development in the future.

When planning the creation of an AI avatar, you should take into account the costs of the work of a 3D designer, the cost of supporting one user, and the cost of developing and integrating third-party tools. 

Let MobiDev Help You Create an AI Assistant App for Your Business

To conclude, while thinking about having an AI assistant for your business, it’s crucial to define your goals, choose the right AI platform, develop the AI logic, train the system, design the user interface, build and test the assistant, and deploy it.

The extensive technological experience of MobiDev and our in-house AI lab allows us to handle both the AI part of development and the infrastructure for its operation. This approach helps to build the most effective solution based on both your business tasks and the cost of development.

Our team implements a complex approach and doesn’t just develop software. We research your business domain in detail and help refine your idea to obtain a solution that will really benefit your business and allow it to scale in the future. If ready-made solutions that meet your needs are available, we will integrate them; if not, we will create custom solutions based on the years of experience and top-notch skills of our developers.

So, if you want to create a smart chatbot or a more advanced virtual assistant and are looking for experienced AI developers, feel free to drop us a message or book a call with a MobiDev representative.
