Testing AI Applications: Best Practices and a Case Study

As experts in building AI applications, we draw customers’ attention to the peculiarities of their development. Quality assurance of such products has its own specifics, and I will explain them in this article. I will cover AI application testing, its challenges, possible edge cases, and best practices. You will also get an idea of how we test the AI systems we develop.

AI App Testing: Types, Tools, Differences

Testing AI systems is specific in many respects: the types of testing used, the toolkits, the way QA engineers are involved in work processes, and so on. Let’s consider the most significant points one by one.

TYPES OF TESTING FOR AI APPLICATIONS

AI applications are usually understood as a wide range of software products, from facial recognition and clinical diagnosis assistance to recommendation systems and AI-powered apps for identifying unknown threats. There is no common, unified solution for AI app testing, because such software is based on distinct algorithms and models and has its own specific features. Besides that, AI is only one part of an application and has to be integrated with other components. That’s why every project requires its own testing approach: app architectures vary, and there is no standard way to build and test AI apps. It is advisable to combine traditional QA methods with AI-specific, specialized testing techniques.

We can distinguish the following 6 key types of testing that have proven valuable for AI-based products:

1. Functional testing for checking the core functionality. Traditionally, QA specialists verify via functional testing that the software operates according to the functional requirements. For AI apps, functional testing includes ensuring that the core AI algorithms and logic produce the expected outcomes in various scenarios.

2. Usability testing helps to evaluate the user-friendliness of the app and the ease and convenience of interacting with it. For example, for conversational AI, usability testing includes assessing the conversational user experience, “human” or “natural” language understanding, and error handling.

3. Integration testing is usually about checking the correctness of the “teamwork” of different integrated components. For AI-powered apps, integration testing ensures seamless integration of AI models with other software components, databases, and external APIs.  

4. Performance testing is a good practice for assessing a software system’s responsiveness, scalability, and resource usage. It goes beyond the core app’s functionality. For AI, this testing type takes a specific form and involves measuring overall model performance, response times, throughput, and other key performance metrics. Performance testing of AI applications brings considerable benefits: it helps to understand, optimize, and improve the app’s AI capabilities and data processing under realistic conditions, in line with the desired results and user expectations.

5. API testing focuses on verifying the functionality of application programming interfaces, including testing of individual methods and the interactions between them. For AI-driven systems, API testing may also verify the endpoints for AI services, data input/output, and the API response format (see the sketch after this list).

6. Security testing aims to prevent leakage of data processed by the AI model, as well as model configuration and system information. Also, security testing helps to avert misuse of AI-based application capabilities.
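
To give an idea of what the API checks from point 5 can look like in practice, here is a minimal, hypothetical sketch: it assumes an AI service exposing a /predict REST endpoint and verifies the response status, schema, and value ranges. The URL, field names, and labels are illustrative, not taken from a specific project.

```python
import requests

API_URL = "http://localhost:8000/predict"  # hypothetical AI service endpoint


def test_predict_endpoint_contract():
    """Verify status code, response schema, and output ranges of an AI endpoint."""
    payload = {"text": "Hello, how fast is 5G?"}
    response = requests.post(API_URL, json=payload, timeout=10)

    # The endpoint should answer successfully and return JSON
    assert response.status_code == 200
    body = response.json()

    # Schema check: the fields that clients rely on must be present
    assert "label" in body and "confidence" in body

    # Value checks: confidence is a probability, label comes from a known set
    assert 0.0 <= body["confidence"] <= 1.0
    assert body["label"] in {"greeting", "question", "other"}
```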

These are the types of testing whose skillful application has the greatest impact on the quality assurance of AI software. At the same time, other types of testing can also be used for AI-based applications, depending on their specifics and requirements.

SPECIFIC TOOLS FOR TESTING AI

Product owners do not need to delve into testing tools in detail, as our project teams take care of all technical issues. However, I will introduce you to a list of tools we have found helpful and that we like to use when testing AI apps.

There is no single, all-encompassing tool for testing AI-based systems. Each project has its specifics and, therefore, requires a unique testing approach. On the other hand, several general-purpose frameworks and libraries can serve as the core of an AI model testing setup in many cases.

For example, TFX (TensorFlow Extended) is suitable for testing purposes such as data validation and processing, model analysis and training, model performance, etc. This platform can also be used to build recommendation engines by processing vast datasets, training recommendation models, and offering personalized recommendations to users. TFX can handle image data, allowing the training and serving of computer vision models for tasks like object detection, image classification, and facial recognition.
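
As an illustration, a minimal data-validation sketch with TensorFlow Data Validation (part of the TFX ecosystem) might look like the following; the CSV paths are placeholders rather than files from a real project.

```python
import tensorflow_data_validation as tfdv

# Compute descriptive statistics over the training data (path is a placeholder)
train_stats = tfdv.generate_statistics_from_csv(data_location="data/train.csv")

# Infer a schema (expected types, domains, presence) from those statistics
schema = tfdv.infer_schema(statistics=train_stats)

# Compute statistics for new serving data and compare it against the schema
serving_stats = tfdv.generate_statistics_from_csv(data_location="data/serving.csv")
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)

# Reported anomalies (missing columns, unexpected values, drift) flag
# data issues before they silently degrade the model
tfdv.display_anomalies(anomalies)
```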

Among similar frameworks and libraries for testing, it is worth mentioning Scikit-learn, PyTorch’s torch.testing module, and others. Additionally, there are specialized tools and libraries for specific types of AI testing, such as FairML for bias and fairness testing or TensorFlow Model Analysis for model evaluation.
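
For instance, torch.testing’s assert_close can replace exact-equality assertions on floating-point model outputs. The toy model below is purely illustrative and stands in for a real network under test.

```python
import torch
import torch.testing

# A toy model standing in for the real network under test
model = torch.nn.Linear(4, 2)
model.eval()

x = torch.randn(8, 4)
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)  # e.g. the same model after export, quantization, or refactoring

# assert_close fails with a readable report if the outputs diverge beyond tolerance,
# which suits floating-point model outputs better than exact equality
torch.testing.assert_close(out_a, out_b, rtol=1e-4, atol=1e-5)
```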

The landscape of AI testing tools is continuously evolving. There are at least several reasons why it may seem that there are not enough specific tools for testing AI models: rapid advancements in AI, the variety of use cases, tool development complexity, etc. Due to the diverse nature of AI applications and the rapid pace of innovation, our QA engineers stay updated on the available tools and choose the ones that best suit product-specific testing needs.

TWO SIGNIFICANT DIFFERENCES IN AI SOFTWARE TESTING

Testing is a crucial phase in any data-driven project, especially when it comes to artificial intelligence apps. Such projects involve both AI model testing and the testing of the whole software product with an integrated AI module. In addition, not all testing approaches that are effective for non-AI software are suitable for testing AI-based products.

When planning how to test AI and ML applications, you should consider the following:

  1. Testing AI systems is ongoing. As a rule, released software without AI (let’s call it traditional) needs to be tested again only after changes are made. With AI applications, things are different. The existing AI model must be continuously retrained, including to prevent AI degradation, and the software must be adjusted to new data and inputs. Testing helps assess how well the model can work with new, unseen data. Therefore, the need to retest AI apps is constant.
  2. Testing is essential for obtaining quality data for an AI model. Unlike traditional software, AI apps are not deterministic: such systems may behave differently with the same inputs. Quality assurance in this context is more than just ensuring bug-free code. It should also cover data quality, the accuracy of AI predictions, chatbot response reliability, and model robustness.
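
Because of that non-determinism, assertions about AI behavior are usually statistical rather than exact. Below is a minimal sketch of such a check; the predict function is a hypothetical stand-in for real model inference, and the accuracy threshold is illustrative.

```python
from sklearn.metrics import accuracy_score


def predict(texts):
    """Placeholder for the real model inference call (hypothetical)."""
    ...


def test_model_accuracy_on_holdout(holdout_texts, holdout_labels):
    # Instead of asserting exact outputs, require a minimum quality level
    # on a representative, unseen dataset
    predictions = predict(holdout_texts)
    accuracy = accuracy_score(holdout_labels, predictions)
    assert accuracy >= 0.90, f"Model accuracy dropped to {accuracy:.2f}"
```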

In particular, a dataset suitable for decision-making analysis should be unbiased, that is, representative, covering all outcomes that may occur. Testing is what makes the dataset representative of real-life scenarios. Below, we will consider the role of testing in preparing data for AI models using one of our projects as an example.

How to Test AI Applications: Highlights

Let’s consider the specifics of testing AI systems through a case study.

Case Study: Interactive App with a 3D CG Character

Product description: a 3D CG (computer-generated) character (a virtual human, or avatar) able to interact with people. The purpose of the system is to promote the 5G network.

Project goal: Creating a 3D CG character driven by artificial intelligence. The app is designed not only for entertainment, but as a guide-demonstrator for the future of communication powered by the potential, quality, and speed of 5G networks.

The project’s tech stack:

  • Front-end: JS
  • Back-end: Python, Unreal Engine server, REST API
  • Kiosk-side: Android device with a camera, external microphone, TV screen, and external loudspeakers.

Tech stack of the AI part of the project:

  • Human detection approach: MediaPipe Platform
  • Speech recognition approach: NeMo Model for speech recognition, Neuspell as a spelling correction model
  • Natural Language Understanding: RASA
  • Voice recognition approach: Dialogflow – conversational AI

Details about hardware: The application has a hardware part, which we named the “Kiosk setup”.

The kiosk app was installed on an Android device equipped with a camera for human detection and gesture recognition; this is where the visual data came from. An external microphone connected to the Android device captured the user’s speech. The recorded audio was sent to the back end for voice and speech recognition, as well as spell checking, so that the character could identify the user’s intention and react to it. A TV screen and loudspeakers were also connected to the device; this is where the 3D CG character came to life and interacted with the user.

QA team participation: The QA team provided end-to-end testing (we did not test the AI models themselves but the system as a whole) and was involved in dataset hotfixes.

I will not fully describe the entire testing process but will only highlight a few features characteristic of testing AI systems.

DATA SOURCING SPECIFICS FOR AI APPLICATIONS

Sourcing, gathering, and preparing data for AI apps requires the active involvement of testing specialists. The nature of the data depends on the app’s goal: it may be numbers, specific text, images, or sounds.

In the realm of AI, the human body itself serves QA engineers as a rich source of test data for various interactions with AI models. These interactions span human detection, voice and speech analysis, gestures, and emotion recognition. In particular, we analyzed the following aspects:

1. The system’s ability to detect a human, handle background interference, and cope with multiple people in its field of view

We learned how noticeably lighting affects the accuracy of the system. Low-light environments degrade its abilities, so effective operation in the dark is only possible with integrated infrared (IR) cameras. Segmenting people is a challenge for the system, especially when they are close to each other: it can identify and focus on the “wrong” person who is not going to interact with it.

Also, the necessary details for accurate recognition may not be visible in the image because of the long distance, leading to reduced accuracy. A cluttered or dynamic background, such as a busy street or moving objects, can make it challenging for the system to isolate and recognize a person accurately. Therefore, in our case, we aimed to make the system mitigate this issue and focus on the human body.

The QA team provided photos with different gestures, at different distances from the camera, and gave them to AI developers, who additionally trained models on this data. By training the models with real people making genuine gestures, the system learns to recognize natural movements and expressions, making the interactions with the 3D CG character more authentic and engaging.
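
A simplified sketch of how such photo sets can be checked automatically is shown below; it assumes the test images are grouped into folders by condition (lighting, distance) and uses MediaPipe’s Pose solution as the detector. The folder names are hypothetical.

```python
import os

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose


def detection_rate(image_dir):
    """Share of images in a folder where a human pose is detected at all."""
    detected, total = 0, 0
    with mp_pose.Pose(static_image_mode=True) as pose:
        for name in os.listdir(image_dir):
            image = cv2.imread(os.path.join(image_dir, name))
            if image is None:
                continue
            results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            total += 1
            detected += results.pose_landmarks is not None
    return detected / total if total else 0.0


# Compare the conditions that testing showed to be problematic (folder names are illustrative)
for condition in ("bright_close", "bright_far", "low_light_close", "low_light_far"):
    print(condition, detection_rate(f"test_images/{condition}"))
```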

2. Speech recognition and the avatar behavior it triggers

Among other tasks, we had to ensure the interactivity of the conversation, so it was necessary to reduce the impact of a noisy environment. After the greeting, the CG character had to react and continue the conversation by asking questions. We measured the effective distance between the person and the external microphone to determine at what distance the system can capture a quality voice recording and recognize the necessary details.

For the noise challenge, we used two solutions: implementing noise cancellation algorithms and choosing an external microphone with the appropriate directionality.

Our idea worked. Directional microphones focus on sound from specific directions and reduce noise from the sides and rear. This enhanced the clarity of the user’s speech and the audio recording quality. On the other hand, implementing the noise cancellation algorithms helped to minimize ambient noise interference. All this optimized our results.
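
To quantify whether the microphone choice and noise cancellation actually improved recognition, transcription quality can be expressed as a word error rate (WER). Here is a minimal sketch using the jiwer library; the reference phrases and transcripts are placeholders.

```python
from jiwer import wer

# Reference phrases spoken by testers (placeholders)
reference = [
    "hello can you tell me about the speed of 5g",
    "what can i do with a 5g network",
]

# Transcripts returned by the speech recognition pipeline under two setups
hypothesis_omnidirectional = [
    "hello can you tell me about the speed of g",
    "what can i do with a g network",
]
hypothesis_directional = [
    "hello can you tell me about the speed of 5g",
    "what can i do with a 5g network",
]

# Lower WER means better recognition; comparing setups shows whether a change helped
print("omnidirectional mic WER:", wer(reference, hypothesis_omnidirectional))
print("directional mic WER:   ", wer(reference, hypothesis_directional))
```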

EDGE CASES FOR AI APP TESTING

The complexities of human interactions with AI models often unveil a wide array of edge cases that AI engineers and QA teams must carefully consider. Let’s move on to how we approached the design of edge cases, which help to identify unexpected system behavior, for apps based on speech-to-text and hybrid AI models.

I mentioned earlier that testing helps to obtain a representative dataset that reflects real-life scenarios. We do this by designing test cases based on functional and non-functional requirements. However, the most essential and valuable part is the smart design of edge cases that expose unexpected system behavior.

After that, we can analyze all the gathered testing results and dependencies. In case of detected issues and “data garbage”, we enhance (update) the test data or even training data. 

This data collection and validation process may take a couple of iterations until the data quality improves. We always remember the principle of “garbage in, garbage out”. The better the data quality, the better the AI system will be. 

We should get as much information as possible about our product to identify the project needs and design the proper edge cases. 

First of all, it is about:

  • The technical stack and its possible weaknesses. In particular, it is essential to know what models and algorithms are at the core of the developed system.
  • Possible usage environment and conditions, hardware requirements, or system setup (if necessary). A similar usage environment and conditions will be required for testing.
  • Software quality criteria and customer expectations. They should be clear for both parties to avoid misunderstandings.

Thus, we design the cases in such a way as to ensure that the algorithm and the data can provide accurate, stable results. Such results must meet approved requirements and client expectations. If these conditions are met, we can consider the developed cases to be relevant to the real world.

Let us start with conversational AI. One prominent interaction domain involves speech recognition AI models. Such models aim to convert spoken language into text. AI engineers commonly use this technology in applications like voice assistants and transcription services. 

However, this seemingly straightforward task can become intricate when faced with edge cases such as:

  • Noise Challenge: AI models must grapple with noisy environments, filtering out unwanted sounds to accurately transcribe speech.
  • Inappropriate Distance: Variations in the distance between the person and the camera-microphone setup, for example, can affect the quality of audio input, leading to misinterpretations.
  • Limited Training Data: When AI models have access to a limited dataset for training, their ability to recognize different accents, dialects, and speaking styles may be compromised.
  • Varied Lighting Conditions: Changes in lighting can affect video quality, impacting lip-reading systems that rely on visual cues for speech recognition.
  • Gender and Pronunciation: The need to distinguish between male and female voices and to handle multiple ways of pronouncing words adds complexity to speech-to-text models.
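
One practical way to keep such edge cases from being forgotten is to encode them as parametrized tests. The sketch below assumes a hypothetical transcribe function wrapping the speech-to-text pipeline and pre-recorded audio samples per condition; the file names and WER thresholds are illustrative.

```python
import pytest
from jiwer import wer


def transcribe(audio_path):
    """Placeholder for the real speech-to-text pipeline (hypothetical)."""
    ...


# Each edge case gets its own recorded sample and an acceptable WER ceiling
EDGE_CASES = [
    ("samples/clean_speech.wav",  "turn on the 5g demo", 0.10),
    ("samples/street_noise.wav",  "turn on the 5g demo", 0.25),
    ("samples/far_from_mic.wav",  "turn on the 5g demo", 0.25),
    ("samples/strong_accent.wav", "turn on the 5g demo", 0.30),
]


@pytest.mark.parametrize("audio_path, expected_text, max_wer", EDGE_CASES)
def test_speech_to_text_edge_cases(audio_path, expected_text, max_wer):
    hypothesis = transcribe(audio_path)
    assert wer(expected_text, hypothesis) <= max_wer
```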

Beyond voice, hybrid AI models engage in human detection, gesture analysis, and emotion recognition, often relying on data from the human body itself. These interactions introduce their unique set of challenges, including:

  • Human Detection: Robust identification of humans amidst varying backgrounds, postures, and occlusions demands sophisticated AI algorithms.
  • Gesture Interpretation: Understanding the meaning behind different gestures and body movements requires precise training and model fine-tuning.
  • Emotion Recognition: Discerning emotions from facial expressions and body language needs a nuanced understanding of human behavior and culture.

It is clear by now that the large number of edge cases stems from the complexity of real-world scenarios. Nevertheless, AI engineers and QA experts continually refine AI models to excel in these interactions, addressing edge cases with creativity and precision.

AI CHATBOT TESTING

I also want to touch on the testing of chatbots, which are fairly common. It may seem that all chatbots are a type of conversational AI, but that is not always true. Chatbots can be scripted, in which case they are not able to learn, or they can be artificially intelligent and improve their capabilities after every interaction with the user. Chatbots of both types are based on NLP technology and can understand words and the meaning of text. Here are some edge cases:

  • Unclear intent: Users provide queries that do not align with any predefined intent. For instance, ambiguous questions or statements that could pertain to multiple intents. Check out how the chatbot handles this case.
  • Too many ways to say “hello”: Users can express the same intent using different phrases and language variations. Make sure that the scripted chatbot or even AI chatbot can work with them and provide the appropriate response.
  • Jumping between topics: Users switch intents within a conversational context. Check if the system can manage this.
  • Working with negative statements: The chatbot should be able to interpret negative user intent in the right way and “react” according to this factor.
  • User inputs beyond the chatbot’s scope or capability: Make sure the chatbot can handle data not related to its mission and responds to such requests with a polite refusal.
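
Since our project relied on RASA for natural language understanding, such checks can be scripted against a running RASA server’s HTTP API. The sketch below is a hedged example: the server URL, utterances, and intent names are illustrative and depend on the bot’s training data.

```python
import requests

RASA_PARSE_URL = "http://localhost:5005/model/parse"  # local RASA server (illustrative)

# Different phrasings that should all resolve to the same "greet" intent
GREETING_VARIANTS = ["hello", "hi there", "hey!", "good morning"]


def parse_intent(text):
    """Ask the RASA NLU pipeline which intent it assigns to a phrase."""
    response = requests.post(RASA_PARSE_URL, json={"text": text}, timeout=10)
    response.raise_for_status()
    return response.json()["intent"]["name"]


def test_many_ways_to_say_hello():
    for phrase in GREETING_VARIANTS:
        assert parse_intent(phrase) == "greet"


def test_out_of_scope_input_falls_back():
    # Requests outside the bot's mission should hit a fallback intent,
    # not be force-fitted into an unrelated one
    assert parse_intent("order me a pizza") in {"nlu_fallback", "out_of_scope"}
```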

The specificity of AI affects most aspects of testing, including the creation of scenarios and test environments as close as possible to real life.

AI Application Testing by MobiDev 

Thus, through the given examples, you have seen the differences in AI system testing and the role of QA specialists in such projects.

MobiDev quality assurance professionals have a rich and varied experience in testing AI applications. We constantly improve our QA methods and tools, carefully selecting them for the specifics of each software. Get in touch with our experts to be sure of the high-quality development and testing of your AI apps.
