GPU for Deep Learning: Benefits & Drawbacks of On-Premises vs Cloud
As technology advances and more organizations are implementing machine learning operations (MLOps), people are looking for ways to speed up processes. This is especially true for organizations working with deep learning (DL) processes which can be incredibly lengthy to run. You can speed up this process by using graphical processing units (GPUs) on-premises or in the cloud.
GPUs are processors designed to execute many operations in parallel. This parallelism can be exploited to significantly accelerate artificial intelligence and deep learning workloads.
A GPU is a powerful tool for speeding up a data pipeline built around a deep neural network (DNN). The first reason to use a GPU is that DNN inference can run up to 10 times faster on a GPU than on a central processing unit (CPU) at the same price point. The second is that offloading work from the CPU frees it to do more on the same instance, reducing overall load on the rest of the system.
A typical deep learning pipeline with a GPU consists of:
- Data preprocessing (CPU)
- DNN execution: training or inference (GPU)
- Data post-processing (CPU)
Data transfer between CPU RAM and GPU DRAM is the most common bottleneck. There are therefore two main aims when designing the pipeline architecture. The first is to reduce the number of transfer transactions by aggregating several samples (for example, images) into a batch. The second is to reduce the size of each sample by filtering the data before transferring it.
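The pipeline stages above can be sketched in PyTorch. This is a minimal illustration, not production code: the model is a stand-in single layer, and the shapes are arbitrary. Note how preprocessing aggregates samples into one batch so the CPU-to-GPU transfer happens once per batch rather than once per sample.

```python
# Minimal sketch of the CPU -> GPU -> CPU pipeline described above.
# The "model" is a toy single linear layer standing in for a real DNN.
import torch

# Fall back to CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(64, 10).to(device)

def preprocess(raw_samples):
    # CPU stage: normalize and aggregate samples into a single batch
    # tensor, so the CPU->GPU transfer is one transaction per batch.
    batch = torch.stack(raw_samples)
    return (batch - batch.mean()) / (batch.std() + 1e-8)

def infer(batch):
    # GPU stage: one transfer in, run the network, one transfer out.
    with torch.no_grad():
        return model(batch.to(device)).cpu()

def postprocess(logits):
    # CPU stage: turn raw network outputs into class predictions.
    return logits.argmax(dim=1).tolist()

raw = [torch.randn(64) for _ in range(32)]  # 32 fake samples
preds = postprocess(infer(preprocess(raw)))
print(len(preds))  # one prediction per sample in the batch
```

In a real pipeline, the preprocessing stage would also filter the data (for example, resizing or cropping images) to shrink each sample before it crosses the CPU-GPU boundary.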
Training and implementing DL models requires the use of deep neural networks (DNN) and datasets with hundreds of thousands of data points. These networks require significant resources, including memory, storage, and processing power. While central processing units (CPUs) can provide this power, the use of graphical processing units (GPUs) can substantially speed up the process.
Main benefits of using GPU for deep learning
- The number of cores—GPUs can have a large number of cores, can be clustered, and can be combined with CPUs. This enables you to significantly increase processing power.
- Higher memory bandwidth—GPUs can offer much higher memory bandwidth than CPUs (up to 750GB/s vs 50GB/s). This enables you to more easily handle the large amounts of data that deep learning requires.
- Flexibility—GPUs' capacity for parallelism enables you to combine GPUs in clusters and distribute tasks across the cluster. Alternatively, you can use GPUs individually, with each one assigned to training a separate algorithm.
When using GPU for deep learning tasks is not rational
GPUs are much faster than CPUs, and for many AI applications they are a must-have. In some cases, however, a GPU is overkill, and you can save budget by using CPUs, at least temporarily.
A few words about the cost of GPU computation: as mentioned above, GPUs are significantly faster than CPUs, but the added cost may outweigh the speed you gain by switching to a GPU.
So, at the beginning of development, for example while building a proof of concept (PoC) or a minimum viable product (MVP), you can use CPUs for development, testing, and staging servers. If your users can tolerate long response times, you can even use CPUs for production servers, but only for a short period of time.
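One simple way to apply this advice is to select the device based on the deployment stage. The sketch below assumes an environment variable named APP_STAGE, which is purely illustrative; any configuration mechanism would work.

```python
# Illustrative stage-based device selection: dev/test/staging stay on
# CPU to save budget; production uses CUDA only if a GPU is present.
# The APP_STAGE variable name is an assumption for this sketch.
import os
import torch

def pick_device(stage: str) -> torch.device:
    if stage == "production" and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

stage = os.environ.get("APP_STAGE", "development")
device = pick_device(stage)
print(device.type)
```

This keeps the rest of the code device-agnostic: models and tensors are simply moved with `.to(device)`, so switching from CPU to GPU later requires no code changes.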
On-Premises GPU Options for Deep Learning
When using GPUs for on-premises implementations you have multiple vendor options. Two popular choices are NVIDIA and AMD.
NVIDIA is a popular option at least in part because of the libraries it provides, known as the CUDA toolkit. These libraries make it easy to set up deep learning processes and form the base of a strong machine learning community around NVIDIA products. This can be seen in the widespread support that many DL libraries and frameworks provide for NVIDIA hardware.
In addition to GPUs, the company also offers libraries supporting popular DL frameworks, including PyTorch. The Apex library in particular is useful and includes several fused, fast optimizers, such as FusedAdam.
The downside of NVIDIA is that it has recently placed licensing restrictions on how its software can be used. For data center deployments, these restrictions effectively require the Tesla GPUs and rule out the less expensive RTX or GTX hardware.
This has serious budget implications for organizations training DL models. It is especially problematic when you consider that although Tesla GPUs do not offer significantly more performance than the other options, they can cost up to 10x as much.
AMD provides libraries, known as ROCm. These libraries are supported by TensorFlow and PyTorch as well as all major network architectures. However, support for the development of new networks is limited as is community support.
Another issue is that AMD does not invest as much into its deep learning software as NVIDIA. Because of this, AMD GPUs provide limited functionality in comparison to NVIDIA outside of their lower price points.
Cloud Computing with GPUs
An option that is growing in popularity with organizations training DL models is the use of cloud resources. These resources can provide pay-for-use access to GPUs in combination with optimized machine learning services. All three major providers offer GPU resources along with a host of configuration options.
Microsoft Azure
Microsoft Azure offers a variety of instance options with GPU access. These instances are optimized for compute-intensive tasks, including visualization, simulation, and deep learning.
Within Azure, there are three main series of instances you can choose from:
- NC-series—instances optimized for network and compute-intensive workloads, for example, CUDA and OpenCL-based simulations and applications. Instances provide high performance based on NVIDIA Tesla V100 GPUs paired with Intel Haswell or Intel Broadwell processors.
- ND-series—instances optimized for deep learning inference and training scenarios. Instances provide access to NVIDIA Tesla P40 GPUs paired with Intel Broadwell or Intel Skylake processors.
- NV-series—instances optimized for virtual desktop infrastructure, streaming, encoding, and visualization, with support for DirectX and OpenGL. Instances provide access to NVIDIA Tesla M60 or AMD Radeon Instinct MI25 GPUs.
Amazon Web Services (AWS)
In AWS you can choose from four different options, each with a variety of instance sizes. Options include EC2 P3, P2, G4, and G3 instances. These options enable you to choose between NVIDIA Tesla V100, K80, T4 Tensor Core, or M60 GPUs. You can scale up to 16 GPUs depending on the instance.
To enhance these instances, AWS also offers Amazon Elastic Graphics, a service that enables you to attach low-cost GPU options to your EC2 instances. This enables you to use GPUs with any compatible instance as needed. This service provides greater flexibility for your workloads. Elastic Graphics provides support for OpenGL 4.3 and can offer up to 8GB of graphics memory.
Google Cloud
Rather than offering dedicated GPU instances, Google Cloud enables you to attach GPUs to your existing instances. For example, if you are using Google Kubernetes Engine you can create node pools with access to a range of GPUs. These include NVIDIA Tesla K80, P100, P4, V100, and T4 GPUs.
Google Cloud also offers the Tensor Processing Unit (TPU). This is a custom chip designed for fast matrix multiplication rather than a GPU. It provides performance comparable to Tesla V100 instances with Tensor Cores enabled, and it can provide cost savings through parallelization.
Each cloud TPU combines four chips, offering performance roughly equivalent to four GPUs and enabling comparatively larger deployments. Additionally, TPUs are now at least partially supported by PyTorch.
What is the Best GPU for Deep Learning Tasks in 2021?
When the time comes to choose your infrastructure you need to decide between an on-premises and a cloud approach. Cloud resources can significantly lower the financial barrier to building a DL infrastructure.
These services can also provide scalability and provider support. However, these infrastructures are best for short-term projects, since consistent resource use can cause costs to balloon.
In contrast, on-premises infrastructures are more expensive upfront but provide you with greater flexibility. You can use your hardware for as many experiments as you want over as long a period as you want with stable costs. You also retain full control over your configurations, security, and data.
For organizations that are just getting started, cloud infrastructures make more sense. These deployments enable you to start running with minimal upfront investment and give you time to refine your processes and requirements. However, once your operation grows large enough, switching to on-premises infrastructure may be the better choice.
Using GPU for AI training by MobiDev
Our AI team has substantial computational resources at its disposal, including a set of V100 GPUs. All of these GPUs are accessible via our internal computation service.
The computation service is simply a Linux machine with a large amount of disk space, RAM, and several GPUs installed. We use it for training AI solutions and for research purposes.
Deep learning frameworks like TensorFlow, PyTorch, or ONNX cannot access GPU cores directly to solve deep learning problems. Between an AI application and the GPU sit several complex software layers, such as CUDA and the GPU drivers.
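You can inspect these layers from the framework side. The short sketch below uses PyTorch's standard introspection attributes to report which framework build is installed, which CUDA version it was compiled against, and whether a driver and GPU are actually visible at runtime.

```python
# Inspect the software layers between the framework and the GPU:
# framework version, the CUDA version it was built against, and
# whether a driver/GPU is visible at runtime.
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)  # None on CPU-only builds
print("GPU visible:", torch.cuda.is_available())  # False if driver/GPU missing
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```

A mismatch between the CUDA version the framework was built against and the version installed on the machine is exactly the kind of incompatibility described below.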
In an oversimplified schema it can be shown as follows.
This schema seems solid and robust, even when a team of AI engineers shares a computation service built this way.
But in real-world AI application development, new versions of AI applications, AI frameworks, CUDA, and GPU drivers keep emerging, and new versions are often incompatible with old ones. For example, suppose we need a new version of an AI framework that is incompatible with the CUDA version currently installed on our computation service. What should we do in such a situation? Should we update CUDA?
We can't, because other AI engineers require the old version of CUDA for their projects. That's the problem.
So, the problem is that we can't have two different versions of CUDA installed on our computation service, just as on any other single system. What if we could somehow isolate our applications from each other, so that they don't interfere with, or even know about, each other? Thankfully, such a trick exists nowadays: dockerization.
We use Docker and nvidia-docker to wrap AI applications together with all necessary dependencies, such as the AI framework and CUDA of the proper versions. This approach enables us to maintain different versions of TensorFlow, PyTorch, and CUDA on the same computation service machine.
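As a hedged sketch of this approach, a per-project image might pin its own CUDA runtime and framework versions independently of other projects on the same host. The base image tag, framework version, and `train.py` entry point below are all illustrative, not a prescription:

```dockerfile
# Illustrative only: pin a CUDA runtime matching this project's
# framework build, independent of other projects on the same host.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip

# Each project's image pins its own framework version.
RUN pip3 install torch==2.0.1

COPY . /app
WORKDIR /app
CMD ["python3", "train.py"]
```

With the NVIDIA Container Toolkit installed on the host, the container is launched with GPU access via `docker run --gpus all <image>`, while another container on the same machine can ship an entirely different CUDA and framework combination.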
Simple schema of AI solutions dockerization is shown below.
Machine learning workloads require significant processing power to advance quickly. Compared to CPUs, GPUs provide greater processing power, higher memory bandwidth, and a stronger capacity for parallelism.
You can use GPUs on-premises or in the cloud. Popular on-premises GPU vendors include NVIDIA and AMD. Cloud-based GPUs are offered by many vendors, including the top three: Azure, AWS, and Google Cloud. When choosing between on-premises and cloud GPU resources, you should consider both budget and skills.
On-premise resources typically come with a high upfront overhead, but the cost can stabilize in the long term. However, if you do not have the necessary skills to operate on-premise resources, you should consider cloud offerings, which can be easier to scale and often come with managed options.