Why Your AWS Deep Learning AMI Is Holding You Back and How to Fix It

If you're exploring your options for deep learning on AWS, you've likely considered using Deep Learning AMIs (Amazon Machine Images) to simplify your setup. Pre-configured environments can be a good starting point and may look like a no-brainer, but they have several limitations that will haunt you in the long run.

The Bloatware Problem

Deep Learning Amazon Machine Images (DLAMIs) come pre-installed with a plethora of applications, frameworks, and libraries. You will not need many of them in your production environment, and sometimes not even in development.

Outdated Drivers and Toolkits

Your favorite deep learning framework releases a new version with a feature that would be a valuable addition to your application. You are eager to start using it, only to discover that the CUDA toolkit and drivers on your AMI are too old to support it. Now you're locked into older drivers, an older toolkit, and an older framework version, and you miss out on the very feature you were so excited about.
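One way to catch this early is to compare what the installed driver supports against what the new framework release needs. The snippet below is a minimal sketch, not part of any official tooling: it parses the banner line that nvidia-smi prints (which reports the driver version and the highest CUDA version that driver supports) and checks it against a required CUDA version. The function names and the sample banner are illustrative assumptions.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract (driver_version, cuda_version) strings from nvidia-smi banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if not driver or not cuda:
        raise ValueError("could not find driver/CUDA versions in the given text")
    return driver.group(1), cuda.group(1)


def version_tuple(version):
    """Turn '12.2' into (12, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def supports_cuda(text, required_cuda):
    """True if the driver described by the banner supports `required_cuda` or newer."""
    _, cuda = parse_nvidia_smi_header(text)
    return version_tuple(cuda) >= version_tuple(required_cuda)


# Illustrative banner line (an assumption, not output captured from a real DLAMI).
SAMPLE = "| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |"

if __name__ == "__main__":
    print(parse_nvidia_smi_header(SAMPLE))
    print(supports_cuda(SAMPLE, "12.2"))
```

On a real instance you would feed the function the actual output of nvidia-smi instead of the sample string.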

Dependency Hell

Installing the modules or libraries your application requires can be challenging with these DLAMIs. You may find that the module you are trying to install requires version 2 of XYZ, but only version 1.5 is installed. Simply updating XYZ should resolve this. When you attempt it, however, you find that another application or library, ABC, depends on the installed version of XYZ. Your application does not need ABC, so you try to remove it, only to discover that yet another application depends on ABC, and the chain of dependencies seems never-ending.
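To make the chain concrete, here is a small sketch that walks a dependency graph backwards and lists everything that would be affected by touching XYZ. The package names (including DEF and GHI) and the graph itself are hypothetical, chosen only to mirror the scenario above; a real package manager tracks version constraints as well.

```python
from collections import deque

# Hypothetical dependency graph: package -> packages it requires.
REQUIRES = {
    "my_app": ["new_module"],
    "new_module": ["XYZ"],   # needs XYZ version 2
    "ABC": ["XYZ"],          # pinned to the installed XYZ 1.5
    "DEF": ["ABC"],
    "GHI": ["DEF"],
}


def dependents_of(package, requires):
    """Return every package that directly or transitively requires `package`."""
    # Invert the graph so we can look up "who depends on X".
    reverse = {}
    for pkg, deps in requires.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(pkg)

    # Breadth-first walk over the inverted graph.
    seen, queue = [], deque([package])
    while queue:
        current = queue.popleft()
        for parent in reverse.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(dependents_of("XYZ", REQUIRES))
```

Every name in the result is something you would have to upgrade, rebuild, or remove before XYZ can change, which is why a lean image with fewer pre-installed packages is easier to maintain.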

Limited Architecture Support

Suppose you want to leverage cost-effective instances like g5g.xlarge for deep learning inference. You may be out of luck: either no Deep Learning AMI supports them, or the only one available ships an older OS or outdated build tools. For ARM-based instances in particular, your choices are minimal. For example, the only DLAMI available for the mentioned instance family is the NVIDIA DLAMI, which is built on top of the older Ubuntu 20.04.
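If you maintain your own setup, it helps to detect the CPU architecture up front so the right packages are chosen. The following is a minimal sketch using Python's standard platform module; the alias table and function names are assumptions made for illustration.

```python
import platform

# Common spellings reported by different operating systems, mapped to one label.
ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "x86_64",
    "amd64": "x86_64",
}


def normalize_arch(machine):
    """Map a raw machine string (e.g. from platform.machine()) to 'arm64' or 'x86_64'."""
    key = machine.strip().lower()
    if key not in ARCH_ALIASES:
        raise ValueError(f"unsupported architecture: {machine!r}")
    return ARCH_ALIASES[key]


def needs_arm_build(machine=None):
    """True when running on (or asked about) an ARM-based machine."""
    return normalize_arch(machine or platform.machine()) == "arm64"


if __name__ == "__main__":
    print(needs_arm_build("aarch64"))
    print(needs_arm_build("x86_64"))
```

Called without arguments, needs_arm_build() inspects the machine it is running on.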

Solution

Frustrated with these limitations ourselves, we developed an automated, customizable script that sets up a high-performing deep learning environment on AWS EC2. The script downloads the latest NVIDIA drivers, CUDA 12.2, and the cuDNN library. It uses the latest Amazon Linux 2023 as its base AMI, which brings several advantages, including support for Linux kernel 6 and more recent versions of GCC and other build tools and utilities. The script then clones PyTorch and compiles it from source to ensure you have the latest CUDA device support.
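The script itself is not reproduced here, but the kind of configuration a source build needs can be sketched. The snippet below assembles environment variables that the PyTorch build reads (USE_CUDA, USE_CUDNN, TORCH_CUDA_ARCH_LIST, MAX_JOBS, CUDA_HOME). The install path /usr/local/cuda-12.2 and the compute capability 7.5 for the T4G GPU are assumptions for illustration, and this is not necessarily how our script does it.

```python
import os


def pytorch_build_env(cuda_home="/usr/local/cuda-12.2", arch_list="7.5", max_jobs=None):
    """Return a copy of the current environment with PyTorch build settings applied.

    cuda_home and arch_list are assumed defaults; adjust them for your instance.
    """
    if max_jobs is None:
        max_jobs = os.cpu_count() or 1

    env = dict(os.environ)
    env.update({
        "CUDA_HOME": cuda_home,
        "USE_CUDA": "1",                    # build with CUDA support
        "USE_CUDNN": "1",                   # build with cuDNN support
        "TORCH_CUDA_ARCH_LIST": arch_list,  # compile kernels only for this GPU generation
        "MAX_JOBS": str(max_jobs),          # limit parallel compile jobs to avoid running out of memory
    })
    # Put the toolkit's bin directory first so its nvcc is found first.
    env["PATH"] = cuda_home + "/bin" + os.pathsep + env.get("PATH", "")
    return env


if __name__ == "__main__":
    env = pytorch_build_env(max_jobs=4)
    for key in ("CUDA_HOME", "USE_CUDA", "USE_CUDNN", "TORCH_CUDA_ARCH_LIST", "MAX_JOBS"):
        print(key, "=", env[key])
```

The resulting dictionary can be passed as the env argument of subprocess.run when invoking whichever build command the README of your PyTorch checkout specifies. Restricting the architecture list to one GPU generation is what keeps the build lean.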

Performance and Cost Benefits

The customization allows for a lean, performance-optimized setup with a minimal footprint. Because the script clones PyTorch from the official repository and compiles it from source, you get hardware-specific optimization and up-to-date code, which can mean better performance and security than the pre-built PyTorch package. And if your compute tasks can tolerate interruptions, or you can design your application with failover in mind, you can take advantage of spot instances, which are incredibly cost-effective at as low as $0.152 per hour at the time of writing.
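To put that hourly rate in perspective, here is a quick back-of-the-envelope calculation. Only the $0.152 figure comes from this article; spot prices fluctuate, and the on-demand price in the comparison helper is a parameter you must supply from the current AWS pricing page.

```python
SPOT_PRICE_PER_HOUR = 0.152  # USD, the figure quoted in this article


def spot_cost(hours, price_per_hour=SPOT_PRICE_PER_HOUR):
    """Cost in USD of running one instance for `hours` at a fixed hourly price."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(hours * price_per_hour, 2)


def savings_percent(on_demand_price, spot_price=SPOT_PRICE_PER_HOUR):
    """Percentage saved by paying `spot_price` instead of `on_demand_price`."""
    if on_demand_price <= 0:
        raise ValueError("on_demand_price must be positive")
    return round((on_demand_price - spot_price) / on_demand_price * 100, 1)


if __name__ == "__main__":
    print("8-hour workday:", spot_cost(8))
    print("30 days, around the clock:", spot_cost(24 * 30))
```

Even running around the clock for 30 days (720 hours), the bill at this rate stays a little under $110.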

Want the Full Step-By-Step Guide? Dive In Here!

For a comprehensive walkthrough that addresses these problems and gives you access to this game-changing script, check out our complete guide, "Deep Learning on AWS Graviton2, NVIDIA Tensor T4G for as Low as Free with CUDA 12.2".