Training in Sagemaker with Huggingface, Tips & Tricks

Venturing into the realm of machine learning can be a bit overwhelming, especially when it comes to training and deploying models. AWS Sagemaker offers a handy platform to make this process smoother, but it's not without its own quirks and challenges. In this post, we'll explore a few lessons I've picked up along the way while developing training scripts for AWS Sagemaker and Huggingface. Hopefully they will be useful to you as you begin your own work.

The Basics

Today we'll be focusing on the Sagemaker Training Job functionality through the Sagemaker-Python-SDK. The SDK is moderately well documented, and that documentation is sufficient if your training jobs are for academic or personal purposes. In a corporate environment, however, training jobs need a handful of extra customizations to work properly, and in my opinion those are not well documented at the moment. Part of the reason is that Sagemaker simply passes parameters through to your python scripts, so it's up to you to understand how the python environment behaves.

This post assumes you already have a basic understanding of the Sagemaker-python-sdk and specifically the Huggingface estimator (the HuggingFace class in sagemaker.huggingface, referred to below as the HuggingFaceEstimator). If you haven't worked with the Sagemaker SDK yet, the following tips may not make much sense 😄

Custom PyPi Location

One of the first issues I had to overcome was how to use a custom internal corporate PyPi repository. We have various internal libraries not published publicly, so we needed to download packages from this private location.

Normally, I would add the extra PyPi index to ~/.pip/pip.conf, or add a custom source repository to the poetry pyproject.toml. In this case, however, Sagemaker doesn't support poetry (as far as I could tell), and it runs the python script directly on entry, so there is no opportunity to run bash commands that write to pip.conf. You can create a custom docker image and use it with Sagemaker (via the image_uri parameter of the Sagemaker Huggingface Estimator), but that carries its own overhead and adds complexity whenever you want to update the image to pick up a new version of the Huggingface DLC.

Thankfully, pip supports environment variables for pointing at an extra PyPi repo. Both PIP_INDEX_URL and PIP_EXTRA_INDEX_URL are supported (see env vars docs and install args docs). In my case, I still wanted to pull some packages from the default public PyPi repo, so instead of setting PIP_INDEX_URL I added our internal PyPi repo via the PIP_EXTRA_INDEX_URL environment variable.

So if you set the PIP_EXTRA_INDEX_URL environment variable via the estimator's environment parameter and put a requirements.txt at the root of the source_dir you pass in, Sagemaker will be able to find all of your private PyPi packages when it installs the requirements at the start of the training job.
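As a rough sketch of that setup, the environment dictionary could be built like this. The helper name with_extra_index and the index URL are hypothetical placeholders, not part of the SDK; the estimator call appears only in comments so the snippet runs without AWS access.

```python
from typing import Dict, Optional


def with_extra_index(
    extra_index_url: str, environment: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of `environment` with PIP_EXTRA_INDEX_URL added.

    Hypothetical helper: it only builds the dict that is later handed to the
    estimator's `environment` parameter.
    """
    merged = dict(environment or {})
    merged["PIP_EXTRA_INDEX_URL"] = extra_index_url
    return merged


# Placeholder URL; substitute your internal repository's "simple" index.
environment = with_extra_index("https://pypi.internal.example.com/simple")

# The dict would then be passed along these lines (not executed here):
#   from sagemaker.huggingface import HuggingFace
#   estimator = HuggingFace(
#       entry_point="train.py",
#       source_dir="src",          # contains requirements.txt at its root
#       environment=environment,
#       ...                        # role, instance type, versions, etc.
#   )
```

Because PIP_EXTRA_INDEX_URL adds an index rather than replacing the default one, packages on the public PyPi repo keep resolving as before.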

Using the full space of the EC2

Another hidden quirk of the HuggingFaceEstimator is that Huggingface doesn't automatically take advantage of the volume_size setting. You can set the volume_size of the Sagemaker training job to whatever you like, but that parameter only adjusts the size of the volume mounted into the running job under the /opt/ml directory. The default cache location for huggingface artifacts is ~/.cache, so no matter how large you make volume_size, huggingface will run out of space if you try to download a model or dataset bigger than the container's root volume (the volume that backs the ~ directory). To fix this, adjust the HuggingFace settings in the python training script so that all files are cached under /opt/ml, which makes use of the space allotted by the Sagemaker volume_size parameter.

I put this into my python training code, but you could also add it to the environment parameter when creating the HuggingFaceEstimator.

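import os

# SM_MODEL_DIR is set by Sagemaker inside the training container.
# Set these before importing transformers or datasets, since both read
# their cache locations at import time.
if os.getenv("SM_MODEL_DIR"):
    print(
        "Running in Sagemaker so setting the cache onto the EBS device instead of the HF default of ~/.cache"
    )
    os.environ["TRANSFORMERS_CACHE"] = "/opt/ml/.transformers/"
    os.environ["HF_DATASETS_CACHE"] = "/opt/ml/.datasets/"
    os.environ["HF_HOME"] = "/tmp"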
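The alternative mentioned above, passing the same settings through the estimator's environment parameter, might look roughly like this. The helper hf_cache_environment is a hypothetical name; the paths simply mirror the ones used in the training-script snippet.

```python
from typing import Dict


def hf_cache_environment(cache_root: str = "/opt/ml") -> Dict[str, str]:
    """Build cache-related environment variables for the training job.

    Hypothetical helper: the values mirror the in-script settings, with the
    model and dataset caches placed under `cache_root`.
    """
    root = cache_root.rstrip("/")
    return {
        "TRANSFORMERS_CACHE": f"{root}/.transformers/",
        "HF_DATASETS_CACHE": f"{root}/.datasets/",
        "HF_HOME": "/tmp",
    }


cache_env = hf_cache_environment()

# Merged with any other variables and passed as `environment=...` when the
# estimator is created (not executed here).
```

One reason to prefer this route is that the variables are already present when the training process starts, so the order of imports inside the script no longer matters.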

You could also set the cache_dir parameter when loading the model:

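from transformers import AutoModelForCausalLM

# model_id is the Hub model name (or local path) you are loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir="/opt/ml/.transformers",
)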

Passing in the HF token

Next up: sometimes, in order to pull a huggingface model checkpoint, you need to authenticate with a huggingface token (this is true for those who want to use Llama 2, among other models). This means the Sagemaker training job needs to be given the credential in some way so that it can authenticate when downloading the model. I chose to implement this by passing my login token as an environment variable, which the training script can then use to log in before downloading the model.

When setting up the SagemakerEstimator:

from huggingface_hub import HfFolder
...
kwargs["environment"] = {
    **kwargs.get("environment", {}),
    "HF_TOKEN": HfFolder.get_token(),
}

Assuming that kwargs are the arguments to be passed to the SagemakerEstimator, this will add the token to the environment variables.
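One case the snippet above does not cover is a machine with no saved token, where HfFolder.get_token() returns None. A minimal sketch of a guard, using a hypothetical helper named add_hf_token, could be:

```python
from typing import Any, Dict, Optional


def add_hf_token(kwargs: Dict[str, Any], token: Optional[str]) -> Dict[str, Any]:
    """Return a copy of `kwargs` with HF_TOKEN merged into `environment`.

    Hypothetical helper: fails early when no token is available instead of
    handing a None value to the estimator.
    """
    if not token:
        raise ValueError(
            "No Hugging Face token found; log in locally (for example with "
            "`huggingface-cli login`) before launching the job."
        )
    environment = {**kwargs.get("environment", {}), "HF_TOKEN": token}
    return {**kwargs, "environment": environment}


# Usage sketch (not executed here):
#   from huggingface_hub import HfFolder
#   kwargs = add_hf_token(kwargs, HfFolder.get_token())
```

Failing at launch time is cheaper than discovering the missing credential after the training instance has already started.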

Then, in the transformer training code:

import os

from huggingface_hub import login

if os.getenv("HF_TOKEN"):
    print("Logging into the Hugging Face Hub with token...")
    login(token=os.getenv("HF_TOKEN"))

Subclassing

When I originally integrated with the Sagemaker HuggingFaceEstimator, I wrote a script to set up all the particular settings required to run it in my environment, like AWS VPC, Subnet, SecurityGroup, and RequiredTags for security and billing tracking in the company. Then, in every project where I wanted to run a Sagemaker training job, I copied and pasted this code. The amount of code I had to copy was just small enough that I kept thinking "This probably isn't worth turning into a library". But then a requirement related to the networking settings changed, and I had to update code in around 10 locations. This is of course a classic software engineering mistake: copying code that should have been written once, in a modular way, so that others can reuse it instead of duplicating it.

I went back and refactored the code so that I have a new CustomEstimator class that is a subclass of the HuggingFaceEstimator, and it resulted in a much cleaner implementation. This is a pretty basic takeaway, but now all of the networking and common requirement code lives in a single place that all of my projects can use, and when a requirement changes, it only needs to be updated in that CustomEstimator class. In my defense, we are slightly set up for failure here, because all of the Huggingface tutorials for Sagemaker set everything up in a script instead of subclassing (I assume to keep the examples as simple as possible). If you copy and paste the huggingface example code as a starting point, you are likely to do what I did and end up without a class-based design from the start.
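To make the idea concrete, here is a minimal sketch of such a subclass. It is not my actual class: the subnet, security group, and tag values are placeholders, and the tiny HuggingFace class below is a stand-in for sagemaker.huggingface.HuggingFace so the example runs without the SDK or AWS access. The parameter names subnets, security_group_ids, and tags are the ones the real estimator accepts.

```python
class HuggingFace:
    """Stand-in for sagemaker.huggingface.HuggingFace (records its kwargs)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class CustomEstimator(HuggingFace):
    """Estimator with company-wide networking and tagging defaults baked in."""

    # Placeholder values; replace with your organization's settings.
    DEFAULT_SUBNETS = ["subnet-PLACEHOLDER"]
    DEFAULT_SECURITY_GROUP_IDS = ["sg-PLACEHOLDER"]
    REQUIRED_TAGS = [{"Key": "team", "Value": "ml-platform"}]

    def __init__(self, **kwargs):
        # Networking defaults apply only when the caller did not set them.
        kwargs.setdefault("subnets", list(self.DEFAULT_SUBNETS))
        kwargs.setdefault(
            "security_group_ids", list(self.DEFAULT_SECURITY_GROUP_IDS)
        )
        # Required tags always win; caller tags with other keys are kept.
        required_keys = {tag["Key"] for tag in self.REQUIRED_TAGS}
        caller_tags = [
            tag for tag in (kwargs.get("tags") or [])
            if tag["Key"] not in required_keys
        ]
        kwargs["tags"] = caller_tags + list(self.REQUIRED_TAGS)
        super().__init__(**kwargs)


estimator = CustomEstimator(
    entry_point="train.py",
    tags=[{"Key": "project", "Value": "demo"}],
)
```

With the real base class swapped in, each project only passes what is specific to it (entry point, hyperparameters, instance type), and a change to the networking requirements touches one class.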

Conclusion

I hope that this was helpful to learn about some obvious and not-so-obvious ways to use Sagemaker. Happy hacking!

Cover art generated in Canva with prompt "Computer thrown against brick wall by frustrated programmer"