Picking the right LLM deployment configuration in AWS


Image: Bing Image Creator — "three cybernetic roads in the wilderness diverge, I took the one with a nucleus at the end of it"

Deploying Large Language Models (LLMs) on AWS

Many Large Language Models (LLMs) are being trained with Artificial General Intelligence (AGI) in mind: the goal is to build an AI system that can achieve human-level intelligence. Since LLMs can already mimic aspects of human reasoning across a variety of tasks, there is a never-ending list of experiments to test whether an LLM can simplify or improve a process that is currently expensive or time-intensive. Amazon Web Services (AWS) is one of the largest cloud computing providers and offers several options for hosting or using an LLM. The options can be overwhelming, and the best choice depends on your specific interests, budget, and the control you want over data and processing.

Although this is not a comprehensive list, in my opinion there are three solid options for running LLMs in AWS, each with its own benefits:

  1. AWS Bedrock
  2. AWS SageMaker Endpoint
  3. Amazon Elastic Compute Cloud (EC2) accelerated computing with Docker

In this blog post we will explain how to accomplish each approach, along with a list of Pros and Cons for each design. We'll start with the easiest option, where you have the least control over the technical details (Bedrock), and end with the most difficult option, where you have the most control (EC2).

1. AWS Bedrock

AWS Bedrock is a fully managed service that simplifies the deployment and hosting of machine learning models, including LLMs. It completely abstracts away the underlying infrastructure, allowing you to focus on your experiment and not need to worry about how to get an LLM up and running.

With Bedrock, you are generally charged per token, which means you don't need to worry about provisioning an appropriately sized compute cluster to support the workload. The throughput scales to however many requests per second you need to send.

As for model availability, Bedrock supports a large selection of both open and proprietary models without much hassle, such as Anthropic's Claude 3 and Meta's Llama 3; there are also ways to use custom-trained LLMs, though with more effort.

How to Host the LLM

There is nothing to do other than ensure that you've enabled access to the model you want to try (e.g. Llama 3) in your AWS account.

How to Use the LLM

AWS offers a suite of Software Development Kits (SDKs) as well as command line interface (CLI) tools to support the use of Bedrock. One example of a Bedrock client wrapped inside a UI can be seen here: Amazon Bedrock Client for Mac.
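As an illustration, here is a minimal sketch of invoking a model through the Bedrock Runtime API with Boto3. The model ID, region, and request shape follow Bedrock's documented payload format for Meta Llama models, but treat them as assumptions and adjust for whichever model you've enabled:

```python
import json


def build_llama3_body(prompt: str, max_gen_len: int = 256, temperature: float = 0.5) -> str:
    """Build the JSON request body in the shape Bedrock expects for Meta Llama models."""
    return json.dumps({
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
    })


def invoke_llama3(prompt: str, region: str = "us-east-1") -> str:
    """Call the Bedrock Runtime InvokeModel API (requires AWS credentials and model access)."""
    import boto3  # AWS SDK for Python; imported here so the helper above stays dependency-free

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId="meta.llama3-8b-instruct-v1:0",  # assumed model ID; check what's enabled in your account
        body=build_llama3_body(prompt),
    )
    # Llama responses on Bedrock return the completion under the "generation" key
    return json.loads(response["body"].read())["generation"]
```

Because billing is per token, this is the entire integration: no cluster to size, no endpoint to keep warm.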


Pros

  • Fully managed service, reducing operational overhead
  • No need to manage LLM infrastructure at all; the charge is per token
  • Scales seamlessly if you need to send more requests in parallel
  • State-of-the-art LLM offerings


Cons

  • Complexity grows when customization is needed for advanced use cases
  • Higher cost compared to self-managed solutions: at a high level of consistent usage, it may be cheaper to host separate infrastructure, especially if speed is not the most important factor
  • Potential vendor lock-in: this ties you closely into the AWS ecosystem.
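To make the cost trade-off concrete, here is a back-of-the-envelope break-even sketch. The prices below are purely illustrative placeholders, not current AWS rates; plug in real numbers from the AWS pricing pages:

```python
def bedrock_monthly_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def ec2_monthly_cost(hourly_rate: float, hours: float = 730) -> float:
    """Self-hosted: you pay for every hour the instance runs, regardless of traffic."""
    return hourly_rate * hours


# Illustrative placeholder prices (NOT real AWS rates):
per_1k = 0.002     # hypothetical per-1k-token price
gpu_hourly = 5.00  # hypothetical GPU instance hourly rate

# At low usage, per-token pricing wins; at high sustained usage,
# the fixed instance cost can come out cheaper.
low_usage = bedrock_monthly_cost(10_000_000, per_1k)      # 10M tokens/month
high_usage = bedrock_monthly_cost(5_000_000_000, per_1k)  # 5B tokens/month
fixed = ec2_monthly_cost(gpu_hourly)                      # always-on instance
```

With these placeholder numbers, the always-on instance sits between the low- and high-usage per-token bills, which is exactly the break-even analysis worth running before committing to either design.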

2. AWS SageMaker Endpoint

AWS SageMaker is a comprehensive machine learning service that provides tools for building, training, and deploying models, including LLMs. The naming can be confusing, but "SageMaker Endpoint" refers specifically to the component of SageMaker that handles deploying machine learning models. If the model was trained via a SageMaker Training Job (a different part of the SageMaker suite), deployment is as simple as specifying the S3 bucket and key where the model.tar.gz file is located. With SageMaker Endpoints, you can host your LLM, serve it as a real-time inference endpoint on the specified hardware, and scale it as needed. Phil Schmid has excellent tutorials for deploying LLMs to SageMaker, for example Deploy Llama 3 on Amazon SageMaker.

How to Host the LLM

A SageMaker endpoint needs to be deployed to host the model. This can be done via the AWS Console, the sagemaker-python-sdk, or infrastructure-as-code tools like Terraform.
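As a sketch of the sagemaker-python-sdk route (following the pattern in Phil Schmid's tutorial), deploying a Hugging Face model to a real-time endpoint looks roughly like this. The model ID, instance type, and role ARN are placeholder assumptions you would replace with your own:

```python
def tgi_endpoint_config(model_id: str, num_gpus: int, max_input_tokens: int = 4096) -> dict:
    """Environment variables consumed by the Hugging Face TGI container on SageMaker."""
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),  # TGI uses this to shard the model across GPUs
        "MAX_INPUT_LENGTH": str(max_input_tokens),
    }


def deploy_endpoint():
    """Requires AWS credentials and the `sagemaker` package; not runnable offline."""
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    model = HuggingFaceModel(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role ARN
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=tgi_endpoint_config("meta-llama/Meta-Llama-3-8B-Instruct", num_gpus=4),
    )
    # Spins up the real-time endpoint on the chosen instance type
    return model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
```

The same configuration can instead be expressed in Terraform or clicked through in the Console; the SDK version is simply the most compact to show.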

How to Use the LLM

AWS offers a suite of Software Development Kits (SDKs) as well as command line interface (CLI) tools to support the use of SageMaker. For example, the AWS Boto3 library can be used to invoke an endpoint, as seen here.
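A minimal sketch of invoking a deployed endpoint with Boto3 follows. The endpoint name is a placeholder, and the payload shape assumes the TGI container's `inputs`/`parameters` format; a different serving container may expect a different body:

```python
import json


def build_tgi_payload(prompt: str, max_new_tokens: int = 128) -> str:
    """Request body in the format the TGI serving container expects."""
    return json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}})


def query_endpoint(prompt: str, endpoint_name: str = "my-llm-endpoint") -> str:
    """Call SageMaker Runtime (requires AWS credentials and a live endpoint)."""
    import boto3  # imported here so the payload helper above stays dependency-free

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,  # placeholder name
        ContentType="application/json",
        Body=build_tgi_payload(prompt),
    )
    # TGI returns a list of generations; take the first one's text
    return json.loads(response["Body"].read())[0]["generated_text"]
```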


Pros

  • Seamless integration with SageMaker's training and deployment workflows
  • Flexible customization options for advanced use cases
  • Autoscaling and load balancing capabilities

Cons

  • Requires more setup and configuration compared to AWS Bedrock
  • Potentially higher operational overhead for self-management
  • Limited to AWS infrastructure

3. Running LLMs on AWS EC2 with Docker

The final popular option for model deployment is to run your LLM on an Amazon Elastic Compute Cloud (EC2) instance using a Docker container. The Hugging Face (HF) inference container offers a simple Docker interface for deploying an LLM: https://github.com/huggingface/text-generation-inference. What makes the HF inference container especially compelling is its support for Tensor Parallelism, which can improve token-per-second generation speed by splitting the model weights across GPUs in a way that keeps GPU utilization even at every point of the computation (docs)
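To build intuition for why that split works, here is a toy sketch (plain Python, no real GPUs) of column-parallel matrix multiplication, the building block of tensor parallelism: each device holds a slice of the weight matrix's columns, computes its share independently, and concatenating the shard outputs reproduces the full result:

```python
def matmul(x, W):
    """Naive matrix multiply for small nested lists (stand-in for a GPU kernel)."""
    return [
        [sum(x[i][k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
        for i in range(len(x))
    ]


x = [[1.0, 2.0]]                 # a single activation row
W = [[1.0, 2.0, 3.0, 4.0],       # full weight matrix (2 x 4)
     [5.0, 6.0, 7.0, 8.0]]

W0 = [row[:2] for row in W]      # columns held by "device 0"
W1 = [row[2:] for row in W]      # columns held by "device 1"

full = matmul(x, W)                             # single-device result
sharded = [matmul(x, W0)[0] + matmul(x, W1)[0]]  # concatenate the two shard outputs
assert full == sharded  # both devices stayed equally busy, and the math agrees
```

Each shard's multiply involves half the weights, which is why work stays balanced across GPUs for the whole forward pass rather than handing layers off one device at a time.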

How to Host the LLM

It's all up to you. You are responsible for picking the EC2 instance type, setting the instance up with all the appropriate tools (e.g. Docker, Python, etc.), configuring the networking so that you can reach the IP addresses where you need to retrieve resources, and running and monitoring the model once it's up (for instance, using the nvidia-smi utility if using an NVIDIA GPU).

How to Use the LLM

It's all up to you. If you use the Hugging Face text-generation-inference container, they offer a Python InferenceClient class that can ease the communication (link), but you could also use curl directly, as they show here.
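As a sketch, once the TGI container is running on the instance, you can hit its /generate route with plain HTTP; the host and port here are assumptions about how you started the container:

```python
import json
import urllib.request


def build_generate_request(prompt: str, max_new_tokens: int = 128) -> bytes:
    """JSON body for TGI's /generate route."""
    return json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode()


def generate(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to a running text-generation-inference server (requires the container to be up)."""
    req = urllib.request.Request(
        f"{base_url}/generate",  # assumed host/port; depends on your docker run flags
        data=build_generate_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]
```

Swapping urllib for the InferenceClient class gives the same result with less boilerplate; the point is that the server is just HTTP, so any client works.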


Pros

  • Maximum flexibility and customization
  • No vendor lock-in
  • Potentially lower costs at high, sustained usage


Cons

  • Higher operational overhead for self-management
  • Increased complexity for load balancing, scaling, and monitoring
  • Potential security risks if not configured properly
  • Potentially higher costs if usage is low, since you generally pay for the hours the instance is running, not for actual usage.

For SageMaker Endpoint or EC2, Which Instance Should I Use?

This page offers a list of the available accelerated computing instance types: https://docs.aws.amazon.com/ec2/latest/instancetypes/ac.html. For running medium-sized LLMs like Mistral-7B or Llama-7B, g6.12xlarge (4 x NVIDIA L4 GPUs) or p4d.24xlarge (8 x NVIDIA A100 GPUs) are popular choices, depending on budget constraints.
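A quick way to sanity-check an instance choice is to estimate the memory the weights alone require: roughly 2 bytes per parameter in fp16/bf16, ignoring KV-cache and activation overhead. A minimal sketch:

```python
def fp16_weights_gib(n_params_billion: float) -> float:
    """Approximate GiB needed to hold the weights in fp16/bf16 (2 bytes per parameter)."""
    return n_params_billion * 1e9 * 2 / 2**30


# A 7B model needs roughly 13 GiB for its weights, so it fits on a single
# 24 GiB L4; a 70B model (~130 GiB) needs multiple A100s sharded together.
seven_b = fp16_weights_gib(7)
seventy_b = fp16_weights_gib(70)
```

Leave generous headroom beyond this estimate for the KV cache, which grows with batch size and context length.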


Conclusion

AWS Bedrock offers a fully managed solution for deploying LLMs with minimal operational overhead, while AWS SageMaker Endpoint provides more flexibility and integration with SageMaker's ecosystem. Running LLMs on AWS EC2 with Docker offers the highest level of customization but requires more self-management and operational effort. The choice depends on your specific requirements, such as scalability, customization needs, and operational capabilities.