Balancing Cost and Availability in ECS Using Capacity Providers

When you think container orchestration, you usually think Kubernetes.

Amazon Elastic Container Service (ECS) is a strong alternative, especially when you want to keep costs low and don’t need the extra bells and whistles that come with Kubernetes. ECS’ managed service approach is a big help too, as it takes care of the heavy lifting behind scheduling, scaling, and load balancing. Developers can focus on building and maintaining their applications without getting bogged down by infrastructure management.

But when you create any plan, it pays to think about the worst-case scenarios and tailor your solution accordingly. This post describes one such scenario with ECS.

Building ECS Clusters on a Budget

We were tasked with architecting the infrastructure for a fairly large application in AWS while keeping expenses low. The client also wanted us to use AWS-managed services wherever possible.

We decided to deploy the application across three separate ECS clusters, each managing close to 20 different services. When making this decision, we also considered the future requirements of the business. As and when new services are added, the system should be able to handle them.

While scaling the services, we kept the cluster itself fairly simple. We implemented Auto Scaling groups to dynamically adjust the number of EC2 instances based on the cluster's capacity requirements. During peak workloads, more instances would be added and during low-traffic periods, unnecessary instances would be removed, optimizing resource allocation and costs. To ensure high availability, we maintained a minimum of two instances when adding Auto Scaling.

Enter Spot Instances and Spot Termination Risks!

To further reduce the overall cost, we introduced Spot Instances into the mix. We planned for a 30:70 ratio of Spot Instances to On-Demand EC2 instances. As you know, a Spot Instance costs less than an On-Demand Instance, with the caveat that AWS can reclaim it at any time, with only a two-minute interruption notice. In such cases, the Auto Scaling group would automatically launch a new instance, and any containers that were running on the reclaimed instance would be recreated on one of the other available instances.

While ECS tries to spread a service's containers across the servers in the cluster, there is no guarantee that containers of the same service will never be launched on the same server. As a result, all containers for a specific service could end up on a single instance. If that instance happened to be a Spot Instance, a single termination could bring down the entire service for a brief period, until its containers were recreated on one of the existing or newer instances.

We recently had an issue on a different service that used Spot Instances, wherein multiple Spot Instances went down at the same time because of an issue at AWS’ end. The issue itself lasted only a few minutes, but it resulted in a few hours' worth of operational effort to find the missed messages and replay them through the system. This experience was still vivid in our minds, so the risk of a few minutes of downtime followed by potentially hours of extra work was not acceptable to us.

To address this challenge, we considered various solutions. One was to create multiple services for the same task, essentially duplicating functionality. Splitting Auto Scaling groups seemed feasible, but managing them individually would be a pain. We even considered abandoning Spot Instances altogether, sacrificing cost efficiency for operational stability.

Cost Vs Efficiency: Hitting the Sweet Spot with Capacity Providers

The solution that provided the best balance was capacity providers. A capacity provider is an Amazon ECS feature that ties a pool of compute capacity to a cluster and lets you specify, per service, how tasks should be spread across those pools. For EC2-backed clusters, each capacity provider still uses an Auto Scaling group in the backend, but it connects the group to the cluster so that ECS can manage the group's scaling for you.

We created two capacity providers for each cluster:

cp-normal: This provider is configured exclusively for on-demand EC2 instances, ensuring stable and predictable resources throughout its lifecycle.
cp-spot: This provider leverages Spot Instances, offering cost-efficiency.
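
As a rough illustration, the two providers could be described with request payloads like the ones below. This is a minimal sketch, not our exact setup: the helper name and the Auto Scaling group ARNs are hypothetical placeholders, and the resulting dictionaries are shaped for the boto3 ECS client's create_capacity_provider call, which is mentioned only in a comment so that the snippet runs without AWS credentials.

```python
# Sketch only: the ARNs below are made-up placeholders, not real resources.
ON_DEMAND_ASG_ARN = "arn:aws:autoscaling:region:account-id:autoScalingGroup:example:autoScalingGroupName/on-demand-asg"
SPOT_ASG_ARN = "arn:aws:autoscaling:region:account-id:autoScalingGroup:example:autoScalingGroupName/spot-asg"


def capacity_provider_request(name, asg_arn, target_capacity=100):
    """Build keyword arguments for the ECS create_capacity_provider API.

    target_capacity is the percentage of the Auto Scaling group that
    managed scaling tries to keep in use (valid values are 1 to 100).
    """
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    return {
        "name": name,
        "autoScalingGroupProvider": {
            "autoScalingGroupArn": asg_arn,
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": target_capacity,
            },
        },
    }


cp_normal = capacity_provider_request("cp-normal", ON_DEMAND_ASG_ARN)
cp_spot = capacity_provider_request("cp-spot", SPOT_ASG_ARN)

# With credentials configured, each request could be submitted like this:
#   import boto3
#   boto3.client("ecs").create_capacity_provider(**cp_normal)
```

Whether an instance is Spot or On-Demand is decided by the Auto Scaling group behind each provider, not by the capacity provider itself.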

For each ECS service, we implemented a dedicated capacity provider strategy, which dictates how containers are placed across the specified providers. We utilized two key parameters within this strategy:

Base: This parameter defines the minimum number of containers that must be launched on a specific provider. We set a base value of 2 for the cp-normal provider to ensure service availability via the on-demand servers.

Weight: This parameter defines the relative weightage for each provider when launching new containers. We initially configured a weight of 7 for cp-normal and 3 for cp-spot.

We created a capacity provider strategy for each service as follows:

Capacity Provider	Base	Weight
cp-normal	2	7
cp-spot	0	3
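
Expressed as data, the table above could look like the sketch below. The helper name build_strategy is hypothetical. The dictionary keys (capacityProvider, base, weight) are the ones the ECS API expects in a service's capacityProviderStrategy, and the check reflects the ECS rule that only one provider in a strategy may define a base.

```python
def build_strategy(entries):
    """Turn (provider, base, weight) rows into an ECS capacityProviderStrategy list."""
    # ECS allows only one capacity provider in a strategy to have a base.
    providers_with_base = [name for name, base, _ in entries if base > 0]
    if len(providers_with_base) > 1:
        raise ValueError("only one capacity provider may define a base")
    return [
        {"capacityProvider": name, "base": base, "weight": weight}
        for name, base, weight in entries
    ]


STRATEGY = build_strategy([
    ("cp-normal", 2, 7),  # first two containers always land on on-demand
    ("cp-spot", 0, 3),    # the rest are shared 7:3 with Spot
])

# With credentials configured, the strategy could be passed when creating
# a service, alongside the other required arguments:
#   boto3.client("ecs").create_service(..., capacityProviderStrategy=STRATEGY)
```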

As per the above configuration, the first two containers of each service were always placed on the on-demand EC2 instances. Any further containers were then distributed so that the 7:3 weight was maintained: for every ten containers beyond the base, roughly seven went to cp-normal and three to cp-spot.
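
To get a feel for the numbers, the split can be approximated with a few lines of Python. This is a simplified model and not the actual ECS scheduler: it assumes the base is filled first and the remaining containers are divided in proportion to the weights, with leftovers handed out by largest remainder. The function name split_tasks is our own.

```python
def split_tasks(desired, strategy):
    """Approximate how many tasks each provider receives.

    strategy is a list of (provider, base, weight) tuples.
    """
    counts = {name: 0 for name, _, _ in strategy}
    remaining = desired

    # Step 1: satisfy the base values first.
    for name, base, _ in strategy:
        placed = min(base, remaining)
        counts[name] += placed
        remaining -= placed

    total_weight = sum(weight for _, _, weight in strategy)
    if remaining == 0 or total_weight == 0:
        return counts

    # Step 2: share the rest in proportion to the weights (integer math).
    remainders = []
    for name, _, weight in strategy:
        whole, rest = divmod(remaining * weight, total_weight)
        counts[name] += whole
        remainders.append((rest, name))

    # Step 3: hand out any leftover tasks by largest remainder.
    leftover = desired - sum(counts.values())
    ranked = sorted(remainders, key=lambda item: item[0], reverse=True)
    for _, name in ranked[:leftover]:
        counts[name] += 1
    return counts


STRATEGY = [("cp-normal", 2, 7), ("cp-spot", 0, 3)]
print(split_tasks(12, STRATEGY))  # {'cp-normal': 9, 'cp-spot': 3}
```

With 12 desired containers, two go to cp-normal as the base and the remaining ten split 7:3. With only two containers, both stay on on-demand instances, which is the property we cared about most.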

This configuration prioritized on-demand instances (cp-normal) due to the defined base value of 2. However, it would also place containers onto Spot Instances based on the weightage, allowing us to take advantage of their lower cost. This approach gave us a more balanced container distribution across the cluster and removed the risk that a Spot Instance termination alone could take down every container of a service.

Bonus Advantage: Managed Scaling

An added bonus of utilizing capacity providers was managed scaling. Had we not used capacity providers, managing clusters with Auto Scaling groups would have been a nightmare. We would have had to juggle multiple metrics to determine when to scale up or down. If down the line each cluster required a different set of thresholds, these policies would have to be recalculated each time. With plain Auto Scaling groups, this was an extra burden we had resigned ourselves to carrying.

Capacity providers afforded a simpler approach. ECS automatically allocates and deallocates resources within the cluster based on its requirements using a target tracking scaling policy. This policy continuously monitors a specific metric named CapacityProviderReservation and automatically adjusts the number of instances within the capacity provider to maintain the desired target level.
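
For intuition, AWS describes CapacityProviderReservation as the number of instances the cluster needs divided by the number it is currently running, expressed as a percentage. The sketch below follows that description, including the two zero-instance cases as we understand them from AWS's published explanation of cluster auto scaling. The function and parameter names are our own, and working out how many instances are "needed" is done by ECS and is not modelled here.

```python
def capacity_provider_reservation(instances_needed, instances_running):
    """Return the CapacityProviderReservation percentage.

    instances_needed: instances ECS estimates are required for all tasks.
    instances_running: instances currently running in the Auto Scaling group.
    """
    if instances_running == 0:
        # Nothing running: 200 if tasks are waiting for capacity
        # (forces a scale-out), 100 if nothing is needed either.
        return 200.0 if instances_needed > 0 else 100.0
    return instances_needed / instances_running * 100


# With a target of 100, a value above 100 triggers scale-out
# and a value below 100 triggers scale-in.
print(capacity_provider_reservation(5, 4))  # 125.0 -> add instances
print(capacity_provider_reservation(3, 4))  # 75.0  -> remove instances
```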

Key Benefits Achieved

Here’s a recap of the benefits we gained using capacity providers:

Balanced container distribution: Separating instances into distinct capacity providers and using capacity provider strategies led to an even distribution of containers, and ensured that a Spot Instance termination alone could not take down an entire service.
Cost optimization: Since the first two containers were always available on the on-demand servers, we could change the weightage as desired to further reduce the cost. For less critical clusters, we switched to a weight of 1:1. So at peak loads, roughly half of the containers beyond the base would be on Spot Instances while the remaining half would be on on-demand instances.
Evolving with AWS: As AWS continues to develop and enhance capacity providers, we can leverage the latest capabilities to further optimize our containerized deployments.

Are you struggling with a cloud or DevOps challenge? We can help you sort it out. Just drop us a line!