If you’re operating a Concourse cluster on Kubernetes, you may eventually need to autoscale your Concourse workers to handle expanding and contracting workloads. The Concourse helm chart supports Kubernetes’ Horizontal Pod Autoscaler, which enables autoscaling based on observed CPU utilization or custom metrics.

By default, each Concourse worker allows only 250 containers to run concurrently. When a worker hits that limit, it can no longer take on additional tasks, and once all of your workers reach that point, your Concourse cluster is effectively unusable. You could raise the limit, but you should understand the implications before doing so. The other option is to autoscale the Concourse workers based on the average number of concurrently running containers per worker.
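If you do decide to raise the limit, it’s just worker configuration. Here’s a rough sketch of what that could look like in the chart’s values.yaml, assuming your chart version lets you pass extra environment variables to the workers via worker.env (the exact key and the env var name depend on your chart version and worker runtime, so verify both before using this):

worker:
  env:
    # Hypothetical illustration: raise the per-worker container limit.
    # The Guardian runtime uses CONCOURSE_GARDEN_MAX_CONTAINERS; the
    # containerd runtime uses CONCOURSE_CONTAINERD_MAX_CONTAINERS instead.
    - name: CONCOURSE_GARDEN_MAX_CONTAINERS
      value: "350"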

At a high level, here’s what we need to do, in order:

  • Expose Prometheus metrics from Concourse
  • Install Prometheus to collect metrics
  • Install the Prometheus Adapter to act as a metric API server to make custom metrics available to Kubernetes
  • Enable autoscaling on the Concourse side, which will create a HorizontalPodAutoscaler resource and utilize the custom metric made available to Kubernetes to autoscale the Concourse worker pods

There isn’t a lot of documentation out there specific to this use case (or at least I couldn’t find any), so hopefully this will be helpful to someone out there (or future me).

Enable Concourse to expose Prometheus metrics

I mentioned in a previous post that Concourse can be enabled to emit metrics about itself. It supports a few types of metric emitters, but I chose Prometheus since we had the most experience with it.
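For reference, here’s roughly what that looks like in the Concourse chart’s values.yaml. This is a minimal sketch assuming the chart exposes the Prometheus emitter under concourse.web.prometheus (key names and the default port can vary across chart versions, so check the values of the chart you’re using):

concourse:
  web:
    prometheus:
      # Turn on the Prometheus metrics emitter on the web (ATC) nodes.
      enabled: true
      bindIp: 0.0.0.0
      bindPort: 9391

If you rely on the Prometheus helm chart’s annotation-based scrape configuration, the web pods (or the service in front of them) also need the usual prometheus.io/scrape and prometheus.io/port annotations pointing at that port.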

Install the Prometheus server

Once you’ve enabled your Concourse cluster to emit Prometheus metrics, a Prometheus server running in the same Kubernetes cluster can discover and collect those metrics (the Prometheus helm chart’s default scrape configuration picks up pods and services annotated with prometheus.io/scrape). We installed Prometheus via its helm chart, and the only component we needed here is the Prometheus server itself, which pulls the metrics exposed by your services on their /metrics endpoints and stores them in its time-series database for querying.

Here’s the values.yaml configuration we used for the Prometheus helm chart:

alertmanager:
  enabled: false

kubeStateMetrics:
  enabled: false

nodeExporter:
  enabled: false

pushgateway:
  enabled: false

server:
  nodeSelector:
    compute-load: prometheus
  persistentVolume:
    size: 20Gi
    storageClass: "${server_storage_class_name}"
  resources:
    requests:
      cpu: 2
      memory: 3Gi
  retention: "7d"

Install the Prometheus adapter

In order to make custom metrics available to Kubernetes, we need them to be exposed via Kubernetes’ custom metrics API. This is enabled via “adapter” API servers like the Prometheus Adapter. We installed the Prometheus Adapter via its helm chart and here is the values.yaml configuration we used:

prometheus:
  url: http://<name-of-prometheus-k8s-service>.<namespace>.svc.cluster.local
  port: 80

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

rules:
  default: false
  custom:
    - seriesQuery: 'concourse_workers_containers{worker=~"concourse-worker-.*", kubernetes_namespace="concourse"}'
      resources:
        overrides:
          kubernetes_namespace: { resource: "namespace" }
          worker: { resource: "pod" }
      name:
        matches: "^(.*)"
        as: "${1}_avg"
      metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>,worker=~"concourse-worker-.*"}[5m])'
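
Before wiring up the HPA, it’s worth confirming that the adapter is actually serving the new metric: once the rule above is loaded, concourse_workers_containers_avg should show up for the worker pods in the custom metrics API, under a path like /apis/custom.metrics.k8s.io/v1beta1/namespaces/concourse/pods/*/concourse_workers_containers_avg (which you can query directly with kubectl get --raw).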

This is where things got confusing for me: it wasn’t obvious what exactly Kubernetes expects from this custom metric. I found the following excerpt from the Kubernetes Horizontal Pod Autoscaling documentation to be the most helpful:

For per-pod resource metrics (like CPU), the controller fetches the metrics from the resource metrics API
for each Pod targeted by the HorizontalPodAutoscaler. Then, if a target utilization value is set, the controller
calculates the utilization value as a percentage of the equivalent resource request on the containers in each
Pod. If a target raw value is set, the raw metric values are used directly. The controller then takes the mean
of the utilization or the raw value (depending on the type of target specified) across all targeted Pods, and
produces a ratio used to scale the number of desired replicas.

For per-pod custom metrics, the controller functions similarly to per-pod resource metrics, except that it
works with raw values, not utilization values.
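
The key takeaway for our case: with a Pods-type metric and an AverageValue target, the controller averages the metric across the targeted worker pods and scales by the ratio of that average to the target, roughly desiredReplicas = ceil(currentReplicas × currentAverage / targetAverage). For example, if your workers are averaging 200 containers each against a target of 180 and you currently have 24 workers, the HPA would aim for ceil(24 × 200 / 180) = 27 workers.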

Enable autoscaling in Concourse

Now you’re finally ready to configure Concourse to create the HorizontalPodAutoscaler resource using the resulting concourse_workers_containers_avg metric. Since we installed Concourse via its helm chart, we just had to edit the values.yaml to add the following section:

concourse:
  ...
  worker:
    ...
    autoscaling:
      maxReplicas: 30
      minReplicas: 24
      customMetrics:
        - type: Pods
          pods:
            metric:
              name: concourse_workers_containers_avg
            target:
              type: AverageValue
              averageValue: 180

This creates a HorizontalPodAutoscaler (HPA) resource that keeps a minimum of 24 Concourse worker pods; when the average value of concourse_workers_containers_avg across the worker pods rises above the target of 180, the built-in scaling algorithm gradually adds pods to accommodate the load. Since maxReplicas is set to 30, there will never be more than 30 Concourse worker pods running with this HPA configuration.
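
Under the hood, the chart renders something roughly equivalent to the following HPA resource. This is a sketch, assuming the workers run as a StatefulSet named concourse-worker in the concourse namespace; the actual names and apiVersion depend on your release name, chart, and Kubernetes version:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: concourse-worker        # assumed name; the chart templates this
  namespace: concourse
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet           # the Concourse workers in our setup
    name: concourse-worker
  minReplicas: 24
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: concourse_workers_containers_avg
        target:
          type: AverageValue
          averageValue: "180"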

A few additional things to note

  • Set resource (CPU/memory) requests on your Concourse workers so that Kubernetes knows how much capacity each worker pod needs and can schedule it accordingly. If you’re using a managed Kubernetes service such as Google Kubernetes Engine (GKE), you should also enable autoscaling on the node pool that runs Concourse so that nodes are added and removed as worker pods scale up and down.
  • Set a reasonable value for terminationGracePeriodSeconds when configuring the Concourse workers, as this tells Kubernetes how long to wait for a worker to drain its current tasks and retire itself before terminating it. If you have pipelines with long-running jobs, you might have to sequester them to separate worker groups and opt those out of autoscaling, since we don’t want Kubernetes to terminate worker pods midway through running tasks. A values.yaml sketch of both of these settings follows below.
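
To make those last two points concrete, here’s the general shape of the worker settings in the chart’s values.yaml. The numbers are placeholders, and the exact keys (worker.resources, worker.terminationGracePeriodSeconds) should be checked against your chart version:

worker:
  # Explicit requests let the scheduler (and a node-pool autoscaler, e.g. on GKE)
  # place worker pods sensibly and add/remove nodes as the HPA scales them.
  resources:
    requests:
      cpu: 4            # placeholder; size these for your workloads
      memory: 16Gi
  # How long Kubernetes waits for a worker to drain and retire before killing it.
  terminationGracePeriodSeconds: 3600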