Kubernetes workers»

Spacelift provides a Kubernetes operator for managing worker pools. The operator also works on OpenShift. This operator allows you to:

Define WorkerPool resources in your cluster.
Scale these pools up and down using standard Kubernetes functionality.

Info

The Docker-in-Docker approach is no longer recommended. Use the Kubernetes operator instead. Please see the section on migrating from Docker-in-Docker for more information.

A WorkerPool defines the number of Workers registered with Spacelift via the poolSize parameter. The Spacelift operator will automatically create and register a number of Worker resources in Kubernetes depending on your poolSize.
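Because poolSize is an ordinary field on the WorkerPool spec, resizing a pool amounts to changing that one value. As a minimal sketch, the helper below builds a JSON merge patch for it. The name pool_size_patch is hypothetical and not part of the operator, and the pool name in the usage comment is a placeholder.

```shell
# pool_size_patch is a hypothetical helper: it prints a JSON merge patch
# that sets spec.poolSize to the number given as $1.
pool_size_patch() {
  printf '{"spec":{"poolSize":%d}}\n' "$1"
}

# Example usage against a pool named test-workerpool (placeholder):
#   kubectl patch workerpool test-workerpool --type merge -p "$(pool_size_patch 3)"
```

The same change can be made with kubectl scale, which the upgrade instructions on this page use to drain a pool.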

Idle resource usage»

Worker resources do not use up any cluster resources other than an entry in the Kubernetes API when they are idle.

Pods are created on demand for Workers when scheduling messages are received from Spacelift. This means that in an idle state, no additional resources are being used in your cluster other than what is required to run the controller component of the Spacelift operator.

Kubernetes version compatibility»

The Spacelift controller is compatible with Kubernetes v1.26 and later. The controller may also work with older versions, but we do not guarantee or provide support for unmaintained Kubernetes versions.

Installation»

Controller setup»

Kubectl / Helm

To install the worker pool controller along with its CRDs, run this command:

kubectl apply --server-side -f https://downloads.spacelift.io/kube-workerpool-controller/latest/manifests.yaml

Warning

It is important to use the --server-side flag here because our CRD definitions contain long field descriptions.

Kubernetes sets the kubectl.kubernetes.io/last-applied-configuration annotation, and the size of the CRD exceeds the maximum size of an annotation field, which causes the apply to fail (as detailed in this kubebuilder issue).

Tip

You can download the manifests yourself from https://downloads.spacelift.io/kube-workerpool-controller/latest/manifests.yaml if you would like to inspect them or alter the Deployment configuration for the controller.

Install the controller using the official spacelift-workerpool-controller Helm chart.

helm repo add spacelift https://downloads.spacelift.io/helm
helm repo update
helm upgrade spacelift-workerpool-controller spacelift/spacelift-workerpool-controller --install --namespace spacelift-worker-controller-system --create-namespace

You can open values.yaml from the Helm chart repository for more customization options.

Upgrading from chart versions prior to v0.58.0

Starting with v0.58.0, the Helm chart manages CRDs via a subchart instead of the crds/ directory. Before upgrading from an older version, you must label and annotate each existing CRD so Helm can adopt them:

for crd in workerpools.workers.spacelift.io workers.workers.spacelift.io; do \
  kubectl label crd "${crd}" 'app.kubernetes.io/managed-by=Helm' && \
  kubectl annotate crd "${crd}" 'meta.helm.sh/release-name=spacelift-workerpool-controller' && \
  kubectl annotate crd "${crd}" 'meta.helm.sh/release-namespace=spacelift-worker-controller-system'
done

Failure to complete this step before upgrading will result in Helm conflicts with the pre-existing CRD resources.

Prometheus metrics

The controller chart also includes a subchart for our prometheus-exporter project, which exposes metrics in the OpenMetrics format. This is useful for scaling workers based on queue length in Spacelift (the spacelift_worker_pool_runs_pending metric).

To install the controller with the prometheus-exporter subchart, use the following command:

helm upgrade spacelift-workerpool-controller spacelift/spacelift-workerpool-controller --install --namespace spacelift-worker-controller-system --create-namespace \
--set spacelift-promex.enabled=true \
--set spacelift-promex.apiEndpoint="https://{yourAccount}.app.spacelift.io" \
--set spacelift-promex.apiKeyId="{yourApiToken}" \
--set spacelift-promex.apiKeySecretName="spacelift-api-key"

Read more about the exporter in its repository, and see more configuration options in the subchart's values.yaml file.

OpenShift»

If you are using OpenShift, additional steps are needed for the controller to run properly.

Get the controller's service account name:

kubectl get serviceaccounts -n {namespace_of_controller}

Add the anyuid security context constraint to the service account:

oc adm policy add-scc-to-user anyuid -z {service_account_name} -n {namespace_of_controller} --as system:admin

Create a Spacelift Admin role in the namespace where your worker pods will run (this may be different from the namespace you installed the controller into):

oc create role spacelift-admin --verb='*' --resource='*' -n {namespace_of_worker_pods}

Bind the role to the controller's service account:

oc create rolebinding spacelift-admin-binding --role=spacelift-admin --serviceaccount={namespace_of_controller}:{service_account_name} -n {namespace_of_worker_pods}

Ensure the controller can indeed use the Kubernetes API in the namespace where your worker pods will run:

oc auth can-i '*' '*' --as=system:serviceaccount:{namespace_of_controller}:{service_account_name} -n {namespace_of_worker_pods}

This should return yes if everything is set up correctly.

Restart the worker controller pod to make sure it picks up the new permissions:

kubectl rollout restart deployments -n {namespace_of_controller}

Create a WorkerPool»

We recommend deploying worker pools with auto-registration.

With OIDC secret configuration, you can also avoid storing static Spacelift credentials in the cluster.

If you don't want to use auto-registration, create the WorkerPool manually in Spacelift and save its secrets on the cluster.

Auto-registration»

With auto-registration, the controller automatically creates and manages worker pools in Spacelift without requiring manual setup steps in the UI.

When you create a WorkerPool resource without token and privateKey credentials, the controller handles the complete lifecycle: it registers the pool with Spacelift, generates the required credentials, stores them securely in Kubernetes secrets, and manages ongoing operations.

This approach enables true GitOps workflows where worker pools can be provisioned declaratively alongside other infrastructure. There's no need to coordinate between the Spacelift UI and your Kubernetes deployment, eliminating potential errors and simplifying automation.

Warning

When using auto-registration, you cannot update or reset the worker pool from the Spacelift UI. This makes it obvious that the pool is managed from the cluster, and avoids conflicts by enforcing a single source of truth.

Create an API key»

For auto-registration to work, you need to create a Spacelift API key to allow the controller to manage worker pools in Spacelift.

This key should be granted the Worker pool controller role for the space that your worker pool will be created in, and needs to be stored in a secret called spacelift-api-credentials in the same namespace as the Kubernetes controller (by default spacelift-worker-controller-system).

Regular API key / OIDC API key

Create a secret-based API key with the "Worker pool controller" system role and assign it to the space(s) where you want to create worker pools.

Grant your API key(s) role-based access to Spacelift.

After creating the key, store the credentials in a Kubernetes secret in the controller's namespace:

kubectl create secret generic spacelift-api-credentials \
  --from-literal=keyId=<your-api-key-id> \
  --from-literal=keySecret=<your-api-key-secret> \
  --from-literal=endpoint=https://<your-account>.app.spacelift.io \
  --namespace spacelift-worker-controller-system

Create an OIDC-based API key configured to trust your cluster's OIDC provider, with the "Worker pool controller" system role assigned to the appropriate space(s). For detailed OIDC integration setup, see the OIDC documentation.

After creating the key, store only the key ID and endpoint:

kubectl create secret generic spacelift-api-credentials \
  --from-literal=keyId=<your-oidc-api-key-id> \
  --from-literal=endpoint=https://<your-account>.app.spacelift.io \
  --namespace spacelift-worker-controller-system

The controller will use its service account's OIDC token to authenticate with Spacelift. The OIDC API key Client ID (audience) must exactly match the token's aud claim, which can vary by Kubernetes platform.

Verifying token claims

To inspect the actual token claims used by the controller, run:

TOKEN="$(kubectl exec -n spacelift-worker-controller-system <controller-pod> -- cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
echo "$TOKEN" | cut -d. -f2 | base64 -d 2>/dev/null | jq '{iss,aud,sub}'

The aud value shown must match the Client ID configured in your Spacelift OIDC API key.
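JWT segments are base64url-encoded with the padding stripped, so a plain base64 -d can fail or produce truncated output for some tokens. The sketch below shows one way to normalize the payload before decoding. The function name jwt_payload is hypothetical, and it assumes a base64 binary that accepts -d, as the command above already does.

```shell
# jwt_payload is a hypothetical helper, not part of any Spacelift tooling.
# It prints the decoded payload (claims) segment of the JWT passed as $1.
jwt_payload() {
  # Take the second dot-separated segment and map base64url characters to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that JWT encoding strips.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Example usage with the TOKEN variable captured above:
#   jwt_payload "$TOKEN" | jq '{iss,aud,sub}'
```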

Info

The cluster OIDC endpoint should be reachable from Spacelift. Make sure your ingress network configuration allows that.

EKS OIDC Setup example»

To configure OIDC authentication for EKS, you need the cluster's OIDC issuer URL, which you can retrieve with:

aws eks describe-cluster --name <cluster-name> --query "cluster.identity.oidc.issuer" --output text

This returns a URL like https://oidc.eks.eu-central-1.amazonaws.com/id/123451234512345123451234512345.

When creating the Spacelift OIDC API key, use:

Issuer: The OIDC issuer URL from above
Client ID (audience): https://kubernetes.default.svc
Subject Expression: ^system:serviceaccount:NAMESPACE:SERVICE_ACCOUNT_NAME$ (replace NAMESPACE and SERVICE_ACCOUNT_NAME with yours)

The controller's service account token contains these claims, allowing it to authenticate with Spacelift without any static credentials.
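Before saving the key, it can help to check locally that your subject expression matches the sub claim printed by the token inspection command. The sketch below uses grep -E as a stand-in. It assumes Spacelift evaluates the subject expression as a regular expression (the ^ and $ anchors in the example suggest this), but its engine may not behave identically to grep. The function name and the service account name my-controller-sa are placeholders.

```shell
# matches_subject PATTERN SUBJECT: succeeds when SUBJECT matches PATTERN.
# grep -E is only a local approximation of how the expression is evaluated.
matches_subject() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Placeholder values: substitute your controller's namespace and service account name.
pattern='^system:serviceaccount:spacelift-worker-controller-system:my-controller-sa$'
matches_subject "$pattern" 'system:serviceaccount:spacelift-worker-controller-system:my-controller-sa' \
  && echo "subject matches"
```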

Create WorkerPool»

To create an auto-registered worker pool, deploy a WorkerPool resource without the token and privateKey fields:

kubectl apply -f - <<EOF
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: auto-registered-pool
spec:
  poolSize: 2
EOF

You can refer to the WorkerPool CRD for all optional fields. Some fields are specific to auto-registration and configure how your pool is set up in Spacelift.
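As an illustration of the auto-registration fields, the sketch below writes a manifest that sets space and description, two of the fields the configuration reference on this page marks as applying only to auto-registered pools. The file name, pool name, space ID, and description are placeholder values.

```shell
# Write an auto-registered WorkerPool manifest that also sets space and description.
# Note that token and privateKey are deliberately omitted.
cat > auto-registered-pool.yaml <<'EOF'
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: auto-registered-pool
spec:
  poolSize: 2
  # The two fields below only apply to auto-registered pools.
  space: production-01ARZ3NDEKTSV4RRFFQ69G5FAV
  description: Production worker pool for infrastructure deployments
EOF

# Review the file, then apply it once the spacelift-api-credentials secret exists:
#   kubectl apply -f auto-registered-pool.yaml
```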

Manual Registration»

While auto-registration is recommended, you can manually create the WorkerPool. Create a worker pool in the Spacelift UI, get credentials for it, and configure them in Kubernetes.

Create a Secret»

Create a Secret containing the private key and token for your worker pool, generated earlier in this guide.

First, export the token and private key as base64-encoded strings:

macOS / Linux

# macOS
export SPACELIFT_WP_TOKEN=$(cat ./your-workerpool-config-file.config)
export SPACELIFT_WP_PRIVATE_KEY=$(cat ./your-private-key.pem | base64 -b 0)

# Linux
export SPACELIFT_WP_TOKEN=$(cat ./your-workerpool-config-file.config)
export SPACELIFT_WP_PRIVATE_KEY=$(cat ./your-private-key.pem | base64 -w 0)

Then, create the secret.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: test-workerpool
type: Opaque
stringData:
  token: ${SPACELIFT_WP_TOKEN}
  privateKey: ${SPACELIFT_WP_PRIVATE_KEY}
EOF

Create a WorkerPool»

Finally, create a WorkerPool resource using this command:

kubectl apply -f - <<EOF
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  poolSize: 2
  token:
    secretKeyRef:
      name: test-workerpool
      key: token
  privateKey:
    secretKeyRef:
      name: test-workerpool
      key: privateKey
EOF

Info

You can deploy the controller globally (the default option) to monitor all namespaces, allowing worker pools in multiple namespaces, or restrict it to specific namespaces using the namespaces option in the Helm chart values.

The namespace of the controller and workers themselves doesn’t impact functionality.

The workers in your pool should connect to Spacelift, and you should be able to trigger runs.
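One way to check the result from the cluster side is to list the resources the operator created. The sketch below wraps that in a small function. The name list_pool_resources is hypothetical, and the resource names are taken from the CRD names used in the Helm upgrade instructions on this page. With poolSize: 2 you would expect one WorkerPool and two Worker resources.

```shell
# list_pool_resources is a hypothetical helper: it lists WorkerPool and Worker
# resources in the namespace given as $1, using the fully qualified CRD names.
list_pool_resources() {
  kubectl get workerpools.workers.spacelift.io,workers.workers.spacelift.io -n "$1"
}

# Example usage (the namespace is a placeholder):
#   list_pool_resources default
```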

Upgrade»

Upgrading the controller usually requires no special steps.

Some controller releases may include backward-incompatible changes. The sections below explain how to upgrade to those versions.

Upgrading to controller v0.0.27 - or Helm chart v0.52.0»

This release introduces auto-registration support and requires additional RBAC permissions for the controller to manage secrets. The upgrade is backward compatible with existing worker pools.

If you installed the controller using the Helm chart, the RBAC permissions are automatically updated during the upgrade. The same applies if you installed using kubectl with raw manifests. The updated permissions are included in the manifests.

No action is required for existing manually registered worker pools, and they will continue to work exactly as before.

The new auto-registration feature is opt-in, and only activates when the following conditions are both true:

You create a WorkerPool resource without specifying the token and privateKey.
You provide a spacelift-api-credentials secret in the same namespace as your controller containing your API credentials.

Upgrading to controller v0.0.17 - or Helm chart v0.33.0»

This release changes the way the controller exposes metrics by removing usage of the kube-rbac-proxy container.

You can find more context about the reason for this change in the Kubebuilder repository.

Kubectl / Helm

If the controller was installed from the compiled Kubernetes manifests using kubectl apply -f ..., you should first uninstall the current release before deploying the new one.

Warning

The command below removes the CRDs, and therefore also removes your WorkerPool resources from the cluster. Before running it, make sure that you can recreate them after the upgrade.

# Scale all your worker pools down to zero, and make sure no Worker resources remain in the cluster.
# Otherwise the kubectl delete command below will hang, and you'll have to remove finalizers from the Workers by hand.
kubectl scale workerpool/${WORKERPOOL_NAME} --replicas 0
# If you're not running v0.0.16, change the version to the one currently deployed in your cluster.
kubectl delete -f https://downloads.spacelift.io/kube-workerpool-controller/v0.0.16/manifests.yaml

Then you can install the new controller version with this command.

kubectl apply -f https://downloads.spacelift.io/kube-workerpool-controller/v0.0.17/manifests.yaml

The CRDs have been updated in this version, and Helm does not update CRDs for us. Before upgrading to the latest version of the chart, run these commands to upgrade the CRDs.

kubectl apply -f https://raw.githubusercontent.com/spacelift-io/spacelift-helm-charts/refs/tags/v0.33.0/spacelift-workerpool-controller/crds/worker-crd.yaml
kubectl apply -f https://raw.githubusercontent.com/spacelift-io/spacelift-helm-charts/refs/tags/v0.33.0/spacelift-workerpool-controller/crds/workerpool-crd.yaml

Once done, you can upgrade the chart as usual with helm upgrade.

Run containers»

When a run assigned to a Kubernetes worker is scheduled by Spacelift, the worker pool controller creates a new pod to process the run. This pod consists of the following containers:

init: Responsible for populating the workspace for the run.
launcher-grpc: Runs a gRPC server used by the worker for certain tasks like uploading the workspace between run stages, and notifying the worker when a user has requested that the run be stopped.
worker: Executes your run.

The init and launcher-grpc containers use the public.ecr.aws/spacelift/launcher:<version> container image published by Spacelift. By default, the Spacelift backend sends the correct value for <version> through to the controller for each run, guaranteeing that the run is pinned to a specific image version that is compatible with the Spacelift backend.

The worker container uses the runner image specified by your Spacelift stack.

Warning

You can use the spec.pod.launcherImage configuration option to pin the init and launcher-grpc containers to a specific version. However, we don't recommend doing this because your run pods could become incompatible with the Spacelift backend as new versions are released.

Resource usage»

Kubernetes controller»

During normal operations the worker pool controller CPU and memory usage should be fairly stable. The main operation that can be resource intensive is scaling out a worker pool. Scaling up involves generating an RSA keypair for each worker, and is CPU-bound. If you notice performance issues when scaling out, try giving the controller more CPU.

Run pods»

Resource requests and limits for the init, launcher-grpc, and worker containers can be set via your WorkerPool definitions:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-pool
spec:
  poolSize: 2
  token:
    secretKeyRef:
      name: pool-credentials
      key: token
  privateKey:
    secretKeyRef:
      name: pool-credentials
      key: privateKey
  pod:
    initContainer:
      resources:
        requests:
          cpu: 500m
          memory: 200Mi
        # Please note: we recommend being very cautious when adding resource limits
        # to your containers. Setting too low a limit on the init container can cause
        # runs to fail during the preparing phase.
        # limits:
        #   cpu: 100m
        #   memory: 50Mi
    grpcServerContainer:
      resources:
        requests:
          cpu: 500m
          memory: 200Mi
        # Setting too low limits on the grpc server container can cause runs to fail
        # when moving into the unconfirmed stage, as well as problems like not being
        # able to stop/cancel runs.
        # limits:
        #   cpu: 100m
        #   memory: 50Mi
    workerContainer:
      resources:
        requests:
          cpu: 500m
          memory: 200Mi
        # Setting too low limits on the worker container can cause problems executing
        # your IaC tool (e.g. OpenTofu, Terraform, etc), causing runs to fail during
        # planning, applying or destroying phases.
        # limits:
        #   cpu: 500m
        #   memory: 200Mi

You can use the example values above to get started, but the exact values you need for your pool will depend on your individual circumstances. You should use monitoring tools to adjust to values that make the most sense.

Warning

In general, we don't suggest setting very low CPU or memory limits for the init, grpc, or worker containers since doing so could affect the performance of runs, or even cause runs to fail if they are set too low.

In particular, the worker container's resource usage depends heavily on your workloads. For example, stacks with large numbers of Terraform resources may use more memory than smaller stacks.

Volumes»

There are two volumes that are always attached to your run pods:

The workspace.
The binaries cache.

Both of these volumes default to using emptyDir storage with no size limit. Spacelift workers will function correctly without using a custom configuration for these volumes, but there may be situations where you wish to change this default, for example:

To prevent Kubernetes evicting your run pods due to disk pressure (and therefore causing runs to fail).
To support caching tool binaries (for example Terraform or OpenTofu) between runs.

Workspace volume»

The workspace volume is used to store the temporary workspace data needed for processing a run. This includes metadata about the run, along with your source code. The workspace volume does not need to be shared or persisted between runs, and for that reason we recommend using an ephemeral volume so that the volume is bound to the lifetime of the run, and will be destroyed when the run pod is deleted.

The workspace volume can be configured via the spec.pod.workspaceVolume property, which accepts a standard Kubernetes volume definition. Here's an example of using an ephemeral AWS GP2 volume for storage:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: worker-pool
spec:
  poolSize: 1
  privateKey:
    secretKeyRef:
      key: privateKey
      name: pool-credentials
  token:
    secretKeyRef:
      key: token
      name: pool-credentials
  pod:
    securityContext:
      # The fsGroup may or may not be required depending on your volume type. The reason for
      # specifying it is because the containers in the run pods run as the Spacelift (UID 1983)
      # user. Depending on the volume type in use, you may experience permission errors during
      # runs if the fsGroup is not specified.
      fsGroup: 1983

    # The workspaceVolume property is used to specify the volume to use for the run's workspace.
    workspaceVolume:
      name: workspace
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
            storageClassName: gp2

Binaries cache volume»

The binaries cache volume is used to cache binaries (e.g. terraform and kubectl) across multiple runs. You can use an ephemeral volume for the binaries cache like with the workspace volume, but doing so will not result in any caching benefits. To be able to share the binaries cache with multiple run pods, you need to use a volume type that supports ReadWriteMany, for example AWS EBS EC2 Multi-Attach.

To configure the binaries cache volume, use exactly the same approach as with the workspace volume. The only difference is that you should use the spec.pod.binariesCacheVolume property instead of spec.pod.workspaceVolume.
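As a rough sketch of what that can look like, the manifest below points binariesCacheVolume at an existing PersistentVolumeClaim. Everything specific here is an assumption for illustration: the file name, the volume name binaries-cache, and the claim name spacelift-binaries-cache, which stands for a PVC you have already created on storage that supports ReadWriteMany.

```shell
# Write a WorkerPool manifest whose binaries cache is backed by a shared PVC.
cat > pool-with-binaries-cache.yaml <<'EOF'
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: worker-pool
spec:
  poolSize: 2
  token:
    secretKeyRef:
      name: pool-credentials
      key: token
  privateKey:
    secretKeyRef:
      name: pool-credentials
      key: privateKey
  pod:
    binariesCacheVolume:
      # Assumed volume name, chosen for illustration.
      name: binaries-cache
      persistentVolumeClaim:
        # Hypothetical pre-existing PVC backed by ReadWriteMany storage.
        claimName: spacelift-binaries-cache
EOF

# Review the file, then apply it:
#   kubectl apply -f pool-with-binaries-cache.yaml
```

Depending on the volume type, you may also need the fsGroup setting shown in the workspace volume example so that the Spacelift user can write to the cache.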

Custom volumes»

See configuration for more details on how to configure these two volumes along with any additional volumes you require.

Configuration»

The following example shows all the configurable options for a WorkerPool:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  # name defines the name of the pool in Kubernetes - does not need to match the name in Spacelift.
  name: test-workerpool
spec:
  # poolSize specifies the current number of Workers that belong to the pool.
  # Optional, defaults to 1 if not provided.
  poolSize: 2

  # token points at a Kubernetes Secret key containing the worker pool token.
  # Optional - required for manual registration, omit for auto-registration.
  # Must be set together with privateKey, or both must be unset.
  token:
    secretKeyRef:
      name: test-workerpool
      key: token

  # privateKey points at a Kubernetes Secret key containing the worker pool private key.
  # Optional - required for manual registration, omit for auto-registration.
  # Must be set together with token, or both must be unset.
  privateKey:
    secretKeyRef:
      name: test-workerpool
      key: privateKey

  # commsProtocol selects the worker transport: "mqtt" (default) or "poll" (HTTP long-poll).
  # See "Worker communication" below.
  # Optional, defaults to mqtt.
  commsProtocol: poll

  # pollCommsURL is the base URL of the Spacelift server. Required when commsProtocol is "poll".
  # Optional
  pollCommsURL: https://app.spacelift.io

  # space allows you to specify which Spacelift space to create the pool in.
  # Only applies to auto-registered pools.
  # Optional
  space: production-01ARZ3NDEKTSV4RRFFQ69G5FAV

  # description sets a description for the worker pool.
  # Useful for documentation and organization in the Spacelift UI.
  # Only applies to auto-registered pools.
  # Optional
  description: Production worker pool for infrastructure deployments

  # driftDetectionRunLimits configures limits for drift detection runs executed on this worker pool.
  # Only applies to auto-registered pools.
  # Optional
  driftDetectionRunLimits:
    # disabled indicates whether drift detection runs are disabled for this worker pool.
    # When true, maxRuns must not be set.
    # Optional, defaults to false
    disabled: false
    # maxRuns specifies the maximum number of drift detection runs allowed.
    # Cannot be set when disabled is true.
    # Optional when disabled is false
    maxRuns: 5

  # allowedRunnerImageHosts defines the hostnames of registries that are valid to use stack
  # runner images from. If not specified, images from any registry are allowed.
  # Optional
  allowedRunnerImageHosts:
    - docker.io
    - some.private.registry

  # Pod history management configuration
  # These settings control how many completed pods are retained and for how long

  # successfulPodsHistoryLimit specifies the number of successful Pods to keep for inspection purposes.
  # When set to a positive number, only the N most recent successful Pods are kept per worker.
  # When set to 0, all successful Pods are removed immediately.
  # When unset and successfulPodsHistoryTTL is also unset, defaults to 0 (remove all).
  # When unset but successfulPodsHistoryTTL is set, count-based cleanup is disabled (TTL-only).
  # Optional
  successfulPodsHistoryLimit: 0

  # failedPodsHistoryLimit specifies the number of failed Pods to keep for debugging purposes.
  # When set to a positive number, only the N most recent failed Pods are kept per worker.
  # When set to 0, all failed Pods are removed immediately.
  # When unset and failedPodsHistoryTTL is also unset, defaults to 5 (keep 5 most recent).
  # When unset but failedPodsHistoryTTL is set, count-based cleanup is disabled (TTL-only).
  # Optional
  failedPodsHistoryLimit: 5

  # successfulPodsHistoryTTL specifies the duration to keep successful Pods after they are created.
  # When set, successful Pods that have been created for longer than this duration are removed.
  # The TTL timer starts from Pod creation time for consistency with history limit ordering.
  # Running pods are never affected by this TTL.
  # When unset (nil), no time-based cleanup is performed for successful Pods.
  # This works in combination with successfulPodsHistoryLimit - pods are removed if they exceed EITHER limit.
  # Optional
  successfulPodsHistoryTTL: "24h"

  # failedPodsHistoryTTL specifies the duration to keep failed Pods after they are created.
  # When set, failed Pods that have been created for longer than this duration are removed.
  # The TTL timer starts from Pod creation time for consistency with history limit ordering.
  # Running pods are never affected by this TTL.
  # When unset (nil), no time-based cleanup is performed for failed Pods.
  # This works in combination with failedPodsHistoryLimit - pods are removed if they exceed EITHER limit.
  # Optional
  failedPodsHistoryTTL: "72h"

  # pod contains the spec of Pods that will be created to process Spacelift runs. This allows
  # you to set things like custom resource requests and limits, volumes, and service accounts.
  # Most of these settings are just standard Kubernetes Pod settings and are not explicitly
  # explained below unless they are particularly important or link directly to a Spacelift
  # concept.
  # Optional
  pod:
    # activeDeadlineSeconds defines the length of time in seconds before which the Pod will
    # be marked as failed. This can be used to set a deadline for your runs. The default is
    # 70 minutes.
    activeDeadlineSeconds: 4200

    terminationGracePeriodSeconds: 30

    # volumes allows additional volumes to be attached to the run Pod. This is an array of
    # standard Kubernetes volume definitions.
    volumes: []

    # binariesCacheVolume is a special volume used to cache binaries like tool downloads (e.g.
    # terraform, kubectl, etc). These binaries can be reused by multiple runs, and potentially
    # by multiple workers in your pool. To support this you need to use a volume type that
    # can be read and written to by multiple Pods at the same time.
    # It's always mounted in the same path: /opt/spacelift/binaries_cache
    binariesCacheVolume: null

    # workspaceVolume is a special volume shared between the init containers and the worker container.
    # Used to populate the workspace with the repository content.
    # It's always mounted in the same path: /opt/spacelift/workspace
    # IMPORTANT: when using a custom value for this volume bear in mind that data stored in it is sensitive.
    # We recommend that you make sure this volume is ephemeral and is not shared with other pods.
    workspaceVolume: null

    # DefaultAnsibleRunnerImage overrides the default runner image for Ansible runs.
    # When set, this image will be used instead of the backend-provided runner image
    # for Ansible stacks.
    # Default: public.ecr.aws/spacelift/runner-ansible:latest
    defaultAnsibleRunnerImage: "my-custom-ansible-image:latest"


    # DefaultRunnerImage overrides the default runner image for non-Ansible runs.
    # When set, this image will be used instead of the backend-provided runner image
    # for Terraform and other non-Ansible stacks.
    # Default: public.ecr.aws/spacelift/runner-terraform:latest
    defaultRunnerImage: "my-custom-image:latest"

    serviceAccountName: "custom-service-account"
    automountServiceAccountToken: true
    securityContext: {}
    imagePullSecrets: []
    nodeSelector: {}
    nodeName: ""
    affinity: {}
    schedulerName: ""
    tolerations: []
    hostAliases: []
    dnsConfig: {}
    runtimeClassName: ""
    topologySpreadConstraints: []
    labels: {}
    annotations: {}

    # customBinariesPath allows you to add additional directories to the start of the path used
    # by the worker. This allows you to do things like use a custom tool version provided on the
    # runner image instead of the version downloaded by Spacelift.
    customBinariesPath: ""

    # customInitContainers allows you to define a list of custom init containers to run before the built-in init container.
    customInitContainers: []

    # launcherImage allows you to customize the container image used by the init and gRPC server
    # containers. NOTE that by default the correct image is sent through to the controller
    # from the Spacelift backend, ensuring that the image used is compatible with the current
    # version of Spacelift.
    #
    # You can use this setting if you want to use an image stored in a container registry that
    # you control, but please note that doing so may cause incompatibilities between run containers
    # and the Spacelift backend, and we do not recommend this.
    launcherImage: ""

    # initContainer defines the configuration for the container responsible for preparing the
    # workspace for the worker. This includes downloading source code, performing role assumption,
    # and ensuring that the correct tools are available for your stack amongst other things.
    # The container name is "init".
    initContainer:
      envFrom: []
      env: []
      volumeMounts: []
      resources:
        requests:
          # Standard resource requests
        limits:
          # Standard resource limits
        claims: []
      # SecurityContext defines the security options the container should be run with.
      # ⚠️ Overriding this field may cause unexpected behaviors and should be avoided as much as possible.
      # The operator is configured to run in a least-privileged context using UID/GID 1983. Running it as root may
      # lead to unexpected behavior. Use at your own risk.
      securityContext: {}

    # grpcServerContainer defines the configuration for the side-car container used by the
    # worker container for certain actions like uploading the current workspace, and being
    # notified of stop requests.
    # The container name is "launcher-grpc".
    grpcServerContainer:
      envFrom: []
      env: []
      volumeMounts: []
      resources:
        requests:
          # Standard resource requests
        limits:
          # Standard resource limits
        claims: []
      # SecurityContext defines the security options the container should be run with.
      # ⚠️ Overriding this field may cause unexpected behaviors and should be avoided as much as possible.
      # The operator is configured to run in a least-privileged context using UID/GID 1983. Running it as root may
      # lead to unexpected behavior. Use at your own risk.
      securityContext: {}

    # workerContainer defines the configuration for the container that processes the workflow
    # for your run. This container uses the runner image defined by your stack.
    workerContainer:
      envFrom: []
      env: []
      volumeMounts: []
      resources:
        requests:
          # Standard resource requests
        limits:
          # Standard resource limits
        claims: []
      # SecurityContext defines the security options the container should be run with.
      # ⚠️ Overriding this field may cause unexpected behaviors and should be avoided as much as possible.
      # The operator is configured to run in a least-privileged context using UID/GID 1983. Running it as root may
      # lead to unexpected behavior. Use at your own risk.
      securityContext: {}

    # additionalSidecarContainers allows you to add any custom container to the pod.
    # If an additional container is running a long-running process like a database or a daemon,
    # it will be terminated when the Spacelift run succeeds.
    additionalSidecarContainers:
      # Every entry of this array needs to follow the kubernetes container spec.
      - name: redis
        image: redis

Pod history management»

The Kubernetes operator provides flexible pod history management to control how long completed run pods are retained. This allows you to balance between debugging capabilities and resource usage.

Overview»

The pod history management system supports both count-based and time-based cleanup strategies that work together:

Count-based limits: Control how many completed pods to keep per worker.
Time-based cleanup (TTL): Automatically remove pods after a specified duration.
Combined strategy: Pods are removed when they exceed either the count limit or the time limit.

Default behavior»

Successful pods: Removed immediately (limit: 0).
Failed pods: Keep five most recent (limit: 5).
Time limits: No automatic TTL cleanup unless explicitly configured.

Configuration options»

Count-based limits»

spec:
  # Keep 3 most recent successful pods per individual worker
  successfulPodsHistoryLimit: 3

  # Keep 10 most recent failed pods per individual worker
  failedPodsHistoryLimit: 10

Time-based cleanup (TTL)»

spec:
  # Remove successful pods older than 24 hours
  successfulPodsHistoryTTL: "24h"

  # Remove failed pods older than 72 hours
  failedPodsHistoryTTL: "72h"
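Because the two strategies work together, both can be set on the same pool. The following is a minimal sketch that reuses the illustrative values from the two snippets above; the numbers are examples, not recommendations:

```yaml
spec:
  # Successful pods: keep at most 3 per worker, and none older than 24 hours
  successfulPodsHistoryLimit: 3
  successfulPodsHistoryTTL: "24h"

  # Failed pods: keep at most 10 per worker, and none older than 72 hours
  failedPodsHistoryLimit: 10
  failedPodsHistoryTTL: "72h"
```

With this configuration, a completed pod is removed as soon as it exceeds either the count limit or the time limit.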

Special modes»

Delete-all mode: Set limit to 0 to remove all pods immediately.

spec:
  successfulPodsHistoryLimit: 0  # Remove all successful pods immediately
  failedPodsHistoryLimit: 0      # Remove all failed pods immediately

TTL-only mode: Set TTL without limit for time-based cleanup only.

spec:
  # Only time-based cleanup, no count limits
  successfulPodsHistoryTTL: "48h"
  # successfulPodsHistoryLimit is intentionally unset

Behavior details»

Pod selection: Uses creation time for consistent ordering (oldest removed first).
Running pods: Never affected by cleanup, only applies to completed pods.
WorkerPool scope: Cleanup is managed centrally at the WorkerPool level across all workers in the pool.
Deletion safety: Pods already being deleted are excluded from counts.

Migration from keepSuccessfulPods»

The keepSuccessfulPods field has been deprecated since controller version v0.0.25 and Helm chart version 0.46.0, and has been removed in favor of the new pod history management system. If you previously used keepSuccessfulPods: true, set successfulPodsHistoryLimit to a positive value instead.
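As a rough sketch of the change, assuming the old field sat directly under spec in your manifest (check your own WorkerPool definition, as its location may differ), the migration could look like this. The value 5 is an arbitrary example:

```yaml
spec:
  # Before (deprecated field, location assumed):
  # keepSuccessfulPods: true

  # After: keep the 5 most recent successful pods per worker
  successfulPodsHistoryLimit: 5
```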

Configure a Docker daemon as a sidecar container»

If you need to have a Docker daemon running as a sidecar, you can follow the example below.

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  poolSize: 2
  pod:
    workerContainer:
      env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375
    additionalSidecarContainers:
      - image: docker:dind
        name: docker
        securityContext:
          privileged: true
        command:
          - docker-init
          - "--"
          - dockerd
          - "--host"
          - tcp://127.0.0.1:2375

Timeouts»

There are two types of timeouts that you can set:

Run: Causes the run to fail if its duration exceeds a defined duration.
Log output: Causes the run to fail if no logs have been generated for a defined duration.

Run timeout»

You need to configure two items: the activeDeadlineSeconds for the pod and the SPACELIFT_LAUNCHER_RUN_TIMEOUT for the worker container:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  pod:
    activeDeadlineSeconds: 3600
    workerContainer:
      env:
        - name: SPACELIFT_LAUNCHER_RUN_TIMEOUT
          value: 3600s # This is using the golang duration format, more info here https://pkg.go.dev/time#ParseDuration

Log output timeout»

You need to add a single environment variable to the worker container:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  pod:
    workerContainer:
      env:
        - name: SPACELIFT_LAUNCHER_LOGS_TIMEOUT
          value: 3600s # This is using the golang duration format, more info here https://pkg.go.dev/time#ParseDuration
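The two timeouts can be combined on the same pool. The sketch below uses illustrative values only (a two-hour run timeout and a ten-minute log output timeout); "2h" and "10m" are valid Go duration strings, and activeDeadlineSeconds is kept in step with the run timeout:

```yaml
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  pod:
    # 2 hours, expressed in seconds, matching the run timeout below
    activeDeadlineSeconds: 7200
    workerContainer:
      env:
        # Fail the run if it takes longer than 2 hours
        - name: SPACELIFT_LAUNCHER_RUN_TIMEOUT
          value: "2h"
        # Fail the run if no logs are produced for 10 minutes
        - name: SPACELIFT_LAUNCHER_LOGS_TIMEOUT
          value: "10m"
```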

Worker communication»

Kubernetes workers receive work from Spacelift over one of two transports: MQTT (the default) or HTTP long-poll. For an overview of the two transports and why you might prefer long-poll, see Worker communication in the worker pools guide.

Unlike Docker-based workers (which use the SPACELIFT_WORKER_COMMS_* environment variables), Kubernetes pools select the transport through the WorkerPool resource. To use HTTP long-poll, set spec.commsProtocol to poll and provide spec.pollCommsURL:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  poolSize: 2
  commsProtocol: poll

  pollCommsURL: https://app.spacelift.io

commsProtocol: set to poll to use HTTP long-poll instead of the default mqtt.
pollCommsURL: the base URL of the Spacelift server — the shared Spacelift domain, not your account-specific subdomain.

Use https://app.spacelift.io (not <your account name>.app.spacelift.io); for US-region accounts use https://app.us.spacelift.io.
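For a US-region account, the same WorkerPool would look like the following sketch, with only the pollCommsURL changed:

```yaml
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  poolSize: 2
  commsProtocol: poll
  # Shared US-region domain, not an account-specific subdomain
  pollCommsURL: https://app.us.spacelift.io
```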

Network configuration»

Your cluster configuration needs to be set up to allow the controller and the scheduled pods to reach the internet. This is required to listen for new jobs from the Spacelift backend and report back status and run logs.

You can find the necessary endpoints to allow in the Network Security section.

Initialization policies»

While worker-side initialization policies will work, Spacelift generally recommends approval policies instead.

Using an initialization policy is simple and requires three steps:

Create a ConfigMap containing your policy.
Attach the ConfigMap as a volume in the pod specification for your pool.
Add an environment variable to the init container, telling it where to read the policy from.

First, create your policy:

apiVersion: v1
kind: ConfigMap
metadata:
  name: test-workerpool-initialization-policy
data:
  initialization-policy.rego: |
    package spacelift

    deny["you shall not pass"] {
        false
    }

Next, create a WorkerPool definition, configuring the ConfigMap as a volume, and setting the custom env var:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  labels:
    app.kubernetes.io/name: test-workerpool
  name: test-workerpool
spec:
  poolSize: 2
  token:
    secretKeyRef:
      name: test-workerpool
      key: token
  privateKey:
    secretKeyRef:
      name: test-workerpool
      key: privateKey
  pod:
    volumes:
      # Here's where you attach the policy to the Pod as a volume
      - name: initialization-policy
        configMap:
          name: test-workerpool-initialization-policy
    initContainer:
      volumeMounts:
        # Here's where you mount it into the init container
        - name: initialization-policy
          mountPath: "/opt/spacelift/policies/initialization"
          readOnly: true
      env:
        # And here's where you specify the path to the policy
        - name: "SPACELIFT_LAUNCHER_RUN_INITIALIZATION_POLICY"
          value: "/opt/spacelift/policies/initialization/initialization-policy.rego"

Using VCS agents with Kubernetes workers»

Using VCS Agents with Kubernetes workers involves the same approach outlined in the VCS agents section. To configure your VCS agent environment variables in a Kubernetes WorkerPool, add them to the spec.pod.initContainer.env section, like in the following example:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-pool
spec:
  poolSize: 2
  token:
    secretKeyRef:
      name: test-pool
      key: token
  privateKey:
    secretKeyRef:
      name: test-pool
      key: privateKey
  pod:
    initContainer:
      env:
        - name: "SPACELIFT_PRIVATEVCS_MAPPING_NAME_0"
          value: "gitlab-pool"
        - name: "SPACELIFT_PRIVATEVCS_MAPPING_BASE_ENDPOINT_0"
          value: "https://gitlab.myorg.com"
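The _0 suffix in the variable names suggests that additional mappings use increasing indexes. The sketch below assumes that convention holds; the second pool name and endpoint are hypothetical, and you should confirm the naming against the VCS agents section before relying on it:

```yaml
spec:
  pod:
    initContainer:
      env:
        - name: "SPACELIFT_PRIVATEVCS_MAPPING_NAME_0"
          value: "gitlab-pool"
        - name: "SPACELIFT_PRIVATEVCS_MAPPING_BASE_ENDPOINT_0"
          value: "https://gitlab.myorg.com"
        # Assumed: a second mapping uses the next index (_1)
        - name: "SPACELIFT_PRIVATEVCS_MAPPING_NAME_1"
          value: "github-pool" # hypothetical pool name
        - name: "SPACELIFT_PRIVATEVCS_MAPPING_BASE_ENDPOINT_1"
          value: "https://github.myorg.com" # hypothetical endpoint
```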

Controller metrics»

The worker pool controller does not expose any metrics by default.

You can set the --metrics-bind-address=:8443 flag to enable them and activate the Prometheus endpoint. By default, the controller exposes metrics using HTTPS and a self-signed certificate. This endpoint is also protected using RBAC.

If you use the Helm chart to deploy the controller, you can use the built-in metrics reader role to grant access.

You may also want to use a valid certificate for production workloads. Mount your cert in the container to the following paths:

/tmp/k8s-metrics-server/serving-certs/tls.crt
/tmp/k8s-metrics-server/serving-certs/tls.key

You can also set the --metrics-secure=false flag to fully disable TLS on the metrics endpoint and ask the controller to export metrics using HTTP.

More information about metrics authentication and TLS config can be found on the kubebuilder docs.
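If you deploy from the raw manifests, the flag and certificate mount described above end up on the controller Deployment. The fragment below is a sketch of the relevant parts only, to be merged by hand into your existing manifest. The container name manager and the secret name controller-metrics-tls are assumptions; a secret of type kubernetes.io/tls provides the tls.crt and tls.key keys that the paths above expect:

```yaml
# Fragment of the controller Deployment (not a complete manifest)
spec:
  template:
    spec:
      containers:
        - name: manager # assumed container name, check your manifest
          args:
            # Add this flag alongside any arguments already present
            - --metrics-bind-address=:8443
          volumeMounts:
            - name: metrics-certs
              mountPath: /tmp/k8s-metrics-server/serving-certs
              readOnly: true
      volumes:
        - name: metrics-certs
          secret:
            secretName: controller-metrics-tls # hypothetical TLS secret
```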

More information about exposed metrics can be found by scraping the metrics endpoint, for example:

# HELP spacelift_workerpool_controller_worker_creation_duration_seconds Time in seconds needed to create a new worker
# TYPE spacelift_workerpool_controller_worker_creation_duration_seconds histogram
spacelift_workerpool_controller_worker_creation_duration_seconds_bucket{le="0.5"} 0
spacelift_workerpool_controller_worker_creation_duration_seconds_bucket{le="1"} 0
spacelift_workerpool_controller_worker_creation_duration_seconds_bucket{le="2"} 0
spacelift_workerpool_controller_worker_creation_duration_seconds_bucket{le="4"} 0
spacelift_workerpool_controller_worker_creation_duration_seconds_bucket{le="10"} 0
spacelift_workerpool_controller_worker_creation_duration_seconds_bucket{le="+Inf"} 0
spacelift_workerpool_controller_worker_creation_duration_seconds_sum 0
spacelift_workerpool_controller_worker_creation_duration_seconds_count 0
# HELP spacelift_workerpool_controller_worker_creation_errors_total Total number of worker creation errors
# TYPE spacelift_workerpool_controller_worker_creation_errors_total counter
spacelift_workerpool_controller_worker_creation_errors_total 0
# HELP spacelift_workerpool_controller_worker_idle_total Number of idle worker
# TYPE spacelift_workerpool_controller_worker_idle_total gauge
spacelift_workerpool_controller_worker_idle_total{pool_ulid="01JHFXXPDC6J8XM2VB0M9CS338"} 0
# HELP spacelift_workerpool_controller_worker_total Total number of workers
# TYPE spacelift_workerpool_controller_worker_total gauge
spacelift_workerpool_controller_worker_total{pool_ulid="01JHFXXPDC6J8XM2VB0M9CS338"} 2

Helm»

If you are using our Helm chart to deploy the controller, you can configure metrics by switching some boolean flags in values.yaml.

Check the links in the comments about how to secure your metrics endpoint.

# The metric service will expose a metrics endpoint that can be scraped by a prometheus instance.
# This is disabled by default, enable this if you want to enable controller observability.
metricsService:
  enabled: false
  # Enabling secure will also create ClusterRole to enable authn/authz to the metrics endpoint through RBAC.
  # More details here https://book.kubebuilder.io/reference/metrics#by-using-authnauthz-enabled-by-default
  # Secure is enabled by default to be consistent with Kubebuilder defaults.
  #
  # If you want to avoid cluster roles, you can set this to false and configure a NetworkPolicy instead.
  # An example can be found in Kubebuilder docs here https://github.com/kubernetes-sigs/kubebuilder/blob/d063d5af162a772379a761fae5aaea8c91b877d4/docs/book/src/getting-started/testdata/project/config/network-policy/allow-metrics-traffic.yaml#L2
  secure: true
  enableHTTP2: false
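For example, a values override that turns the metrics endpoint on while keeping the secure defaults shown above might look like this sketch:

```yaml
metricsService:
  # Expose the metrics endpoint so Prometheus can scrape it
  enabled: true
  # Keep HTTPS and RBAC-based authn/authz (the default)
  secure: true
  enableHTTP2: false
```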

Custom binaries path»

Kubernetes workers download the spacelift-worker binary, along with any tools needed for your runs, and mount them into a directory called /opt/spacelift/binaries in the worker. To ensure that these tools are used, this directory is added to the start of the worker's path.

In some situations you may wish to use your own version of tools that are bundled with the runner image used for your stack. To support this, we provide a spec.pod.customBinariesPath option to allow you to customize this.

The following example shows how to configure this:

apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: test-workerpool
spec:
  poolSize: 2
  token:
    secretKeyRef:
      name: test-workerpool
      key: token
  privateKey:
    secretKeyRef:
      name: test-workerpool
      key: privateKey
  pod:
    customBinariesPath: "/bin" # This will result in "/bin:/opt/spacelift/binaries" being added to the start of the worker's path.

Autoscaling»

Be careful when scheduling the controller in the cluster. If the worker pool controller is evicted due to autoscaling or other reasons, it may miss MQTT messages and cause temporary run failures.

Therefore, we strongly recommend deploying the controller on nodes with high stability and availability.

EKS»

For EKS Auto Mode clusters, you can set the following Karpenter annotation on the controller pod:

Kubectl:

karpenter.sh/do-not-disrupt: "true"

Helm:

--set-string controllerManager.podAnnotations."karpenter\.sh/do-not-disrupt"="true"

For Standard clusters:

Kubectl:

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Helm:

--set-string controllerManager.podAnnotations."cluster-autoscaler\.kubernetes\.io/safe-to-evict"="false"

GKE»

For Autopilot clusters, you can set the following annotation on the controller pod:

Kubectl:

# Bear in mind that this will not 100% prevent autopilot from evicting pods.
# Please refer to autopilot documentation for more details.
autopilot.gke.io/priority: high

Helm:

--set-string controllerManager.podAnnotations."autopilot\.gke\.io/priority"="high"

For Standard clusters:

Kubectl:

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Helm:

--set-string controllerManager.podAnnotations."cluster-autoscaler\.kubernetes\.io/safe-to-evict"="false"

AKS»

For AKS clusters, you can set the following annotation on the controller pod:

Kubectl:

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Helm:

--set-string controllerManager.podAnnotations."cluster-autoscaler\.kubernetes\.io/safe-to-evict"="false"
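Whichever annotation applies to your cluster, it has to land on the controller pod rather than on the Deployment object itself. The sketch below shows both places it can be declared, using the cluster-autoscaler annotation as the example. The first document is a fragment of the controller Deployment, and the second is the values file equivalent of the --set-string flag:

```yaml
# Option 1: fragment of the controller Deployment manifest.
# Pod annotations go under spec.template.metadata, not the top-level metadata.
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
---
# Option 2: Helm values file, equivalent to the --set-string flag
controllerManager:
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

The annotation value must be a quoted string, which is why the Helm flag uses --set-string instead of --set.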

FIPS»

With Go 1.24, the Go runtime has added support for FIPS mode. This allows you to run your Spacelift workerpool-controller in a FIPS 140-3-compliant manner.

Note

The Go documentation notes that FIPS mode is best-effort and doesn't guarantee compliance with all requirements.

If you'd like to have the workerpool-controller run in FIPS mode, set the controllerManager.enforceFips140 flag to true in the Helm chart values. We introduced this in the v0.42.0 release of the Helm chart.

Helm:

You can set this in your values.yaml file like so:

controllerManager:
  enforceFips140: true

Or pass it as a parameter: --set controllerManager.enforceFips140=true.

Kubernetes:

When deployed without Helm, you need to set the GODEBUG=fips140=only environment variable manually on the controller container. The commands to do this are:

# List all deployments to get the name of the controller deployment:
kubectl get deployments --all-namespaces

# Edit the deployment to add the environment variable:
kubectl edit deployment/<deployment-name> -n <namespace>
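When editing the deployment, the environment variable goes on the controller container in the pod template. A sketch of the relevant fragment follows; the container name manager is an assumption, so check the name used in your own Deployment:

```yaml
# Fragment of the controller Deployment (not a complete manifest)
spec:
  template:
    spec:
      containers:
        - name: manager # assumed container name
          env:
            # Restrict the Go runtime to FIPS 140-3 approved cryptography
            - name: GODEBUG
              value: fips140=only
```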

During the controller's startup, you should see the FIPS 140 mode {"enabled": true} message in the logs.

Note

This will only make the controller run in FIPS mode. The Spacelift worker pods are not affected by this setting as they are not compliant with FIPS 140-3 yet.

Supply custom certificates to worker pools»

You can add custom certificate authority (CA) certificates to your worker pools. We support adding them to the controller container and to the container that runs OpenTofu/Terraform.

Add certificates to controller container»

Ensure your custom certificate is pem-encoded and the file name ends in .pem.
Within the controller container, mount the certificate to /opt/spacelift/certs.

This example is for the controller Helm chart. If you're using a manifest, you will need to edit it directly.

controllerManager:
  manager:
    extraVolumeMounts:
      - name: ca-secret-volume
        mountPath: /opt/spacelift/certs
        readOnly: true
  extraVolumes:
    - name: ca-secret-volume
      secret:
        secretName: my-amazing-secret-with-something-dot-pem-inside-it

Add certificates to the OpenTofu/Terraform process»

Prepare your custom CA certificate.
- Save your CA certificate to a file on your computer (e.g., custom-ca.pem).

Create an extended CA bundle.

Download the existing CA cert bundle from the container and append your custom certificate:

# Run without -it: allocating a TTY would add carriage returns to the redirected output
docker run --rm public.ecr.aws/spacelift/runner-terraform:latest cat /etc/ssl/certs/ca-certificates.crt > new-bundle.crt && cat custom-ca.pem >> new-bundle.crt

Ensure new-bundle.crt has a newline at the end.

Create a Kubernetes secret.

kubectl create secret generic extended-ca-bundle --from-file=bundle=new-bundle.crt

Update your WorkerPool.

Delete your existing WorkerPool object (required due to immutable fields):

  kubectl delete workerpool workerpool -n spacelift-worker-pool-system

Create a new WorkerPool with the updated configuration, adjusting your poolSize and credential secret names as needed:

  apiVersion: workers.spacelift.io/v1beta1
  kind: WorkerPool
  metadata:
    name: workerpool
    namespace: spacelift-worker-pool-system
  spec:
    pod:
      volumes:
        - name: new-ca-bundle
          secret:
            secretName: extended-ca-bundle
            items:
              - key: bundle
                path: ca-certificates.crt
      workerContainer:
        volumeMounts:
          - name: new-ca-bundle
            mountPath: /etc/ssl/certs
    poolSize: 5
    privateKey:
      secretKeyRef:
        key: privateKey
        name: spacelift-worker-pool-credentials
    token:
      secretKeyRef:
        key: token
        name: spacelift-worker-pool-credentials

Scaling a pool»

To scale your WorkerPool, you can either edit the resource in Kubernetes, or use the kubectl scale command:

kubectl scale workerpools my-worker-pool --replicas=5

You can scale a Kubernetes worker pool up and down as needed (for example, based on queue depth using KEDA). This gives you more granular control over the number of provisioned workers you are billed for.
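As an illustration of driving the pool size from KEDA, the sketch below defines a ScaledObject that targets a WorkerPool. It swaps in KEDA's cron trigger instead of a queue-depth metric, because a queue-depth trigger depends on a metric source that this page does not describe. The pool name, schedule, time zone and replica counts are all hypothetical, and it assumes KEDA is already installed in the cluster:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-worker-pool-scaler # hypothetical name
  # Must be created in the same namespace as the WorkerPool
spec:
  scaleTargetRef:
    apiVersion: workers.spacelift.io/v1beta1
    kind: WorkerPool
    name: my-worker-pool
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    # Scale to 5 workers during working hours, fall back to the minimum otherwise
    - type: cron
      metadata:
        timezone: Europe/London
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "5"
```

This relies on the same scale mechanism that kubectl scale uses, so KEDA adjusts the pool size rather than managing pods directly.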

Billing for Kubernetes workers»

Kubernetes workers are billed based on the number of provisioned workers that you have, exactly the same as for any of our other ways of running workers. In practice, you will be billed based on the number of workers defined by the poolSize of your WorkerPool, even when those workers are idle and not processing any runs.

Migrating from Docker-in-Docker»

If you currently use Docker-in-Docker to run your worker pools, we recommend that you switch to our worker pool operator. For full details of how to install the operator and set up a worker pool, please see the installation section.

The rest of this section provides useful information to be aware of when switching over from the Docker-in-Docker approach to the operator.

Why migrate»

There are a number of improvements with the Kubernetes operator over the previous Docker-in-Docker approach, including:

The operator does not require privileged pods unlike the Docker-in-Docker approach.
The operator creates standard Kubernetes pods to handle runs. This provides advantages including Kubernetes being aware of the run workloads that are executing as well as the ability to use built-in Kubernetes functionality like service accounts and affinity.
The operator only creates pods when runs are scheduled. This means that while your workers are idle, they are not running pods that are using up resources in your cluster.
The operator can safely handle scaling down the number of workers in a pool while making sure that in-progress runs are not killed.

Deploying workers»

One major difference between Docker-in-Docker and the new operator is that the new approach only deploys the operator, and not any workers. To deploy workers you need to create WorkerPool resources after the operator has been deployed. See the section on creating a worker pool for more details.

Testing both alongside each other»

You can run both the new operator and your existing Docker-in-Docker workers. You can also connect both to the same Spacelift worker pool. This allows you to test the operator to make sure everything is working before switching over.

Customizing timeouts»

If you are currently using SPACELIFT_LAUNCHER_RUN_TIMEOUT or SPACELIFT_LAUNCHER_LOGS_TIMEOUT, please see the section on timeouts to configure timeouts with the operator.

Storage configuration»

If you are using custom storage volumes, you can configure these via the spec.pod section of the WorkerPool resource. Please see the section on volumes for more information.

Pool size»

In the Docker-in-Docker approach, the number of workers is controlled by the replicaCount value of the Chart which controls the number of replicas in the Deployment. In the operator approach, the pool size is configured by the spec.poolSize property. Please see the section on scaling for information about how to scale your pool up or down.

Troubleshooting»

Listing WorkerPools and workers»

To list all of your WorkerPools, you can use the following command:

kubectl get workerpools

To list all of your workers, use the following command:

kubectl get workers

To list the workers for a specific pool, use the following command (replace <worker-pool-id> with the ID of the pool from Spacelift):

kubectl get workers -l "workers.spacelift.io/workerpool=<worker-pool-id>"

Listing run pods»

When a run is scheduled, a new pod is created to process that run. A single worker can only process a single run at a time, making it easy to find pods by run or worker IDs.

To list the pod for a specific run, use the following command (replacing <run-id> with the ID of the run):

kubectl get pods -l "workers.spacelift.io/run-id=<run-id>"

To find the pod for a particular worker, use the following command (replacing <worker-id> with the ID of the worker):

kubectl get pods -l "workers.spacelift.io/worker=<worker-id>"

Workers not connecting to Spacelift»

If you have created a WorkerPool in Kubernetes but no workers have shown up in Spacelift, use kubectl get workerpools to view your pool:

kubectl get workerpools
NAME         DESIRED POOL SIZE   ACTUAL POOL SIZE
local-pool   2

If the actual pool size for your pool is not populated, it typically indicates an issue with your pool credentials. The first thing to do is to use kubectl describe to inspect your pool and check for any events indicating errors:

kubectl describe workerpool local-pool
Name:         local-pool
Namespace:    default
Labels:       app.kubernetes.io/name=local-pool
              workers.spacelift.io/ulid=01HPS9HDSWCQ73RPDTVAK0KK0A
Annotations:  <none>
API Version:  workers.spacelift.io/v1beta1
Kind:         WorkerPool

...

Events:
  Type     Reason                    Age              From                   Message
  ----     ------                    ----             ----                   -------
  Warning  WorkerPoolCannotRegister  7s (x2 over 7s)  workerpool-controller  Unable to register worker pool: cannot retrieve workerpool token: unable to base64 decode privateKey: illegal base64 data at input byte 4364

In the example above, we can see that the private key for the pool is invalid.

If the WorkerPool events don't provide any useful information, another option is to take a look at the logs for the controller pod using kubectl logs, for example:

kubectl logs -n spacelift-worker-controller-system spacelift-workerpool-controller-controller-manager-bd9bcb46fjdt

For example, if your token is invalid, you may find a log entry similar to the following:

cannot retrieve workerpool token: unable to base64 decode token: illegal base64 data at input byte 2580

Another common reason that can cause workers to fail to connect with Spacelift is network or firewall rules blocking the endpoint your workers use to communicate with Spacelift. Which endpoint that is depends on your worker transport:

HTTP long-poll: workers must be able to reach the Spacelift backend at your pollCommsURL.
MQTT via AWS IoT Core: workers must be able to reach the AWS IoT Core endpoint.

See our network security section for the full list of endpoints.

Run not starting»

If a run is scheduled to a worker but it gets stuck in the preparing phase for a long time, it may be caused by various issues like CPU or memory limits that are too low, or not being able to pull the stack's runner image. The best option in this scenario is to find the run pod and describe it to find out what's happening.

For example, in the following scenario, we can use kubectl get pods to discover that the run pod is stuck in ImagePullBackOff, meaning that it is unable to pull one of its container images:

$ kubectl get pods -l "workers.spacelift.io/run-id=01HPS6XB76J1JB3EHSK4AWE5AB"
NAME                                     READY   STATUS             RESTARTS   AGE
01hps6xb76j1jb3ehsk4awe5ab-preparing-2   1/2     ImagePullBackOff   0          3m2s

If we describe that pod, we can get more details about the failure:

$ kubectl describe pods -l "workers.spacelift.io/run-id=01HPS6XB76J1JB3EHSK4AWE5AB"
Name:             01hps6xb76j1jb3ehsk4awe5ab-preparing-2
Namespace:        default
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.18.0.2
Start Time:       Fri, 16 Feb 2024 15:00:18 +0000
Labels:           workers.spacelift.io/run-id=01HPS6XB76J1JB3EHSK4AWE5AB
                  workers.spacelift.io/worker=01HPS6K4BNB7BPHCDHDWFAMJNV

...

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m23s                  default-scheduler  Successfully assigned default/01hps6xb76j1jb3ehsk4awe5ab-preparing-2 to kind-control-plane
  Normal   Pulled     4m23s                  kubelet            Container image "public.ecr.aws/spacelift/launcher:d0a81de1085a7cc4f4561a776ab74a43d4497f6c" already present on machine
  Normal   Created    4m23s                  kubelet            Created container init
  Normal   Started    4m23s                  kubelet            Started container init
  Normal   Pulled     4m15s                  kubelet            Container image "public.ecr.aws/spacelift/launcher:d0a81de1085a7cc4f4561a776ab74a43d4497f6c" already present on machine
  Normal   Created    4m15s                  kubelet            Created container launcher-grpc
  Normal   Started    4m15s                  kubelet            Started container launcher-grpc
  Normal   Pulling    3m36s (x3 over 4m15s)  kubelet            Pulling image "someone/non-existent-image:1234"
  Warning  Failed     3m35s (x3 over 4m14s)  kubelet            Failed to pull image "someone/non-existent-image:1234": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/someone/non-existent-image:1234": failed to resolve reference "docker.io/someone/non-existent-image:1234": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     3m35s (x3 over 4m14s)  kubelet            Error: ErrImagePull
  Normal   BackOff    2m57s (x5 over 4m13s)  kubelet            Back-off pulling image "someone/non-existent-image:1234"
  Warning  Failed     2m57s (x5 over 4m13s)  kubelet            Error: ImagePullBackOff

In this case, the problem is that the someone/non-existent-image:1234 container image cannot be pulled, meaning that the run can't start. The fix would be to add the correct authentication to allow your Kubernetes cluster to pull the image, or to adjust your stack settings to refer to the correct image if it is wrong.

Similarly, if you specify memory limits that are too low for one of the containers in the run pod, Kubernetes may end up killing it. You can find this out in exactly the same way:

$ kubectl get pods -l "workers.spacelift.io/run-id=01HPS85J6SRG37DG6FGNRZGHMM"
NAME                                     READY   STATUS           RESTARTS   AGE
01hps85j6srg37dg6fgnrzghmm-preparing-2   0/2     Init:OOMKilled   0          24s

$ kubectl describe pods -l "workers.spacelift.io/run-id=01HPS85J6SRG37DG6FGNRZGHMM"
Name:             01hps85j6srg37dg6fgnrzghmm-preparing-2
Namespace:        default
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.18.0.2
Start Time:       Fri, 16 Feb 2024 15:22:17 +0000
Labels:           workers.spacelift.io/run-id=01HPS85J6SRG37DG6FGNRZGHMM
                  workers.spacelift.io/worker=01HPS7FRV3JJWWVJ1P9RQ7JN2N
Annotations:      <none>
Status:           Failed
IP:               10.244.0.14
IPs:
  IP:           10.244.0.14
Controlled By:  Worker/local-pool-01hps7frv3jjwwvj1p9rq7jn2n
Init Containers:
  init:
    Container ID:  containerd://567f505a638e0b42e23d275a5a1b75f40ac6b706490ada9ea7901219b54e43c8
    Image:         public.ecr.aws/spacelift-dev/launcher:2ff3b7ad1d532ca51b5b2c54ded40ad19669d379
    Image ID:      public.ecr.aws/spacelift-dev/launcher@sha256:baa99ca405f5c42cc16b5e93b5faa9467c8431c048f814e9623bdfee0bef8c4d
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/spacelift-launcher
    Args:
      init
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 16 Feb 2024 15:22:17 +0000
      Finished:     Fri, 16 Feb 2024 15:22:17 +0000

...
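In a case like this, where the init container is OOMKilled, one possible fix is to raise its memory limit through the resources section of the WorkerPool. The following is a minimal sketch; the values are illustrative only and should be sized for your own workloads:

```yaml
spec:
  pod:
    initContainer:
      resources:
        requests:
          memory: "256Mi"
        limits:
          # Raise this if the init container keeps getting OOMKilled
          memory: "512Mi"
```

The same resources structure is available on grpcServerContainer and workerContainer if one of those containers is the one being killed.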

Getting help with run issues»

If you're having trouble understanding why a run isn't starting, is failing, or is hanging, and want to reach out for support, please include the output of the following commands (replacing the relevant IDs/names as well as specifying the namespace of your worker pool):

kubectl get pods --namespace <worker-pool-namespace> -l "workers.spacelift.io/run-id=<run-id>"
kubectl describe pods --namespace <worker-pool-namespace> -l "workers.spacelift.io/run-id=<run-id>"
kubectl logs --namespace <worker-pool-namespace> -l "workers.spacelift.io/run-id=<run-id>" --all-containers --prefix --timestamps
kubectl events --namespace <worker-pool-namespace> workers/<worker-name> -o json

Please also include your controller logs from 10 minutes before the run started. You can do this using the --since-time flag, like in the following example:

kubectl logs -n spacelift-worker-controller-system spacelift-worker-controllercontroller-manager-6f974d9b6d-kx566 --since-time="2024-04-02T09:00:00Z" --all-containers --prefix --timestamps

Custom runner images»

Please note that if you are using a custom runner image for your stack, it must include a Spacelift user with a UID of 1983. If your image does not include this user, it can cause permission issues during runs, for example while trying to write out configuration files while preparing the run.

Please see our instructions on customizing the runner image for more information.

Networking issues caused by Pod identity»

When a run is assigned to a worker, the controller creates a new pod to process that run. The pod has labels indicating the worker, run, space and stack ID, and looks something like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    workers.spacelift.io/run-id: 01HN37WC3MCNE3CY9HAHWRF06K
    workers.spacelift.io/worker: 01HN356WGGNGTXA8PHYRRKEEZ5
    workers.spacelift.io/space-id: prod-space
    workers.spacelift.io/stack-id: prod-infra
  name: 01hn37wc3mcne3cy9hahwrf06k-preparing-2
  namespace: default
spec:
  ... rest of the pod spec

Because the set of labels is unique for each run being processed, this can cause problems with systems like Cilium that use pod labels to determine the identity of each pod, leading to your runs having networking issues. If you are using a system like this, you may want to exclude the workers.spacelift.io/* labels from being used to determine network identity.