Observability»

This guide provides recommendations for monitoring your Self-Hosted Spacelift installation to ensure it's running correctly. Proper monitoring helps identify potential issues before they impact your operations and ensures the reliability of your Spacelift infrastructure.

Metrics to Monitor»

Core Services»

The following table shows the core metrics that you should monitor for each of your Spacelift services:

Service         Metric             Description
Server          CPU usage          Processor utilization
                Memory usage       RAM consumption
Load balancer   Response time      Time to process API requests
                Error rate         Percentage of 5xx responses
Scheduler       CPU usage          Processor utilization
                Memory usage       RAM consumption
Drain           CPU usage          Processor utilization
                Memory usage       RAM consumption
Database        CPU usage          Processor utilization
                Memory usage       RAM consumption
                Connection count   Active DB connections
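
If your installation runs on AWS behind an Application Load Balancer, one way to watch the error rate is a CloudWatch alarm on the load balancer's 5xx count. The sketch below is a minimal example using boto3; the load balancer dimension, threshold, and SNS topic are placeholders, and a count-based alarm is only a proxy for a true percentage (computing an actual error rate would require CloudWatch metric math).

```python
# Minimal sketch: alarm on sustained 5xx responses from the Spacelift server,
# assuming an AWS deployment behind an Application Load Balancer.
# The dimension value, threshold, and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="spacelift-server-5xx",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",  # 5xx responses returned by the targets
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/spacelift-server/0123456789abcdef"}],
    Statistic="Sum",
    Period=300,                       # 5-minute windows
    EvaluationPeriods=3,              # sustained for 15 minutes
    Threshold=10,                     # example threshold, tune for your traffic
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no 5xx datapoints means no errors
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:example-alerts"],
)
```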

Message queues»

The Drain service uses a number of different message queues to perform asynchronous processing of certain operations. The main metric you should monitor for the message queues is the queue length. When the drain is operating correctly, messages should be processed very quickly and you should not expect to see large backlogs (hundreds of messages) for long periods of time.

One caveat is the webhooks queue. Because webhook processing involves making many requests to your source control system, individual messages can sometimes take several minutes to process, and small backlogs on this queue are not unusual. This is fine as long as messages are eventually processed and the queue length is not constantly increasing.

Queue length is easy to monitor for SQS-based message queues - use the ApproximateNumberOfMessagesVisible metric that SQS publishes to CloudWatch.
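
For a quick ad-hoc check, or as the basis for a custom exporter, the same information is available from the queue attributes via boto3. A minimal sketch, assuming AWS credentials are configured and using a placeholder queue URL:

```python
# Minimal sketch: poll the approximate backlog of an SQS queue with boto3.
# The queue URL is a placeholder - substitute the URL of your Spacelift queue.
import boto3

sqs = boto3.client("sqs")

def queue_backlog(queue_url: str) -> int:
    """Return the approximate number of visible (unprocessed) messages."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

if __name__ == "__main__":
    backlog = queue_backlog("https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue")
    print(f"visible messages: {backlog}")
```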

For the Postgres-based message queue, however, you will need telemetry enabled. When telemetry is enabled, we expose the following metrics:

  • postgres_queue.messages.sent (counter) - incremented when a message is sent to the queue.
  • postgres_queue.messages.received (counter) - incremented when a message is received from the queue.
  • postgres_queue.messages.changed_visibility (counter) - incremented when a message's visibility is changed.
  • postgres_queue.messages.deleted (counter) - incremented when a message is deleted from the queue.
  • postgres_queue.messages.total (gauge) - total number of messages in the queue.
  • postgres_queue.messages.visible (gauge) - number of visible messages in the queue.
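
If you export these metrics to a Prometheus-compatible backend, a simple check on the visible-messages gauge can catch a backlog that never drains. The sketch below is a minimal example; the Prometheus URL, the exported metric name (exporters typically rewrite dots to underscores), and the threshold are assumptions about your particular telemetry pipeline.

```python
# Minimal sketch: flag a sustained backlog on the Postgres-based queue.
# Assumes the telemetry metrics above are exported to a Prometheus-compatible
# backend; PROM_URL, the exported metric name, and the threshold are
# assumptions - check what your pipeline actually publishes.
import requests

PROM_URL = "http://prometheus.internal:9090"  # placeholder
QUERY = "min_over_time(postgres_queue_messages_visible[15m])"
THRESHOLD = 500  # example backlog threshold

def backlog_sustained() -> bool:
    """True if the queue has held more than THRESHOLD visible messages for 15 minutes."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        _, value = series["value"]
        if float(value) > THRESHOLD:
            return True
    return False

if __name__ == "__main__":
    if backlog_sustained():
        print("postgres queue backlog has stayed above the threshold for 15 minutes")
```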

Worker Pool Controller (Kubernetes)»

For Kubernetes worker pool deployments, you can monitor the worker pool controller using Prometheus metrics. These metrics are available in the spacelift_workerpool_controller namespace. See the Controller metrics section for more details.

  • spacelift_workerpool_controller_run_startup_duration_seconds (histogram) - time between when a job assignment is received and the worker container is started.
  • spacelift_workerpool_controller_worker_creation_errors_total (counter) - total number of worker creation errors.
  • spacelift_workerpool_controller_worker_idle_total (gauge) - number of idle workers.
  • spacelift_workerpool_controller_worker_total (gauge) - total number of workers.
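
As a quick way to inspect these gauges outside of a full Prometheus setup, you can scrape the controller's metrics endpoint directly and parse the exposition format with the prometheus_client library. The URL below is an assumption - expose or port-forward the controller's metrics port in your own cluster first.

```python
# Minimal sketch: scrape the controller's metrics endpoint and print the
# worker gauges listed above. The URL is a placeholder (e.g. reachable via
# `kubectl port-forward`).
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8080/metrics"  # placeholder

WATCHED = {
    "spacelift_workerpool_controller_worker_total",
    "spacelift_workerpool_controller_worker_idle_total",
}

def worker_gauges() -> dict:
    text = requests.get(METRICS_URL, timeout=10).text
    values = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name in WATCHED:
                values[sample.name] = sample.value
    return values

if __name__ == "__main__":
    for name, value in worker_gauges().items():
        print(f"{name}: {value}")
```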

Telemetry»

Telemetry and tracing can help diagnose complex issues but are not required for basic monitoring. If you decide to implement tracing:

  • Configure an appropriate backend (Datadog, AWS X-Ray, or OpenTelemetry).
  • Focus on high-value traces (API requests, run execution, etc.).
  • Use sampling in production to reduce overhead.

Refer to the Telemetry reference for configuration options.

Logging»

Setting up proper log collection is strongly recommended - it’s a key part of running a healthy self-hosted installation. Without it, identifying and fixing issues becomes much harder and more time-consuming.

The sections below outline what logs are available and how to collect them across the different components of your Spacelift setup.

Core services»

All three core services (server, scheduler, and drain) log to stdout and stderr. At Spacelift we primarily rely on traces for debugging, so you won't find many info-level logs; errors and terminal failures, however, will be present.

Docker-based worker pools»

Our Docker-based worker pools write their logs to files under /var/log/spacelift: error.log and info.log.

Note that in the case of a startup failure, the worker instance terminates immediately, so you won't have a chance to inspect the logs. We provide an option to keep the instance running on failure for the following two types of deployments:

  • CloudFormation - the worker pool deployment stack has a PowerOffOnError variable. If set to false, the worker pool will not terminate on startup failure.
  • terraform-aws-spacelift-workerpool-on-ec2 Terraform module - this module has a selfhosted_configuration variable that must be provided for self-hosted installations. The variable has an embedded power_off_on_error field.

Kubernetes-based worker pools»

The Kubernetes-based worker pools log to stdout and stderr. The documentation has a dedicated section on troubleshooting that provides more details on how to retrieve logs. You can use any Kubernetes log collection tool (e.g., Fluentd, Fluent Bit, Loki) to collect and aggregate these logs.
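
For ad-hoc troubleshooting (as opposed to continuous collection), you can also pull recent pod logs with the official Kubernetes Python client. The namespace and label selector below are placeholders - adjust them to match how your worker pool is actually deployed.

```python
# Minimal sketch: dump recent logs from worker pool pods using the official
# Kubernetes Python client. NAMESPACE and LABEL_SELECTOR are assumptions.
from kubernetes import client, config

NAMESPACE = "spacelift-worker-controller-system"  # placeholder
LABEL_SELECTOR = "app.kubernetes.io/part-of=spacelift-workerpool"  # placeholder

def dump_recent_logs(tail_lines: int = 200) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        print(f"--- {pod.metadata.name} ---")
        print(v1.read_namespaced_pod_log(
            name=pod.metadata.name,
            namespace=NAMESPACE,
            tail_lines=tail_lines,
        ))

if __name__ == "__main__":
    dump_recent_logs()
```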

Further Reading»