Operations runbook

This is the page you open at 2 a.m. It assumes the platform is already deployed through the Bitbucket → Jenkins pipeline and that you have SSH access to the hosts. It covers checking that the system is healthy, restarting the right thing the right way, rolling back a bad release, scaling out, and triaging the failures that recur. Everything here is host-level and service-level. For campaign-level problems an operator sees in the Console (leads not dialing, low connect rate, calls dropping mid-conversation), send them to the campaign troubleshooting guide first — most “the dialler is broken” reports are actually campaign configuration.

Service names in this runbook map to systemd units and git repos as follows. Use the friendly names in conversation; the unit/repo names are what you type.

Service	systemd unit(s)	Git repo	Host role
Backend API	`voxbridge`	`vb-core`	API/App host
Console	static files (served by nginx)	`vb-dashboard`	API/App host
Voice fleet	`voxcore@1` … `voxcore@N`	`vb-agents`	Fleet host(s)
Dialler	`voxdialler`	(no separate Pelocal repo)	SIP/LiveKit host

1. Health checks

Three services expose health endpoints. Check them in this order — Backend first (it is the source of truth and everything else depends on it), then the fleet, then the dialler.

Backend API

The Backend serves GET /health on port 8080. The app only boots if its required settings (MongoDB, Redis, JWT secret) are present, so a 200 here also tells you the data layer is reachable.

# On the API/App host
curl -s http://localhost:8080/health

If this fails, nothing else matters — fix the Backend, MongoDB, or Redis before looking at the fleet or dialler. A clean response confirms config and durable storage are up.
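After a Backend restart the endpoint can take a few seconds to come back, so a single curl can mislead. The sketch below polls GET /health until it returns 200 or gives up. The function name, retry count, and delay are illustrative choices, not part of the platform.

```shell
# wait_for_backend [TRIES] [DELAY_SECONDS]
# Polls the Backend health endpoint until it answers 200 or TRIES is exhausted.
# Hypothetical helper: the name and defaults are assumptions for illustration.
wait_for_backend() {
  local tries="${1:-10}" delay="${2:-3}" code="" i=1
  while [ "$i" -le "$tries" ]; do
    # -o /dev/null drops the body, -w prints only the status code,
    # --max-time keeps a hung Backend from hanging the check.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
      http://localhost:8080/health || true)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$tries" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "still unhealthy after $tries attempt(s), last status: ${code:-none}"
  return 1
}
```

Run it on the API/App host, for example `wait_for_backend 10 3` right after `systemctl restart voxbridge`. A status of 000 means curl could not connect at all.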

Voice fleet

The fleet has two endpoints. Each worker answers GET /health for itself; the Backend (or any caller hitting the public fleet domain through nginx) can call GET /health/fleet to aggregate every worker over its Unix socket.

# Per-worker (hits whichever worker nginx routes you to)
curl -s https://<fleet-host>/health

# Aggregate across all workers on the host
curl -s https://<fleet-host>/health/fleet

The aggregate response is the one to read during an incident. The fields you care about:

Field	Meaning	What “healthy” looks like
`worker_calls`	Live calls on this worker right now	`0` or `1` per worker (each worker takes one call)
`worker_max`	Max concurrent calls per worker (`MAX_CONCURRENT_CALLS`)	`1` in production
`fleet_available`	Free call slots across the whole host	`> 0` when the host can take new calls

fleet_available is your live capacity gauge. If it sits at 0 during a campaign, the host is genuinely full and you should add workers or a host (see Scaling). If it reads 0 but call volume is low, you have stuck workers — jump to common incidents.
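The rule above can be sketched as two small helpers: one pulls fleet_available out of the aggregate response, the other turns it into a verdict. This assumes the response is JSON with a top-level integer field written as "fleet_available": N, which this page does not spell out. The helper names and the busy/quiet argument are illustrative.

```shell
# fleet_available: read an aggregate /health/fleet body on stdin and print
# the fleet_available integer. Assumes a JSON field "fleet_available": N.
fleet_available() {
  sed -n 's/.*"fleet_available"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# capacity_verdict AVAILABLE LOAD
# LOAD is the operator's own read of call volume: "busy" or "quiet".
capacity_verdict() {
  local available="$1" load="$2"
  case "$available" in
    ''|*[!0-9]*) echo "unknown"; return 1 ;;  # could not parse the response
  esac
  if [ "$available" -gt 0 ]; then
    echo "ok"              # host can take new calls
  elif [ "$load" = "busy" ]; then
    echo "full"            # genuinely full: add workers or a host
  else
    echo "stuck-workers"   # zero capacity with low volume
  fi
}

# Usage:
#   capacity_verdict "$(curl -s https://<fleet-host>/health/fleet | fleet_available)" quiet
```

If jq is installed on the host it is a sturdier way to read the field than sed; the sed form is used here only to avoid an extra dependency.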

Dialler

The dialler is a background asyncio worker, not a web server, but it runs a small health server on HEALTH_PORT (default 8090, bound to 127.0.0.1). It exposes GET /health and GET /metrics.

# On the SIP/LiveKit host
curl -s http://127.0.0.1:8090/health
curl -s http://127.0.0.1:8090/metrics

/health reports liveness: it goes unhealthy if the tick loop has not run within HEALTH_STALE_SECONDS (default 30). /metrics exposes the rolling pacing numbers (answer rate, AHT, abandon rate, active dials) that confirm the dialler is actually pacing and not just alive.

A systemd timer watches this for you: voxdialler-healthcheck.timer periodically runs check_health.sh / smoke_check.py against the local health server, so a wedged dialler is caught without anyone watching a dashboard.

# Confirm the healthcheck timer is active
systemctl status voxdialler-healthcheck.timer --no-pager
journalctl -u voxdialler-healthcheck.service -n 20 --no-pager

/health returning 200 means the loop is ticking, not that calls are being placed. A dialler can be perfectly alive and dial nothing because there are no running campaigns, the carrier circuit breaker is open, or the fleet is full. Always cross-check /metrics (active dials, abandon rate) and fleet_available before concluding the dialler is at fault.
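The three checks in this section can be run in the recommended order from one terminal. The sketch below stops at the first failure, since a broken Backend makes the later results meaningless. The host aliases are placeholders, and it assumes an unhealthy endpoint answers with an HTTP error status, which is what curl -f detects.

```shell
# Placeholder SSH targets and fleet domain; substitute your own.
API_HOST="${API_HOST:-api-host}"
FLEET_HOST="${FLEET_HOST:-fleet-host}"
SIP_HOST="${SIP_HOST:-sip-host}"

# probe LABEL COMMAND...: run the command quietly and report OK or FAIL.
probe() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# health_sweep: Backend first, then the fleet, then the dialler.
health_sweep() {
  # Backend and dialler are checked on their own hosts, as in this runbook.
  probe "backend" ssh "$API_HOST" curl -sf --max-time 5 http://localhost:8080/health || return 1
  probe "fleet" curl -sf --max-time 5 "https://$FLEET_HOST/health/fleet" || return 1
  probe "dialler" ssh "$SIP_HOST" curl -sf --max-time 5 http://127.0.0.1:8090/health || return 1
}
```

A clean sweep only proves liveness. It does not replace the /metrics and fleet_available cross-check described above.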

2. Service management

All app services run under systemd. Restart the smallest thing that fixes the problem — never bounce the whole fleet when one worker is wedged.

Voice fleet workers

The fleet is a templated systemd instance unit: one instance per worker, each on its own Unix socket (/tmp/voxcore_%i.sock). That gives you a choice between bouncing one worker with zero downtime or restarting them all.

# Restarts one worker; the other workers keep serving calls.
systemctl restart voxcore@<i>

# Restarts every worker. Use only when a config/.env change must apply fleet-wide.
# In-flight calls on every worker are dropped.
systemctl restart 'voxcore@*'

A single stuck worker (worker_calls=1 but the call is long dead) is the most common fleet problem. systemctl restart voxcore@<i> clears it and costs you at most that one phantom call — the rest of the host keeps running. Reserve voxcore@* for deliberate fleet-wide changes.

Dialler and Backend

# Dialler (on the SIP/LiveKit host)
systemctl restart voxdialler

# Backend API (on the API/App host)
systemctl restart voxbridge

The dialler unit sets Restart=always, so systemd brings it back automatically if it crashes. A manual restart is also safe: the dialler holds no in-memory campaign state that it cannot re-derive from MongoDB on the next tick. In-flight calls already attached to fleet workers are unaffected; only new dialing pauses for a second or two.

Logs

Follow any unit’s logs with journalctl:

# Backend
journalctl -u voxbridge -f

# Dialler
journalctl -u voxdialler -f

# One fleet worker
journalctl -u voxcore@3 -f

# All fleet workers interleaved
journalctl -u 'voxcore@*' -f

When triaging a specific call, narrow to the worker that handled it (journalctl -u voxcore@<i> -f) rather than the voxcore@* firehose — with many workers the combined stream is hard to read.
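For a call that has already ended, following the log live is less useful than searching a bounded window. A minimal sketch, assuming the worker writes some call identifier into its log lines (the identifier format is not defined on this page, and the function name is illustrative):

```shell
# call_logs WORKER_INDEX CALL_ID [SINCE]
# Prints log lines from one fleet worker that mention CALL_ID.
# SINCE takes any systemd time expression, for example "1 hour ago".
call_logs() {
  local index="$1" call_id="$2" since="${3:-1 hour ago}"
  # -o short-iso prints unambiguous timestamps; grep -F matches the id literally.
  journalctl -u "voxcore@${index}" --since "$since" --no-pager -o short-iso \
    | grep -F -- "$call_id"
}

# Usage:
#   call_logs 3 <call-id> "30 min ago"
```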

3. Rollback

There is no special rollback tooling. A rollback is just a forward deploy of the previous good release through the same Jenkins job. Because Python services use uv and the Console is a static build, “going back” is checking out an older tag and re-running the deploy step.

Always keep a last-known-good tag on each repo before you ship. Tag the release you are about to deploy and know which tag you are rolling back to. Rolling back to “whatever was there before” is how you reintroduce a bug you already fixed.

Identify the green tag

Find the last release tag that was healthy in production. If you tag every deploy (recommended), this is the tag immediately before the current one.

git -C /opt/vb-agents tag --sort=-creatordate | head
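If every deploy is tagged and the current release is the newest tag, the rollback target is the second entry in that list. A small sketch that prints it; the function name is illustrative, and the result is only as trustworthy as your tagging discipline, so confirm the tag was actually healthy in production before using it.

```shell
# previous_tag REPO_DIR: print the second-newest tag by creation date.
# Assumes the currently deployed release is the newest tag in the repo.
previous_tag() {
  git -C "$1" tag --sort=-creatordate | sed -n '2p'
}

# Usage:
#   previous_tag /opt/vb-agents
```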

Re-run the Jenkins deploy against that tag

Trigger the same deploy job you use for a normal release, but with the prior tag as the build ref. On each target host the job runs exactly the steps a forward deploy runs:

# What the deploy step runs on the host (Python service example)
git -C /opt/vb-agents fetch --tags
git -C /opt/vb-agents checkout <last-good-tag>
(cd /opt/vb-agents && uv sync --frozen)   # uv sync acts on the project in the current directory
systemctl restart 'voxcore@*'

For the Console the rollback step is npm ci && npm run build against the prior tag, then nginx serves the regenerated dist/.

Verify the rollback landed

Run the post-deploy health checks from section 1: Backend /health, fleet /health/fleet, dialler /health. Confirm versions / behaviour match the rolled-back tag, not the bad one.
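One way to confirm the checkout itself is to ask git whether HEAD sits exactly on the tag you rolled back to. This is a sketch with an illustrative function name. It proves which code is on disk, not that the services were restarted onto it, so it complements the health checks rather than replacing them.

```shell
# on_tag REPO_DIR EXPECTED_TAG
# Succeeds only if HEAD in REPO_DIR is exactly at EXPECTED_TAG.
# If several tags point at the same commit, git describe prints only one
# of them, so a mismatch here deserves a second look before you panic.
on_tag() {
  local actual
  actual=$(git -C "$1" describe --tags --exact-match 2>/dev/null || true)
  [ "$actual" = "$2" ]
}

# Usage:
#   on_tag /opt/vb-agents <last-good-tag> && echo "rollback checkout confirmed"
```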

Roll back per service. The four services deploy independently, so if a bad release only touched the Backend, roll back only vb-core — leave the fleet and dialler on their current good tags. Mismatched cross-service contracts are rare because new config fields are additive, but when in doubt roll back the one service you changed.

4. Scaling

Two levers: more workers on an existing fleet host, or more fleet hosts. Neither requires a code change.

Add workers to a fleet host

Capacity per host = number of voxcore@ instances. To add workers you enable more instances and give nginx a socket entry for each.

Add upstream socket entries in nginx

Each worker needs a max_conns=1 socket line in the fleet upstream block. The canonical template lives in the repo at infra/nginx/voxcore-fleet.conf.template.

upstream voxcore_fleet {
    least_conn;
    server unix:/tmp/voxcore_1.sock max_conns=1;
    server unix:/tmp/voxcore_2.sock max_conns=1;
    # ... add one line per new worker ...
    server unix:/tmp/voxcore_20.sock max_conns=1;
}

The fleet nginx config must use map $http_upgrade to set the Connection header. A hardcoded Connection "upgrade" breaks the HTTP POST routes and dialout returns 422. The short POST routes (/attach, /livekit/dialout, /livekit/widget) also need 429-retry to the next upstream, or a logically busy worker rejects a call other free workers could have taken. Both are baked into the template — start from it.
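Typing twenty near-identical socket lines by hand invites an off-by-one. The sketch below prints the upstream block for a given worker count so you can diff it against your edited config. It reproduces only the upstream block shown above. The map and retry settings still come from the template, and the function name is illustrative.

```shell
# fleet_upstream COUNT: print an upstream block with one socket line per worker.
fleet_upstream() {
  local count="$1" i
  echo "upstream voxcore_fleet {"
  echo "    least_conn;"
  for i in $(seq 1 "$count"); do
    echo "    server unix:/tmp/voxcore_${i}.sock max_conns=1;"
  done
  echo "}"
}

# Usage:
#   fleet_upstream 20
```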

Enable and start the new instances

for i in $(seq 17 20); do
  systemctl enable --now "voxcore@$i"
done

Reload nginx

nginx -t && systemctl reload nginx

reload (not restart) keeps existing calls alive while the new sockets come into rotation.

Confirm new capacity

curl -s https://<fleet-host>/health/fleet

fleet_available should rise by the number of workers you added.

Add a fleet host

When a single host is maxed (CPU/RAM, not just workers), scale horizontally by adding another host. The fleet keeps no shared state, so a new host needs no coordination with existing ones.

Provision and deploy the fleet to the new host

Deploy vb-agents to the new host through the normal Jenkins job: uv sync, the templated voxcore@ units, nginx (from infra/nginx/voxcore-fleet.conf.template), and local MinIO for recordings.

Add the new host’s public fleet URL to the Backend’s fleet list (the dialler reads its fleet targets from FLEET_URLS, and the Backend selects fleets for inbound routing). Once registered, fleet selection includes the new host automatically.

If you front fleets with HAProxy, add it there too

On deployments that use an HAProxy WSS ingress in front of multiple fleets, add the new host as a backend using the repo’s infra/haproxy/add-fleet.sh, then reload HAProxy. No change is needed on the existing fleet hosts.

Adding a host is a pure configuration change — no code ships and no existing host restarts. That is the whole point of the stateless runtime plane: capacity is a list of URLs, not a deploy.
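After registering a host it helps to confirm that every fleet URL answers and reports capacity. The sketch below assumes FLEET_URLS is a comma-separated list of base URLs and that the aggregate response carries a "fleet_available": N JSON field. Neither format is stated on this page, so check both against your configuration reference first. The function name is illustrative.

```shell
# fleet_capacity_report "URL1,URL2,..."
# Prints "<url> <fleet_available>" per fleet, or "<url> unreachable".
# Assumption: the list is comma-separated; adjust tr if yours differs.
fleet_capacity_report() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r url; do
    [ -n "$url" ] || continue
    # ${url%/} drops one trailing slash so the path joins cleanly.
    body=$(curl -s --max-time 5 "${url%/}/health/fleet" < /dev/null || true)
    avail=$(printf '%s' "$body" \
      | sed -n 's/.*"fleet_available"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
      | head -n 1)
    echo "$url ${avail:-unreachable}"
  done
}

# Usage:
#   fleet_capacity_report "$FLEET_URLS"
```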

5. Common incidents & escalation

Work top to bottom — the table is ordered roughly by how often each happens.

Symptom	Most likely cause	First action
Fleet shows no capacity but workers are idle	One or more workers stuck holding a dead call (`worker_calls=1`, no real audio)	Find the wedged worker and restart it: `systemctl restart voxcore@<i>`
Calls stuck in `leased` / `ringing` / `attaching`	Dialler tick stalled, fleet unreachable, or a host that died mid-call left orphaned records	Check dialler `/health` + `/metrics`; the Backend stale-call cleanup reaps orphans, but a hung tick needs `systemctl restart voxdialler`
Dialler stopped or two diallers running	Crash without auto-restart, or a second instance was deployed (over-dials)	Confirm exactly one dialler per database; restart the canonical one, stop/disable any extra
Inbound SIP / DID not connecting	LiveKit SIP webhook or trunk dispatch misconfigured for that number	Check LiveKit + livekit-sip containers and the inbound dispatch config; this is config, set once per deployment
Recordings missing for recent calls	MinIO/object storage on the fleet host down or misconfigured	Check the fleet host’s MinIO and `MINIO_*` settings against the Backend’s storage config

Fleet shows no capacity but workers are idle

/health/fleet reports fleet_available=0 while call volume is low. A worker is holding a phantom call (the SIP leg died but the pipeline never tore down).

# Find the worker(s) reporting a call
curl -s https://<fleet-host>/health/fleet
# Restart just the stuck worker — zero downtime for the others
systemctl restart voxcore@<i>

If many workers are stuck at once, suspect the Backend or LiveKit (workers can’t finalise without them). Fix the upstream dependency first, then bounce the affected workers. systemctl restart 'voxcore@*' is the last resort.
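The public /health only reaches whichever worker nginx picks, so to find a specific wedged worker you can ask each one directly over its Unix socket. This sketch assumes the per-worker /health body includes a "worker_calls": N JSON field and that your user can read the sockets under /tmp. The function names are illustrative.

```shell
# worker_health INDEX: query one worker over its socket, bypassing nginx.
# curl still needs a URL; the host part is not used to pick the connection.
worker_health() {
  curl -s --max-time 3 --unix-socket "/tmp/voxcore_$1.sock" http://localhost/health
}

# busy_workers FIRST LAST: print the index of every worker reporting a call.
busy_workers() {
  local i calls
  for i in $(seq "$1" "$2"); do
    calls=$(worker_health "$i" \
      | sed -n 's/.*"worker_calls"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
      | head -n 1)
    if [ "${calls:-0}" -ge 1 ]; then
      echo "$i"
    fi
  done
  return 0
}

# Usage:
#   busy_workers 1 20
```

A worker in this list is busy, not necessarily stuck. Compare it against the calls that are really live before restarting anything.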

Calls stuck in leased / ringing / attaching

These are MongoDB call states the dialler drives. A pile-up means the loop stopped advancing them.

Check the dialler is ticking: curl -s http://127.0.0.1:8090/health. If stale, systemctl restart voxdialler.
Check /metrics for active_dials vs what MongoDB shows — a large gap means zombie records (dialed legs that never reported back).
The Backend’s stale-call cleanup (scripts/cleanup_stale_calls.py, scheduled via voxbridge-cleanup.service) reaps records past STALE_IN_PROGRESS_TIMEOUT_MINUTES. Confirm it is running before manually clearing anything.

Never bulk-edit call documents in MongoDB by hand to “unstick” a campaign while the dialler is running. The dialler leases atomically; a manual flip can hand the same contact to two dials. Stop the dialler first if you must touch records directly.

Dialler stopped or duplicated

The single hardest rule on the platform:

Exactly one dialler instance per database. The Backend and Dialler share one MongoDB (MONGODB_DB=voxbridge). Two diallers on the same database each pace as if they were the only one and the campaign over-dials — blowing past concurrency limits and the carrier CPS. The dialler runs on the SIP/LiveKit host. If a second copy exists anywhere (a spare host, an old API-host deployment), it must be stopped and disabled, not just stopped.

# On the canonical SIP/LiveKit host
systemctl status voxdialler --no-pager
systemctl restart voxdialler   # if it crashed

# On any host that should NOT run a dialler
systemctl stop voxdialler && systemctl disable voxdialler
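To confirm the one-dialler rule across hosts, you can ask each candidate host whether the unit is active. A sketch with an illustrative function name and placeholder hostnames; it only sees diallers managed by systemd under the voxdialler unit name.

```shell
# active_diallers HOST...: print every host where voxdialler is active.
active_diallers() {
  local host state
  for host in "$@"; do
    # ssh -n keeps ssh from reading stdin; is-active prints the unit state.
    state=$(ssh -n "$host" systemctl is-active voxdialler 2>/dev/null || true)
    if [ "$state" = "active" ]; then
      echo "$host"
    fi
  done
}

# Usage (placeholder hostnames):
#   active_diallers sip-host api-host spare-host
```

Exactly one line of output is the healthy result. A host that prints nothing here can still have the unit enabled, so also run systemctl is-enabled voxdialler on hosts that should never run it.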

LiveKit / SIP inbound not dispatching

A DID rings but no bot ever joins. This is almost always LiveKit/SIP configuration, not the fleet.

LiveKit runs as a docker-compose stack (livekit + livekit-sip). Confirm both containers are up: docker compose ps on the SIP/LiveKit host.
Inbound SIP needs a webhook configured in LiveKit so a ringing trunk dispatches to the fleet’s POST /livekit/dispatch. Missing webhook = a silent, cancelled room.
The trunk’s number format must match what the carrier sends, or LiveKit rejects the leg before dispatch.

This is config you set once per deployment, not per call. If inbound worked yesterday and stopped today, check whether the LiveKit containers restarted or a trunk was edited.
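The container check can be made pass/fail. This sketch assumes the compose services are named livekit and livekit-sip, matching the container names above, and a Docker Compose v2 that supports the --services and --status flags on ps. Run it from the directory that holds the compose file. The function name is illustrative.

```shell
# livekit_up: succeed only if both LiveKit services are running.
livekit_up() {
  local running svc
  running=$(docker compose ps --services --status running 2>/dev/null || true)
  for svc in livekit livekit-sip; do
    # -x matches the whole line, so "livekit" does not match "livekit-sip".
    if ! printf '%s\n' "$running" | grep -qx -- "$svc"; then
      echo "not running: $svc"
      return 1
    fi
  done
  echo "livekit stack running"
}
```

If both services are up and inbound still fails, move on to the webhook and the trunk number format, which this check does not cover.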

Recordings missing

Calls complete and disposition correctly but playback fails or no recording URL lands.

The fleet uploads WAVs to the object storage in its runtime config. Pelocal fleet hosts commonly run local MinIO (MINIO_ENDPOINT=localhost:9000, bucket recordings).
Check MinIO is up on the fleet host and that the Backend’s storage settings match the fleet’s MINIO_* env exactly — a mismatch makes the Backend hand out URLs the host can’t serve.
Recent uploads failing across all calls usually means MinIO is down or the bucket changed; one missing recording is usually a single failed call, not a systemic fault.
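MinIO ships a liveness endpoint at /minio/health/live, which gives a quick answer to "is MinIO up" on the fleet host. A minimal sketch, assuming MinIO listens on plain HTTP at the MINIO_ENDPOINT value shown above; the function name is illustrative.

```shell
# minio_live [ENDPOINT]: succeed if MinIO answers its liveness probe with 200.
minio_live() {
  local endpoint="${1:-localhost:9000}" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "http://${endpoint}/minio/health/live" || true)
  [ "$code" = "200" ]
}

# Usage, on the fleet host:
#   minio_live localhost:9000 && echo "minio up" || echo "minio DOWN"
```

A live MinIO with missing recordings points back at the settings mismatch: compare the fleet's MINIO_* values with the Backend's storage config.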

When to escalate

Escalate to engineering (with logs from the relevant journalctl -u <unit>) when:

A worker crash-loops after restart — systemctl status voxcore@<i> shows repeated failures rather than one phantom call.
The Backend won’t boot (/health never returns) after MongoDB and Redis are confirmed up — likely a config or migration issue.
The dialler over-dials despite a single confirmed instance — a pacing or breaker bug, not an ops problem.
Inbound dispatch fails after LiveKit config is verified correct against a working deployment.

For anything an operator sees at the campaign level — leads not progressing, connect rate, AMD behaviour — start at the campaign troubleshooting guide before paging engineering.

Quick reference

# Health
curl -s http://localhost:8080/health                 # Backend (API/App host)
curl -s https://<fleet-host>/health/fleet             # Fleet aggregate
curl -s http://127.0.0.1:8090/health                  # Dialler (SIP host)
curl -s http://127.0.0.1:8090/metrics                 # Dialler pacing metrics

# Restart
systemctl restart voxcore@<i>      # one fleet worker (zero downtime)
systemctl restart 'voxcore@*'      # all fleet workers (drops all calls)
systemctl restart voxdialler       # dialler
systemctl restart voxbridge        # Backend

# Logs
journalctl -u voxbridge -f
journalctl -u voxcore@<i> -f
journalctl -u voxdialler -f

Deploy pipeline

The Bitbucket → Jenkins flow that ships every release this runbook rolls back.

Configuration & secrets

Every .env var per service — the source for the settings referenced above.

Host deployment

Host roles, systemd units, and what runs where.

Campaign troubleshooting

Operator-facing fixes for campaign-level issues — start here before paging engineering.
