Service names in this runbook map to systemd units and git repos as follows.
Use the friendly names in conversation; the unit/repo names are what you type.
| Service | systemd unit(s) | Git repo | Host role |
|---|---|---|---|
| Backend API | voxbridge | vb-core | API/App host |
| Console | static files (served by nginx) | vb-dashboard | API/App host |
| Voice fleet | voxcore@1 … voxcore@N | vb-agents | Fleet host(s) |
| Dialler | voxdialler | (no separate Pelocal repo) | SIP/LiveKit host |
1. Health checks
Three services expose health endpoints. Check them in this order: Backend first (it is the source of truth and everything else depends on it), then the fleet, then the dialler.

Backend API
The Backend serves GET /health on port 8080. The app only boots if its required
settings (MongoDB, Redis, JWT secret) are present, so a 200 here also tells you the
data layer is reachable.
Voice fleet
The fleet has two endpoints. Each worker answers GET /health for itself; the
Backend (or any caller hitting the public fleet domain through nginx) can call
GET /health/fleet to aggregate every worker over its Unix socket.
| Field | Meaning | What “healthy” looks like |
|---|---|---|
| worker_calls | Live calls on this worker right now | 0 or 1 per worker (each worker takes one call) |
| worker_max | Max concurrent calls per worker (MAX_CONCURRENT_CALLS) | 1 in production |
| fleet_available | Free call slots across the whole host | > 0 when the host can take new calls |
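
As a rough illustration of how the three fields relate, the sketch below derives a free-slot count from per-worker readings. The payload shape (a list of dicts carrying worker_calls and worker_max) and the function name free_slots are assumptions made for this example; the real /health/fleet response may be structured differently.

```python
from typing import Iterable, Mapping


def free_slots(workers: Iterable[Mapping[str, int]]) -> int:
    """Sum free call slots across workers (hypothetical payload shape)."""
    total = 0
    for w in workers:
        # Clamp at zero so a worker reporting more calls than its max
        # never subtracts from the host total.
        total += max(w["worker_max"] - w["worker_calls"], 0)
    return total


# Example: three workers, one busy, each limited to one call.
sample = [
    {"worker_calls": 1, "worker_max": 1},
    {"worker_calls": 0, "worker_max": 1},
    {"worker_calls": 0, "worker_max": 1},
]
print(free_slots(sample))  # 2
```

A result of 0 while call volume is low is the signal to look for a wedged worker rather than to add capacity.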
Dialler
The dialler is a background asyncio worker, not a web server, but it runs a small health server on HEALTH_PORT (default 8090, bound to 127.0.0.1). It exposes
GET /health and GET /metrics.
/health reports liveness — it goes unhealthy if the tick loop has not run within
HEALTH_STALE_SECONDS (default 30). /metrics exposes the rolling pacing numbers
(answer rate, AHT, abandon rate, active dials) you use to confirm the dialler is
actually pacing and not just alive.
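
The liveness rule can be pictured as a simple age check on the last tick. This is a minimal sketch of that idea, not the dialler's actual code: the function name tick_is_stale and the use of a monotonic clock are assumptions; only the 30-second default comes from this runbook.

```python
import time
from typing import Optional

HEALTH_STALE_SECONDS = 30.0  # default quoted in this runbook


def tick_is_stale(
    last_tick: float,
    now: Optional[float] = None,
    stale_after: float = HEALTH_STALE_SECONDS,
) -> bool:
    """Return True when the last tick is older than the staleness window."""
    if now is None:
        # Monotonic time is unaffected by wall-clock adjustments.
        now = time.monotonic()
    return (now - last_tick) > stale_after


# A tick recorded just now is well inside the window.
print(tick_is_stale(time.monotonic()))
```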
A systemd timer watches this for you. voxdialler-healthcheck.timer runs
check_health.sh / smoke_check.py periodically against the local health server, so
a wedged dialler is caught without anyone watching a dashboard.
2. Service management
All app services run under systemd. Restart the smallest thing that fixes the problem; never bounce the whole fleet when one worker is wedged.

Voice fleet workers
The fleet is a templated systemd instance unit: one instance per worker, each on its own Unix socket (/tmp/voxcore_%i.sock). That gives you a choice between bouncing one
worker with zero downtime or restarting them all.
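
The two choices differ only in the unit name passed to systemctl. The helper below is an illustrative sketch (restart_argv is a hypothetical name, not part of any repo) that builds the argument list for either case; it does not run anything.

```python
from typing import List, Optional

UNIT_TEMPLATE = "voxcore@{}"  # templated unit name from this runbook


def restart_argv(worker: Optional[int] = None) -> List[str]:
    """Build the systemctl argv for one worker, or for all when worker is None."""
    if worker is None:
        # systemctl matches the glob against loaded units itself.
        target = "voxcore@*"
    else:
        target = UNIT_TEMPLATE.format(worker)
    return ["systemctl", "restart", target]


print(" ".join(restart_argv(3)))  # systemctl restart voxcore@3
print(" ".join(restart_argv()))   # systemctl restart voxcore@*
```

When typing the all-workers form into a shell, quote the pattern ('voxcore@*') so the shell does not try to expand it.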
Dialler and Backend
The dialler (voxdialler) runs with Restart=always, so systemd brings it back automatically if it
crashes. Restarting it is safe: it holds no in-memory campaign state it cannot
re-derive from MongoDB on the next tick. In-flight calls already attached to fleet
workers are unaffected by a dialler restart; only new dialing pauses for a second
or two.
Logs
Follow any unit’s logs with journalctl, for example journalctl -u <unit> -f.
3. Rollback
There is no special rollback tooling. A rollback is just a forward deploy of the previous good release through the same Jenkins job. Because Python services use uv
and the Console is a static build, “going back” is checking out an older tag and
re-running the deploy step.
Identify the green tag
Find the last release tag that was healthy in production. If you tag every deploy
(recommended), this is the tag immediately before the current one.
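
If every deploy is tagged, finding the rollback candidate is a lookup in the ordered tag list. The sketch below assumes the tags are already listed oldest to newest in deploy order; the tag names and the function previous_tag are hypothetical. It only picks the candidate, so you still need to confirm that release was healthy.

```python
from typing import Sequence


def previous_tag(tags: Sequence[str], current: str) -> str:
    """Return the tag deployed immediately before `current`."""
    idx = list(tags).index(current)  # raises ValueError if current is unknown
    if idx == 0:
        raise ValueError(f"{current!r} is the oldest tag; nothing to roll back to")
    return tags[idx - 1]


# Hypothetical release history, oldest first.
history = ["v1.4.0", "v1.4.1", "v1.5.0"]
print(previous_tag(history, "v1.5.0"))  # v1.4.1
```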
Re-run the Jenkins deploy against that tag
Trigger the same deploy job you use for a normal release, but with the prior tag as
the build ref. On each target host the job does exactly what a forward deploy does.
For the Console the rollback step is npm ci && npm run build against the prior tag,
then nginx serves the regenerated dist/.

Verify the rollback landed
Run the post-deploy health checks from section 1: Backend
/health, fleet /health/fleet, dialler /health. Confirm versions / behaviour
match the rolled-back tag, not the bad one.

Roll back per service. The four services deploy independently, so if a bad release
only touched the Backend, roll back only vb-core and leave the fleet and dialler
on their current good tags. Mismatched cross-service contracts are rare because new
config fields are additive, but when in doubt roll back the one service you changed.

4. Scaling
Two levers: more workers on an existing fleet host, or more fleet hosts. Neither requires a code change.

Add workers to a fleet host
Capacity per host = number of voxcore@ instances. To add workers you enable more
instances and give nginx a socket entry for each.
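
Both halves of that change follow mechanically from the worker numbers. The sketch below generates the systemctl arguments and the matching nginx upstream lines for a range of new workers. The function names are hypothetical, and the exact line format should be checked against the canonical nginx template in the repo before use; only the socket path pattern and max_conns=1 come from this runbook.

```python
from typing import List

SOCKET_TEMPLATE = "/tmp/voxcore_{}.sock"  # socket path pattern from this runbook


def enable_argv(first: int, last: int) -> List[List[str]]:
    """systemctl argv to enable and start workers first..last (inclusive)."""
    return [
        ["systemctl", "enable", "--now", f"voxcore@{i}"]
        for i in range(first, last + 1)
    ]


def upstream_lines(first: int, last: int) -> List[str]:
    """nginx upstream server lines, one per worker socket."""
    return [
        f"server unix:{SOCKET_TEMPLATE.format(i)} max_conns=1;"
        for i in range(first, last + 1)
    ]


# Example: grow a host from 4 workers to 6.
for argv in enable_argv(5, 6):
    print(" ".join(argv))
for line in upstream_lines(5, 6):
    print(line)
```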
Add upstream socket entries in nginx
Each worker needs a max_conns=1 socket line in the fleet upstream block. The
canonical template lives in the repo at infra/nginx/voxcore-fleet.conf.template.

Reload nginx
A reload (not a restart) keeps existing calls alive while the new sockets come into
rotation.

Add a fleet host
When a single host is maxed (CPU/RAM, not just workers), scale horizontally by adding another host. The fleet keeps no shared state, so a new host needs no coordination with existing ones.

Provision and deploy the fleet to the new host
Deploy vb-agents to the new host through the normal Jenkins job: uv sync, the
templated voxcore@ units, nginx (from infra/nginx/voxcore-fleet.conf.template),
and local MinIO for recordings.

Register the host's URL with the Backend
Add the new host’s public fleet URL to the Backend’s fleet list (the dialler reads
its fleet targets from FLEET_URLS, and the Backend selects fleets for inbound
routing). Once registered, fleet selection includes the new host automatically.

5. Common incidents & escalation
Work top to bottom: the table is ordered roughly by how often each happens.

| Symptom | Most likely cause | First action |
|---|---|---|
| Fleet shows no capacity but workers are idle | One or more workers stuck holding a dead call (worker_calls=1, no real audio) | Find the wedged worker and restart it: systemctl restart voxcore@<i> |
| Calls stuck in leased / ringing / attaching | Dialler tick stalled, fleet unreachable, or a host that died mid-call left orphaned records | Check dialler /health + /metrics; the Backend stale-call cleanup reaps orphans, but a hung tick needs systemctl restart voxdialler |
| Dialler stopped or two diallers running | Crash without auto-restart, or a second instance was deployed (over-dials) | Confirm exactly one dialler per database; restart the canonical one, stop/disable any extra |
| Inbound SIP / DID not connecting | LiveKit SIP webhook or trunk dispatch misconfigured for that number | Check LiveKit + livekit-sip containers and the inbound dispatch config; this is config, set once per deployment |
| Recordings missing for recent calls | MinIO/object storage on the fleet host down or misconfigured | Check the fleet host’s MinIO and MINIO_* settings against the Backend’s storage config |
Fleet shows no capacity but workers are idle
/health/fleet reports fleet_available=0 while call volume is low. A worker is
holding a phantom call (the SIP leg died but the pipeline never tore down).
Find the wedged worker and restart only that one with systemctl restart voxcore@<i>.
systemctl restart 'voxcore@*' is the last resort.
Calls stuck in leased / ringing / attaching
These are MongoDB call states the dialler drives. A pile-up means the loop stopped advancing them.
- Check the dialler is ticking: curl -s http://127.0.0.1:8090/health. If stale, systemctl restart voxdialler.
- Check /metrics for active_dials vs what MongoDB shows; a large gap means zombie records (dialed legs that never reported back).
- The Backend’s stale-call cleanup (scripts/cleanup_stale_calls.py, scheduled via voxbridge-cleanup.service) reaps records past STALE_IN_PROGRESS_TIMEOUT_MINUTES. Confirm it is running before manually clearing anything.
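
To judge how large the backlog is before touching anything, the stale-record rule can be approximated read-only. The sketch below is not the logic of scripts/cleanup_stale_calls.py; the record fields (id, status, updated_at) and the 15-minute timeout in the example are assumptions chosen for illustration. Only the three state names come from this runbook.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping

STUCK_STATES = {"leased", "ringing", "attaching"}  # states named in this runbook


def stale_calls(
    records: Iterable[Mapping],
    now: datetime,
    timeout_minutes: int,
) -> List[str]:
    """Return ids of in-progress calls not updated within the timeout."""
    cutoff = now - timedelta(minutes=timeout_minutes)
    return [
        r["id"]
        for r in records
        if r["status"] in STUCK_STATES and r["updated_at"] < cutoff
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
calls = [
    {"id": "a", "status": "ringing", "updated_at": now - timedelta(minutes=30)},
    {"id": "b", "status": "ringing", "updated_at": now - timedelta(minutes=5)},
    {"id": "c", "status": "completed", "updated_at": now - timedelta(hours=2)},
]
print(stale_calls(calls, now, timeout_minutes=15))  # ['a']
```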
Dialler stopped or duplicated
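The single hardest rule on the platform: run exactly one dialler per database. A second instance over-dials, so restart the canonical one and stop/disable any extra.

LiveKit / SIP inbound not dispatching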
A DID rings but no bot ever joins. This is almost always LiveKit/SIP configuration, not the fleet.
- LiveKit runs as a docker-compose stack (livekit + livekit-sip). Confirm both containers are up: docker compose ps on the SIP/LiveKit host.
- Inbound SIP needs a webhook configured in LiveKit so a ringing trunk dispatches to the fleet’s POST /livekit/dispatch. Missing webhook = a silent, cancelled room.
- The trunk’s number format must match what the carrier sends, or LiveKit rejects the leg before dispatch.
Recordings missing
Calls complete and disposition correctly but playback fails or no recording URL lands.
- The fleet uploads WAVs to the object storage in its runtime config. Pelocal fleet hosts commonly run local MinIO (MINIO_ENDPOINT=localhost:9000, bucket recordings).
- Check MinIO is up on the fleet host and that the Backend’s storage settings match the fleet’s MINIO_* env exactly; a mismatch makes the Backend hand out URLs the host can’t serve.
- Recent uploads failing across all calls usually means MinIO is down or the bucket changed; one missing recording is usually a single failed call, not a systemic fault.
When to escalate
Escalate to engineering (with logs from the relevant journalctl -u <unit>) when:
- A worker crash-loops after restart: systemctl status voxcore@<i> shows repeated failures rather than one phantom call.
- The Backend won’t boot (/health never returns) after MongoDB and Redis are confirmed up, which is likely a config or migration issue.
- The dialler over-dials despite a single confirmed instance, which points to a pacing or breaker bug, not an ops problem.
- Inbound dispatch fails after LiveKit config is verified correct against a working deployment.
Quick reference
Deploy pipeline
The Bitbucket → Jenkins flow that ships every release this runbook rolls back.
Configuration & secrets
Every .env var per service: the source for the settings referenced above.
Host deployment
Host roles, systemd units, and what runs where.
Campaign troubleshooting
Operator-facing fixes for campaign-level issues — start here before paging engineering.
