Job Queue Recovery Runbook
This runbook describes the minimal 1.0 operator workflow for background job visibility and recovery in self-hosted Publaryn deployments.
Access requirements
- a platform administrator account
- a JWT session or API token carrying the audit:read scope
The queue visibility endpoint is:
GET /v1/admin/jobs
Recovery action endpoints are:
POST /v1/admin/jobs/recover-stale
POST /v1/admin/jobs/{job_id}/retry
Supported query parameters:
- state=pending|running|completed|failed|dead
- kind=scan_artifact|index_package|deliver_webhook|cleanup_expired_tokens|cleanup_oci_blobs|reindex_search
- page=<n>
- per_page=<n>
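The filters above can be assembled into a request URL with a small helper. This is a minimal sketch, not part of Publaryn: `build_jobs_url` and the host `publaryn.example` are hypothetical, the client-side validation is only a convenience, and it assumes several filters may be combined in one request.

```python
from urllib.parse import urlencode

# Allowed values copied from the parameter list in this runbook.
STATES = {"pending", "running", "completed", "failed", "dead"}
KINDS = {
    "scan_artifact", "index_package", "deliver_webhook",
    "cleanup_expired_tokens", "cleanup_oci_blobs", "reindex_search",
}


def build_jobs_url(base_url, state=None, kind=None, page=None, per_page=None):
    """Return the GET /v1/admin/jobs URL with only the filters that were set.

    Hypothetical helper; combining filters is an assumption.
    """
    if state is not None and state not in STATES:
        raise ValueError(f"unsupported state: {state}")
    if kind is not None and kind not in KINDS:
        raise ValueError(f"unsupported kind: {kind}")
    params = {"state": state, "kind": kind, "page": page, "per_page": per_page}
    # Drop unset filters so the query string stays minimal.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + "/v1/admin/jobs"
    return f"{url}?{query}" if query else url
```

For example, `build_jobs_url("https://publaryn.example", state="pending")` yields the URL for the pending-jobs check described later in this runbook.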
What to check first
- Inspect the queue summary:
  - summary.by_status
  - summary.by_kind
  - summary.oldest_pending_age_minutes
  - summary.stale_jobs_count
- Confirm whether the queue is blocked in one state or one job kind.
- Review jobs[*].last_error, locked_by, locked_until, and attempts for the affected jobs.
- Use jobs[*].is_stale, jobs[*].can_retry, and jobs[*].recovery_hint to decide whether an API-level recovery action is safe.
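The first-pass checks above could be scripted roughly as follows. This is a sketch under assumptions: the response is taken to be a JSON object with top-level `summary` and `jobs` keys, each job is assumed to carry an `id` field, and `PENDING_AGE_WARN_MINUTES` is a hypothetical threshold that is not defined by Publaryn.

```python
# Hypothetical threshold; tune it to the deployment's normal queue latency.
PENDING_AGE_WARN_MINUTES = 15


def triage(response):
    """Return human-readable findings from a decoded GET /v1/admin/jobs body.

    Assumes a dict with "summary" and "jobs" keys (response shape is assumed).
    """
    summary = response.get("summary", {})
    findings = []

    stale = summary.get("stale_jobs_count", 0)
    if stale:
        findings.append(f"{stale} stale job(s) reported")

    oldest = summary.get("oldest_pending_age_minutes")
    if oldest is not None and oldest >= PENDING_AGE_WARN_MINUTES:
        findings.append(f"oldest pending job is {oldest} minutes old")

    # Surface only the jobs the API itself flags as stale or retryable.
    for job in response.get("jobs", []):
        if job.get("is_stale") or job.get("can_retry"):
            findings.append(f"job {job.get('id')}: {job.get('recovery_hint')}")
    return findings
```

The output is meant as a prompt for manual review, not as an automatic trigger for recovery actions.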
Typical checks
All pending jobs
GET /v1/admin/jobs?state=pending
Use this when publication, search, or cleanup work appears delayed.
Stale running jobs
GET /v1/admin/jobs?state=running
Then inspect summary.stale_jobs_count.
If stale jobs are present, compare locked_until with the current time and confirm that the corresponding worker instance is no longer healthy before intervening.
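The locked_until comparison can be sketched as below. It assumes locked_until is serialized as an ISO 8601 timestamp (for example `2024-05-01T12:00:00Z`); the runbook does not state the format, and `lock_expired` is a hypothetical helper.

```python
from datetime import datetime, timezone


def lock_expired(locked_until, now=None):
    """Return True if the lock timestamp is at or before `now` (UTC by default).

    Assumes an ISO 8601 string; a trailing "Z" is normalized for older
    Python versions whose fromisoformat() does not accept it.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    parsed = datetime.fromisoformat(locked_until.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed <= now
```

An expired lock alone does not prove the worker is gone; the worker health check described above still applies before any intervention.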
One queue family only
GET /v1/admin/jobs?kind=cleanup_oci_blobs
GET /v1/admin/jobs?kind=scan_artifact
GET /v1/admin/jobs?kind=reindex_search
Use job-kind filters to separate publish-path failures from maintenance-path failures.
Recovery guidance
Pending backlog is growing
- verify API and worker processes are both running
- check PostgreSQL health and connection saturation
- confirm Redis availability if you expect Redis-backed features in the deployment
- look for repeated dead jobs that indicate a systematic handler failure
Stale running jobs are reported
Publaryn workers already attempt stale-job recovery during their periodic queue sweeps. If summary.stale_jobs_count remains non-zero for multiple recovery intervals:
- confirm the worker process responsible for the stale lock is gone or hung
- restart or replace the worker deployment
- call POST /v1/admin/jobs/recover-stale to reset abandoned running locks to pending
- recheck GET /v1/admin/jobs to confirm the jobs return to pending and are claimed again
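The recover-stale call from the steps above might be issued as in this sketch, using only the Python standard library. The `Authorization: Bearer` scheme, the JSON response body, and the helper names are assumptions; check the deployment's API documentation for the actual authentication header.

```python
import json
import urllib.request


def recover_stale_request(base_url, token):
    """Build (but do not send) the POST /v1/admin/jobs/recover-stale request.

    Assumption: the API token is passed as a Bearer token.
    """
    url = base_url.rstrip("/") + "/v1/admin/jobs/recover-stale"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def send(request, timeout=30):
    """Send a prepared request and decode a JSON body if one is returned."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = resp.read()
        return json.loads(body) if body else None
```

Building the request separately from sending it makes it easy to review or log the call before it touches a production queue.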
Dead-lettered jobs accumulate
- inspect last_error to identify whether the failure is data-specific or systemic
- correct the underlying storage, database, or handler issue first
- only then call POST /v1/admin/jobs/{job_id}/retry for failed or dead jobs that are safe to replay
The retry endpoint preserves last_error for diagnosis, resets the job to pending, clears stale lock/completion fields, and resets attempts so normal worker backoff can start again from a clean operator-initiated replay.
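Choosing which jobs to replay could look like the following sketch. It assumes each job object exposes `id` and `state` fields alongside the documented `can_retry` flag; those two field names are not confirmed by this runbook, and `retry_paths` is a hypothetical helper.

```python
from urllib.parse import quote

# Only failed or dead jobs are candidates for operator-initiated retry.
RETRYABLE_STATES = {"failed", "dead"}


def retry_paths(jobs):
    """Return retry endpoint paths for jobs the API marks as retryable.

    Assumes "id" and "state" keys on each job (field names are assumptions).
    """
    paths = []
    for job in jobs:
        if job.get("state") in RETRYABLE_STATES and job.get("can_retry"):
            # Percent-encode the id so it is safe as a single path segment.
            job_id = quote(str(job["id"]), safe="")
            paths.append(f"/v1/admin/jobs/{job_id}/retry")
    return paths
```

The returned paths are candidates only; each should still be reviewed against last_error before a POST is sent, since the underlying issue must be fixed first.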
Notes
- GET /v1/stats complements this runbook with a public top-level job_queue_pending counter for quick smoke checks.
- Operator recovery actions are audited with admin_job_retry and admin_jobs_recover_stale audit events.
- Broad abuse, takedown, and full operator-console workflows remain outside the 1.1.0 recovery scope.