
Production Deployment Patterns That Prevent Cascading Failures on a Shared VPS

By AgentMinds Intelligence · 14 min read

Tags: vps, deployment, production, patterns, linux, cron, database, docker, python

The monitoring dashboard went green across the board, but the data export pipeline had been dead for over a week. No alerts fired, no error logs appeared—just an empty /mnt/data partition that should have held 14GB of compressed CSV snapshots. When a customer finally asked why their reports hadn’t updated in eight days, we found a single crontab entry pointing at a script that had been exiting with code 1 since the moment it was deployed. The root cause was a missing logrotate configuration that had let the application log grow to 48GB, filling the root partition. Cron attempted to start the export job, hit ENOSPC, and died silently because nobody had wired up MAILTO or a wrapper script that trapped exit codes. The eight-day gap cost us a mid-tier enterprise account and four engineer-days of scramble. Since then, we’ve re‑architected every deployment pipeline on our Contabo VPS around a set of patterns that treat silent failure as the default state and enforce visibility at each step.

When Cron Goes Silent: The Eight-Day Cascade

The export job wasn’t complex—a Python script that queried Postgres, wrote Parquet files, and synced them to an S3-compatible bucket. The crontab entry was standard: 0 2 * * * /opt/export_processor/run.sh. Under normal conditions, the script logged to /opt/export_processor/logs/exporter.log, and because the application log had its own rotation, we assumed everything was fine. But /opt/export_processor lived on the root filesystem, and a separate integration service had been writing 2MB/minute of debug logs into /var/log/integration/app.log for three months without rotation. When the root partition hit 100%, the export script’s open() call failed with ENOSPC, and the script’s narrow except FileNotFoundError clause didn’t catch the broader OSError that a full disk raises. The script exited with a non‑zero code, but cron, configured without MAILTO, simply discarded the exit status.
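In hindsight, the narrow exception handling was the real trap. A minimal sketch of the kind of handler that would have surfaced the failure (the function name and exit code here are illustrative, not our actual export code):

import errno
import sys

def write_snapshot(path: str, data: bytes) -> None:
    """Illustrative write path: fail loudly on a full disk instead of dying silently."""
    try:
        with open(path, "wb") as fh:
            fh.write(data)
    except FileNotFoundError:
        # Missing parent directory: a different problem, let it propagate as-is
        raise
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Disk full: write a clear message to stderr and exit non-zero
            # so the cron wrapper described below can alert on the exit code
            print(f"FATAL: no space left on device while writing {path}", file=sys.stderr)
            sys.exit(2)
        raise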

How We Finally Detected It

We only found the failure because the customer filed a support ticket. After freeing the partition by truncating the runaway log, we immediately built a 20‑line bash wrapper that every cron job now runs through. The wrapper captures stdout and stderr to a timestamped file, checks the exit code, and sends a webhook to a Slack channel if anything isn’t zero. Importantly, the wrapper sets up logging and locking before invoking the actual command, so even a job that cannot start because the disk is full still leaves an error record. The wrapper pattern is now mandatory for any project under /opt.

#!/bin/bash
# /opt/<project>/locks/cron_wrapper.sh
set -euo pipefail
PROJECT="$1"
CMD="$2"
LOG_DIR="/opt/$PROJECT/logs"
LOCK_FILE="/opt/$PROJECT/locks/cron_${PROJECT}.lock"
WEBHOOK_URL="https://hooks.slack.com/services/T..."

mkdir -p "$LOG_DIR" "/opt/$PROJECT/locks"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="$LOG_DIR/cron_${PROJECT}_${TIMESTAMP}.log"

# Atomic non-blocking lock around the job; capture the exit code explicitly
# so set -e does not abort the wrapper before we can report the failure
EXIT_CODE=0
flock -n "$LOCK_FILE" bash -c "$CMD" > "$LOG_FILE" 2>&1 || EXIT_CODE=$?

if [ $EXIT_CODE -ne 0 ]; then
    echo "Cron job $PROJECT failed with exit $EXIT_CODE" >> "$LOG_FILE"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"CRON FAIL: $PROJECT exit $EXIT_CODE\"}" "$WEBHOOK_URL"
fi

exit $EXIT_CODE

This wrapper delivers three benefits: every execution produces a named log file, conflicting runs are prevented by flock, and failures push a notification within seconds. The only cost is rewriting the crontab entry to go through the wrapper: 0 2 * * * /opt/export_processor/locks/cron_wrapper.sh export_processor "/opt/export_processor/run.sh". We’ve since caught five disk‑full events before they impacted customers.
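As a second safety net on top of the webhook, the crontab itself can set MAILTO so cron mails any stray output; a sketch, assuming a monitored inbox and a working local MTA:

# crontab for the service user (illustrative address)
MAILTO=ops@example.com
0 2 * * * /opt/export_processor/locks/cron_wrapper.sh export_processor "/opt/export_processor/run.sh"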

Disk Fills at 3 AM: Why Log Rotation Is Not Optional

In a shared VPS environment where multiple projects coexist, disk is the cheapest resource to provision but the most expensive to exhaust silently. The integration service that caused our outage was a Node.js app whose bunyan logger wrote to a single file. The developer had intended to add logrotate but considered it “operational overhead” and deferred it. When the root volume filled, systemd‑journald stopped accepting logs, crond couldn’t write to its spool, and even SSH became sluggish because /tmp was on the same partition. The entire machine degraded, yet no external monitor noticed because our health‑check endpoint still returned 200—it was served from an in‑memory cache.

The Minimal Viable logrotate Configuration

For any project that writes to local files, we enforce a logrotate snippet that compresses old logs and limits total size. The configuration is placed in /etc/logrotate.d/ and requires no cron entry from the developer, because logrotate itself is driven by a systemd timer or a global crond on most distributions. Our standard pattern rotates daily, keeps 14 days, compresses with gzip, and uses maxsize to force rotation as soon as a file exceeds 50 MB, regardless of the daily schedule.

# /etc/logrotate.d/integration_service
/var/log/integration/app.log {
    daily
    rotate 14
    compress
    delaycompress
    maxsize 50M
    missingok
    notifempty
    create 640 integration integration
}

delaycompress ensures the most recent rotated file stays readable without decompression for a day, which helps debugging. maxsize 50M acts as a safety net if the application emits a burst faster than the daily schedule can absorb, though it only takes effect when logrotate itself runs, so pair it with an hourly logrotate timer if bursts are a real risk. After deploying this across all projects, disk‑full events from local logs dropped to zero. The overhead is negligible—gzip compression adds 2‑3% CPU sporadically—and the alternative is a production outage that takes down every co‑resident service.

When Logrotate Alone Isn’t Enough

Logrotate solves the file‑size problem, but not the visibility problem. If rotation fails (e.g., because of a permission change), logrotate will write to /var/log/messages but won’t wake anyone up. We therefore pair it with a basic logwatch summary that emails yesterday’s logrotate output, giving us daily confirmation that rotation actually ran. For teams that centralise logs, the decision points differ. The table below summarises when to rely purely on logrotate versus shipping to a collector.

| Scenario | Use logrotate | Ship to central collector |
|---|---|---|
| Single VPS, < 5 projects | Yes, with email summary | Optional |
| Multi‑VM cluster | As fallback only | Required |
| Disk‑sensitive (e.g., 10 GB root volume) | Yes, aggressive sizes | Yes, with short retention |
| High‑frequency debug logs (> 10 MB/min) | No—rotation won’t keep up | Yes, use ring buffer or stream |
| Compliance (PCI, SOC2) | No | Yes, with tamper‑proof storage |
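For the single-VPS case, the daily confirmation doesn’t strictly need logwatch; a blunt staleness check works too. A minimal sketch, assuming the paths from the config above and the same Slack webhook as the cron wrapper (the script location is hypothetical):

#!/bin/bash
# /opt/monitoring/check_logrotate.sh (hypothetical path), run daily from cron
# Alert if app.log has not produced a fresh rotated file in the last two days.
set -u
WEBHOOK_URL="https://hooks.slack.com/services/T..."
FRESH=$(find /var/log/integration -name 'app.log.1*' -mtime -2 2>/dev/null | head -n 1)
if [ -z "$FRESH" ]; then
    curl -s -X POST -H 'Content-type: application/json' \
        --data '{"text":"logrotate check: no fresh rotation of integration app.log"}' "$WEBHOOK_URL"
fi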

Concurrency Collisions: The Case for Advisory Locks

After fixing the disk problems, a new failure mode emerged: the export job started taking 45 minutes instead of the usual 12, and during that window, cron would launch a second instance. The two processes competed for the same Postgres read pool and S3 upload slots, causing both to time out. The symptom was a slowly growing queue of frozen connections on the database, eventually triggering asyncpg.exceptions.ConnectionDoesNotExistError. We needed a way to ensure only one instance of the export pipeline ran at a time, without relying on a distributed lock manager that would add infrastructure.

Advisory file locks—specifically flock with the -n flag—solved this with zero external dependencies. The kernel’s flock(2) advisory locking makes acquisition atomic, so flock -n /path/to/lock returns immediately with exit code 1 if another process holds the lock. This is precisely the behaviour we want for cron: if the previous run is still active, skip this invocation and log a warning.

#!/bin/bash
# Inside the actual job script /opt/data_export/export.sh
LOCKFILE="/opt/data_export/locks/export.lock"
# Open file descriptor 200 on the lock file; the fd stays open for the
# lifetime of the script, so the lock is released automatically on exit
exec 200>"$LOCKFILE"
flock -n 200 || {
    echo "[$(date)] Export already running, exiting" >&2
    exit 0
}
# ... long‑running export logic ...
flock -u 200  # explicit release; redundant with process exit, but documents intent

The Subtle Danger of Non‑Blocking Locks

While flock -n prevents overlapping runs, it introduces a new decision: what happens when the job is skipped? If the job is a daily report, skipping a cycle may be acceptable. If it’s a streaming pipeline that must eventually catch up, skipping can create a widening gap. In one case, a financial reconciliation job kept being skipped for three consecutive nights because upstream ETL was slow. The gap went unnoticed because the exit code 0 meant the wrapper’s Slack webhook never fired. We now differentiate: jobs that must not be skipped use flock without -n, which makes the second invocation block until the first finishes. That blocking behaviour is paired with a timeout at the wrapper level to avoid silently piling up cron processes.

# Blocking variant with timeout for mandatory jobs
timeout 3600 flock "$LOCKFILE" bash -c "$CMD"

The timeout is critical: without it, a hung job would leave every subsequent invocation blocked on the lock, quietly piling up cron-spawned processes (and some cron implementations cap the number of simultaneous jobs). After we shipped the lock pattern, we audited all cron definitions and found four jobs that had been racing for months, consuming 6 GB of extra RAM during overlap. The lock pattern eliminated those phantom costs immediately.

Database Connections in Freefall: Pool Exhaustion and Fallback Chains

A separate incident taught us that local fixes can cascade across services. Our Python API uses asyncpg with a connection pool of 20. One evening, the Gemini API endpoint we rely on for document classification started returning 503 errors, which our client library retried up to five times with exponential backoff. Each retry held an open database connection while waiting, because the request handler opened a transaction, called Gemini synchronously inside it, and only committed after the external call returned. At peak, each request took 65 seconds, and with twenty concurrent requests all waiting on Gemini while holding connections, the pool hit zero available connections; every subsequent request threw asyncpg.exceptions.TooManyConnectionsError. The API began returning 500s, and our health‑check, which also used the database, went red.

Building a Fallback Chain

The fix wasn’t to increase the pool size—that merely delays the exhaustion. Instead, we restructured the pipeline so that no external network call holds a database transaction. The Python service now uses a three‑tier fallback: (1) local cache (Redis) if the classification already exists, (2) Gemini API with a 5‑second timeout, (3) a local ML model served via ONNX runtime. Only the local model path writes to the database, and it does so in a dedicated, short‑lived transaction.

import asyncio

import asyncpg


async def classify_document(pool, doc_id: str):
    # Tier 1: Redis cache
    cache_key = f"cls:{doc_id}"
    cached = await redis.get(cache_key)
    if cached:
        return cached

    # Tier 2: Gemini with a hard timeout; Tier 3: local ONNX model on timeout
    try:
        result = await asyncio.wait_for(gemini_classify(doc_id), timeout=5.0)
    except asyncio.TimeoutError:
        result = await local_onnx_classify(doc_id)

    # The connection is acquired only now, so it is held just for the insert
    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.execute(
                "INSERT INTO classifications (doc_id, label) VALUES ($1, $2)",
                doc_id, result,
            )
    await redis.set(cache_key, result, ex=3600)
    return result

The acquire() call here never occurs until the external work is done, so connections are held only for the duration of the insert. We also added a circuit breaker that disables the Gemini path for 120 seconds after three consecutive 503s, relying entirely on the local model. The 0.0007 USD end‑to‑end cost per query we achieved with this hybrid routing—87% of queries answered by cache or local model—is a story in itself, but the operational win was instant: zero connection pool exhaustion incidents since deployment.
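The breaker itself is a few dozen lines. A minimal sketch of the pattern, assuming the gemini_classify and local_onnx_classify coroutines from the snippet above; the class name and the exception caught for a 503 are illustrative:

import asyncio
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, stay open for `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 120.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.open_until = 0.0

    def allow(self) -> bool:
        return time.monotonic() >= self.open_until

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.open_until = time.monotonic() + self.cooldown
            self.failures = 0

gemini_breaker = CircuitBreaker()

async def classify_with_fallback(doc_id: str):
    # While the breaker is open, skip Gemini entirely and use the local model
    if not gemini_breaker.allow():
        return await local_onnx_classify(doc_id)
    try:
        result = await asyncio.wait_for(gemini_classify(doc_id), timeout=5.0)
        gemini_breaker.record_success()
        return result
    except (asyncio.TimeoutError, GeminiServerError):  # GeminiServerError: whatever the client raises on a 503
        gemini_breaker.record_failure()
        return await local_onnx_classify(doc_id)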

Sizing the Connection Pool Correctly

Pool sizing is a frequent source of cargo‑culting. The table below captures what we’ve seen work on a single VPS with 4 vCPUs and 8 GB RAM.

| Workload profile | Recommended pool size | Rationale |
|---|---|---|
| Short‑lived reads (< 5 ms) | 20–30 | Postgres can handle hundreds of concurrent short reads |
| Mixed read/write, external API calls | 5–10 | Keep connections free for writes; isolate external calls |
| Heavy batch writes | 2–5 | Use executemany and connection pinning |
| Connection‑per‑request (no pool) | N/A | Avoid at all costs on a VPS; a pooler like PgBouncer is the minimum |

We run PgBouncer in transaction‑pooling mode in front of Postgres, which multiplexes client connections. That configuration alone reduced connection churn by 90%.
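The relevant PgBouncer settings fit in a dozen lines. A minimal sketch of the configuration (database name, ports, and pool sizes are illustrative, not our exact values):

; /etc/pgbouncer/pgbouncer.ini (excerpt)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 200
default_pool_size = 10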

The Structure That Survives: Project Layout on Shared VPS

When you have six Python APIs, four Node services, and a handful of shell‑driven pipelines, keeping them in /root/ or scattered across /home/ becomes a magnet for accidents. We standardised on /opt/<project>/ after losing a production .env file because a developer ran rm -rf /home/ubuntu/old_project without realising the current project symlinked there. The discipline is simple: every project lives under /opt, with a consistent structure that makes permissions, logs, and locks predictable.

/opt/
├── data_export/
│   ├── code/               # Git clone or tarball
│   ├── logs/               # All log files
│   ├── locks/              # flock lock files
│   ├── .env                # chmod 600, owned by the service user
│   └── requirements.txt
├── api_gateway/
│   ├── code/
│   ├── logs/
│   └── .env
└── ...

The .env file holds secrets and environment variables. It is sourced by systemd unit files or cron wrappers, never committed to Git. We enforce chmod 600 .env through a pre‑deploy check script that runs before any new code goes live.
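The pre‑deploy check is deliberately simple. A minimal sketch of what such a gate can look like (the path and exact checks are illustrative):

#!/bin/bash
# Hypothetical pre-deploy gate: refuse to ship if any /opt/<project>/.env is readable by group or world
set -euo pipefail
FAILED=0
for env_file in /opt/*/.env; do
    [ -e "$env_file" ] || continue            # glob matched nothing
    PERMS=$(stat -c '%a' "$env_file")
    if [ "$PERMS" != "600" ]; then
        echo "REFUSING DEPLOY: $env_file has mode $PERMS, expected 600" >&2
        FAILED=1
    fi
done
exit "$FAILED"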

Avoiding Docker and UFW Conflicts in This Layout

One subtlety of /opt on a VPS running Docker is that Docker’s iptables rules can bypass or conflict with UFW. We discovered this when a containerised service bound to 0.0.0.0:8080 was reachable from the internet despite UFW denying all incoming traffic except SSH: Docker inserts its own FORWARD rules that take precedence over UFW’s INPUT chain. Because our project layout puts the docker-compose.yml inside /opt/<project>/code/, we now also ship a 30-docker override file for UFW that re‑applies restrictions after Docker starts. On a dedicated host with no public‑facing services other than SSH, the simplest and safest solution is to disable UFW entirely and rely on Docker’s --iptables=false or on a cloud firewall. On Contabo, however, we kept UFW but added ufw allow from 10.0.0.0/8 for the internal Docker networks. The opposite failure mode bit us too: debugging why a curl against the host’s own public IP on port 8080 timed out took four hours of stracing docker‑proxy before we traced the iptables chains and saw UFW dropping the redirected packets.

iptables vs. UFW: When Docker Proxies Get Blocked

The exact chain of events is worth detailing because it’s a classic example of tools that silently undermine each other. Docker’s default --iptables=true modifies the PREROUTING and FORWARD chains to enable container networking. UFW, meanwhile, flushes and re‑applies its rules on boot, and in its default configuration sets the FORWARD policy to DROP. When a request arrives on the host’s public IP on port 8080, the kernel routes it through PREROUTING (Docker’s DNAT rule) and into FORWARD, where UFW drops it. The request never reaches INPUT, so UFW’s own allow 8080 rule is irrelevant. The symptom is that a curl to the host’s own public IP hangs, while curl localhost:8080 works fine because localhost traffic never traverses the FORWARD chain. This inconsistency wasted half a shift.
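The quickest way to confirm this on a live host is to watch the FORWARD chain’s packet counters while reproducing the hanging request; the counters on UFW’s chains (or on the chain’s DROP policy header) climb while nothing ever shows up in INPUT:

# Show FORWARD rules with packet/byte counters and rule numbers
iptables -L FORWARD -n -v --line-numbers
# Re-run the failing curl in another terminal and watch which counters move
watch -n 1 "iptables -L FORWARD -n -v"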

The Surgical Fix

We now explicitly manage iptables with a small init script that runs after Docker starts. The script inserts explicit ACCEPT rules at the top of the FORWARD chain, ahead of UFW’s default drop, so bridge traffic is never swallowed by the policy. The alternative—disabling Docker’s iptables integration and manually routing ports—is more predictable but requires every container to use host networking, which loses isolation.

#!/bin/bash
# /opt/iptables_fix/fix_docker_ufw.sh
# Run as part of the boot sequence, after docker.service
sleep 10  # let Docker settle and install its own chains first
# Insert ACCEPT rules at the very top of FORWARD so bridge traffic
# is matched before UFW's default DROP policy
iptables -I FORWARD 1 -i docker0 -o docker0 -j ACCEPT   # container <-> container on docker0
iptables -I FORWARD 2 -i docker0 ! -o docker0 -j ACCEPT # container -> outside world

Since --iptables=false is a daemon‑level dockerd flag rather than a per‑container option, we reserve it for hosts where no container publishes ports at all, and we fall back to network_mode: host for containers that serve only localhost clients. The table below summarises the trade‑offs we evaluated.

| Approach | Isolation | Complexity | Risk of silent breakage |
|---|---|---|---|
| UFW + Docker default iptables | Medium | Low | High—forwarding chain mismatch |
| Disable UFW, rely on cloud firewall | Low | Low | Low if cloud firewall is active |
| Docker with --iptables=false + manual rules | High | High | Medium—manual rules can drift |
| Host networking for all containers | None | Lowest | Low—no iptables surprises |
| Cilium or similar eBPF replacement | High | High | Low—if maintained correctly |

For a small VPS running under ten containers, disabling UFW and depending on the provider’s firewall (Contabo’s is adequate) has proven the pragmatic choice. We lost no measurable security because the host has no public‑facing services beyond SSH, which is key‑only.

Everything Hangs on a Wrong ID: Foreign Key Violations and Connection Draining

A final pattern emerges from the interaction between application logic and connection pooling. In one of our Node‑based microservices, a batch import endpoint accepted a JSON array of customer IDs and UPSERTed them into a customers table. If any ID in the array referenced a non‑existent company_id in a foreign table, Postgres would abort the entire transaction with a foreign‑key violation. The Node service used pg pool with statement_timeout=30s. Because the violation happened in a transaction, the connection was left in an aborted state (current transaction is aborted, commands ignored until end of transaction block). The pool, unaware of the transaction state, handed that same connection to the next client request, which failed with the same cryptic error. After three such failures, the pool drained and the service hung.

Session vs. Transaction Pooling and the FK Race

The root cause was a mix of session‑based pooling (the default for pg-pool) and an ORM that did not issue ROLLBACK on error before releasing the connection. The fix was twofold: switch to transaction‑level pooling in PgBouncer, and wrap every database call in the application with a try/catch that explicitly performs ROLLBACK.

// Node.js with pg, explicit rollback wrapper
const executeWithRollback = async (pool, fn) => {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const result = await fn(client);
    await client.query('COMMIT');
    return result;
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
};
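Called from the batch import endpoint, the wrapper looks like this (the IDs and the UPSERT statement are illustrative):

// Each batch runs in its own transaction; on an FK violation the connection is
// rolled back and released in a clean state instead of poisoning the pool.
const imported = await executeWithRollback(pool, async (client) => {
  const res = await client.query(
    `INSERT INTO customers (id, company_id) VALUES ($1, $2)
     ON CONFLICT (id) DO UPDATE SET company_id = EXCLUDED.company_id`,
    [customerId, companyId]
  );
  return res.rowCount;
});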

Why SERIALIZABLE Is Overkill Here

Some engineers reach for SERIALIZABLE isolation to avoid these issues, but on a single VPS with mixed workloads, the retry overhead from serialisation failures often exceeds the cost of disciplined use of READ COMMITTED. The real insight is that connection pool draining is a symptom of an application that does not treat a connection as a finite, stateful resource. Every path that acquires a connection must release it in a well‑known state—either committed or rolled back. We now run integration tests that intentionally trigger FK violations and verify that the pool still has healthy connections afterward.

# Python integration test ensuring pool health after an FK violation
# (requires pytest-asyncio for the async test function)
import asyncpg
import pytest


@pytest.mark.asyncio
async def test_pool_health_after_fk_violation(pool):
    async with pool.acquire() as conn:
        with pytest.raises(asyncpg.ForeignKeyViolationError):
            await conn.execute("INSERT INTO orders (customer_id) VALUES (99999)")
    # Pool should still hand out healthy connections afterwards
    async with pool.acquire() as conn:
        assert await conn.fetchval("SELECT 1") == 1

This test has prevented three separate deployments that would have leaked connections silently.
