
Production Hardening Patterns That Kill Silent Failures

A cron job sat in /root/crontab for eight days, running to completion and returning exit code 0. Its output was discarded, and the script it pointed at hadn't lived at that path since the last refactor. No one noticed until a customer complained. We've seen this shape of failure across dozens of production sites: not a bug in logic, but a gap in process hygiene.

What follows are the five patterns we now enforce on every VPS deployment, lifted directly from post-mortems where silence turned into 4–8 days of unnoticed failure. They aren't novel; they are the boring ops work that prevents "unknown unknown" outages.

Directory standard: /opt/project is not negotiable

The first time we traced a missing script to a cd that depended on the cron user's home directory, we banned /root/ and /home/ for application code. Every project lives under /opt/<project>/ with fixed subdirectories:

/opt/myagent/
├── code/         # version-controlled application
├── logs/         # all application output
├── locks/        # flock lock files
└── .env          # chmod 600, never in code

The .env is loaded once from a deterministic path, removing any cwd dependence. In Python:

from pathlib import Path
import dotenv

dotenv.load_dotenv(Path('/opt/myagent/.env'))

In TypeScript, the same idea:

import dotenv from 'dotenv';
dotenv.config({ path: '/opt/myagent/.env' });

When every process knows exactly where its configuration and artifacts live, silent path changes disappear.
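Where python-dotenv isn't available, the same fail-fast idea fits in a few lines of stdlib code. A minimal sketch, assuming a simple KEY=VALUE file; the REQUIRED keys are hypothetical, for illustration only:

```python
from pathlib import Path

# Hypothetical required keys; adjust for your project.
REQUIRED = ("API_KEY", "DATABASE_URL")

def load_env(path: Path = Path("/opt/myagent/.env")) -> dict:
    """Parse KEY=VALUE lines from one fixed path; fail loudly on any gap."""
    if not path.is_file():
        raise FileNotFoundError(f"missing env file: {path}")
    env = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    missing = [k for k in REQUIRED if k not in env]
    if missing:
        raise RuntimeError(f"{path} is missing keys: {missing}")
    return env
```

Failing at startup when a key is absent turns a silent misconfiguration into an immediate, logged crash.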

Concurrency protection: flock -n is mandatory for cron

A cron job that runs longer than its interval spawns parallel instances. We saw a publisher that took 12 minutes but ran every 5—300 copies piled up, swapping the server to death. The fix is a non-blocking lock via flock:

#!/bin/bash
# /opt/myagent/code/publish.sh
LOCK_FILE="/opt/myagent/locks/publish.lock"
exec 200>"$LOCK_FILE"    # open fd 200 on the lock file
flock -n 200 || exit 1   # non-blocking: if another run holds the lock, exit now
python /opt/myagent/code/publisher.py
exec 200>&-              # close the fd, releasing the lock

The -n flag makes the lock attempt non-blocking; if the lock is held, the script exits immediately. No orphaned processes, no swap storms.
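The same non-blocking semantics are available from Python via fcntl.flock, useful when a job wants to guard itself rather than rely on a shell wrapper. A minimal sketch (the lock path in the comment is illustrative):

```python
import fcntl

def try_lock(path: str):
    """Attempt an exclusive, non-blocking flock on `path`.

    Returns the open file object (keep it open for the lock's lifetime),
    or None if another holder already has the lock.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)  # same as `flock -n`
        return fh
    except BlockingIOError:
        fh.close()
        return None

# e.g. at the top of the publisher:
#   if try_lock("/opt/myagent/locks/publish.lock") is None:
#       raise SystemExit(1)  # mirror the shell script: exit immediately
```

Note that flock treats separately opened file descriptors independently, so a second attempt fails even within the same process; the lock is released when the file object is closed or the process exits.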

Log rotation must exist before the first cron entry

A VPS with a 20 GB disk filled up completely when a chatty service ran for 14 days without rotation. The application was writing 300 MB/day. Disk full means silent process death, corrupted writes, and a Saturday morning rescue. We now require a logrotate configuration as part of the deploy:

# /etc/logrotate.d/myagent
/opt/myagent/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    size 50M
}

This is applied before the first cron job that produces logs. If left to later, it's never done.
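One way to enforce that ordering is a deploy-time guard that refuses to install cron entries until the rotation config exists. A hypothetical sketch (the function name and path are illustrative):

```python
from pathlib import Path

def assert_logrotate_configured(conf: Path) -> Path:
    """Refuse to proceed with cron installation until rotation is in place."""
    if not conf.is_file():
        raise RuntimeError(
            f"no logrotate config at {conf}; install it before adding cron jobs"
        )
    return conf

# e.g. in the deploy script:
#   assert_logrotate_configured(Path("/etc/logrotate.d/myagent"))
```

Making the check part of the deploy script means "do it later" is no longer an option.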

Silent cron must become loud: failure detection

The opening anecdote isn't hyperbole: a script can return 0 while doing nothing. Every cron job must be wrapped to alert on non-zero exit and, critically, to be detectable by external monitoring. We use a lightweight wrapper that logs output and pings a health-check endpoint:

#!/bin/bash
# /opt/myagent/code/cron_runner.sh
LOG="/opt/myagent/logs/cron_$(date +%Y%m%d).log"
LOCK="/opt/myagent/locks/cron_runner.lock"
SCRIPT="$1"
shift

(
    flock -n 200 || { echo "Lock exists, exiting."; exit 0; }
    "$SCRIPT" "$@" 2>&1
    STATUS=$?
    if [ $STATUS -ne 0 ]; then
        # ping a monitoring service on failure
        curl -s -X POST "https://hc-ping.com/your-uuid/fail" > /dev/null
        echo "[FAIL] $SCRIPT exited with $STATUS"
    else
        curl -s -X POST "https://hc-ping.com/your-uuid" > /dev/null
    fi
    exit $STATUS
) 200>"$LOCK" >> "$LOG" 2>&1

Cron entry:

*/10 * * * * /opt/myagent/code/cron_runner.sh /opt/myagent/code/publisher.py --count 50

Now every execution is logged, locked, and monitored. Four days of silence become impossible.
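The hc-ping pattern is a dead man's switch: the monitoring side treats silence itself as the alarm. The server-side check reduces to a timestamp comparison; a minimal sketch, assuming the 10-minute schedule above plus a hypothetical 2-minute grace window:

```python
SCHEDULE_S = 10 * 60   # matches the */10 cron entry above
GRACE_S = 2 * 60       # assumed slack for slow runs and clock skew

def is_overdue(last_ping: float, now: float,
               period: float = SCHEDULE_S, grace: float = GRACE_S) -> bool:
    """True once a job has missed its expected ping window."""
    return (now - last_ping) > (period + grace)
```

Services like Healthchecks.io implement this check for you; the point is that the alert fires on a missing success signal, not only on an explicit failure signal.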

Network hairpins: when UFW meets Docker

A Docker container on a Contabo VPS started correctly with port 3000 published, yet curl http://127.0.0.1:3000 timed out: the port was reachable from outside the host but not from the host itself. The cause: UFW's default-deny FORWARD policy was blocking the docker-proxy traffic that hairpins from the host's loopback through iptables back to the container. It's a low-level networking surprise that took an hour to diagnose.

We now unconditionally disable UFW on single‑container VPSes where all ports are bound to localhost and traffic comes via a reverse proxy:

ufw --force disable

If you must keep UFW, add explicit rules for the Docker bridge interface. But the real lesson isn't about UFW: it's that network instrumentation must be as explicit as the rest of the stack. Checking that iptables -L came back clean after the disable is what saved that evening.

The same pattern—a process fails silently because its environment isn't what the author assumed—plays out in build chains. One team spent a day debugging a Turbopack build that broke on a CJS package using dynamic filesystem reads. The error was clear after enabling verbose output; the CI had swallowed it. Explicit configuration, explicit paths, and loud failures are the only defense.
