The Silent Scheduler and Four Other VPS Configurations That Nearly Killed Our AI Pipeline
On a Tuesday morning, the silence was deafening. For the previous 96 hours, our Wiki Publisher task—a cron job that pulls from an AI ingestion pipeline and transforms raw markdown into structured knowledge articles—had generated exactly zero articles. No error logs, no crash reports, not even a failed execution status. The job ran every hour on the dot, burned a few milliseconds of CPU time, and exited in under a second. It simply forgot to bring its script. We discovered it only when a customer notified us that their 'latest updates' section had been frozen for nearly a week. That moment cost us: 14 engineering hours of root-cause analysis, 2 critical customer escalations, and a quiet sense that our production systems were far more brittle than we’d admitted.
The Silent Scheduler: When Cron Forgets Its Own Script
The crontab entry read: 0 * * * * /usr/bin/python3 --count 100. The --count argument was intended for a script that never got invoked, because the command contained no path to a Python file. Cron executed /usr/bin/python3 with the argument --count 100, which CPython rejects as an unknown option: it prints a usage message to stderr and exits with status 2. Because cron sends stdout and stderr to mail by default (and we had no MTA configured to catch it), that usage message evaporated, and nothing watched the exit status. The task ran for 96 cycles, never raising a flag until a manual check of the output volume chart showed a flatline.
We traced the origin to a hasty configuration update where someone assumed the Arguments field in our scheduler frontend would be appended to a pre-defined script path. Instead, it was passed directly to the command line, with no script path at all. This kind of silent failure is insidious precisely because cron’s contract is so minimal: it spawns the command at the scheduled time and, unless something consumes its mail or monitors exit codes, discards the outcome. There is no built-in validation that the command actually did anything useful.
The fix required two layers. First, we corrected the crontab to include the full system path to the script and its arguments. Second, we wrapped every cron-based job with a one-line validation script that checks for the existence of critical files and exits non-zero if they’re missing, so the failure propagates to monitoring. The incident taught us that a cron job is not "configured" until it has an explicit heartbeat check—whether that’s a log metric, a sidecar health endpoint, or a simple file timestamp. Here’s the before and after:
# Broken: no script, just the interpreter and a stray argument
0 * * * * /usr/bin/python3 --count 100

# Fixed: explicit interpreter, full path to the script, and its argument
0 * * * * /usr/bin/python3 /opt/wiki_publisher/code/publisher.py --count 100
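The validation wrapper barely needs more lines than the crontab entry itself. A minimal sketch in Python (the path list and the preflight name are our illustration, not a standard tool):

```python
import os
import sys

# Hypothetical critical paths for this project; adjust per deployment.
REQUIRED = [
    "/opt/wiki_publisher/code/publisher.py",
    "/opt/wiki_publisher/.env",
]

def preflight(paths):
    """Return the subset of critical paths that do not exist."""
    return [p for p in paths if not os.path.exists(p)]

def main(paths=REQUIRED):
    """Exit-code-style result: 0 when everything is present, 1 otherwise."""
    missing = preflight(paths)
    if missing:
        print(f"preflight failed, missing: {missing}", file=sys.stderr)
        return 1  # a non-zero exit is what propagates the failure to monitoring
    return 0
```

The point of the wrapper is purely to convert "did nothing" into a loud, non-zero exit that the alerting layer can see.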
Disk Exhaustion: The Log Files That Ate the VPS
After restoring the pipeline, output surged. The AI agent was now producing articles at full tilt, and we had turned on verbose logging to monitor the new wrapper script. Within three days, the VPS disk filled to 98%. Write operations started failing, first for the database, then for the web server temp files. The entire service stack degraded, and we received a 2 AM alert from a panicked OpsGenie integration. We logged in to find 8.2 GB of log data in /opt/wiki_publisher/logs/, with the oldest file dating back to the first run after the fix.
The root cause was a simple omission: no log rotation mechanism existed for the project. Cron-based jobs that produce logs are especially dangerous because cron will happily append forever, and a busy AI pipeline can generate megabytes per hour when debug flags are on. On a shared VPS with multiple projects, a single unrotated log can starve the others of disk space, causing cross-project outages that are hard to diagnose because the first symptom is often a "No space left on device" error in an unrelated process.
We now enforce a strict policy: before any cron job goes live, the deployment script installs a logrotate configuration in /etc/logrotate.d/. The configuration below reflects the parameters we settled on after testing: daily rotation, 14 days of retention, compression, size-based rotation as a safety net, and correct permissions so that logs remain readable by the agent’s user. The missingok option suppresses errors when a particular log file doesn’t exist yet, and notifempty skips rotating empty files—both common situations during development.
# /etc/logrotate.d/wiki_publisher
/opt/wiki_publisher/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    maxsize 50M
    create 0640 wiki_publisher wiki_publisher
}
Adding this file to the configuration management ensures that no project can fill the VPS. In practice, we also set up a global check: a systemd timer runs df -h every hour and triggers an alert if any partition exceeds 80% usage. This gives us a 20% buffer to respond before the dreaded "No space left" errors cascade.
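If a systemd timer shelling out to df feels too coarse, the same threshold check fits in a few lines of standard-library Python (the 80% figure mirrors our alerting rule; the function names are our own):

```python
import shutil

def usage_percent(path):
    """Percentage of the filesystem containing *path* that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total

def partitions_over(threshold=80.0, paths=("/", "/opt")):
    """Return (path, percent) pairs for mount points above the threshold."""
    return [(p, usage_percent(p)) for p in paths if usage_percent(p) > threshold]
```

Anything this function returns becomes an alert; an empty list means we still have the buffer we want.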
Parallel Pandemonium: Flock as the Gatekeeper
With the pipeline healthy and logs under control, a new anomaly surfaced: duplicate articles started appearing in the knowledge base. The same content, with slight timestamp variations, would be inserted multiple times an hour. Investigation revealed that the previous run of the publisher was still in progress when the next cron trigger fired. Our Python script, which processes a batch of 100 raw markdown files, was taking anywhere from 45 to 75 minutes depending on network latency to the AI service. Under normal conditions, a 60-minute cron interval was safe, but any hiccup—a slow API response, a temporary network blip—pushed the execution time beyond the hour boundary, and a second instance launched.
This is a classic cron concurrency bug. Cron has no inherent protection against overlapping jobs; it just spawns a new process at the scheduled time. If the script is not reentrant-safe, the second instance will read the same set of files or share the same state, leading to duplicates or, worse, corrupt data. In our case, the publisher used a simple last-ID tracking file that wasn’t locked; the overlapping processes both read the same “last processed” ID and raced to write the same new articles.
The remedy is a lockfile with exclusive access semantics. We adopted flock because it’s available on virtually every Linux distribution and handles stale locks correctly when processes die. The -n flag makes it non-blocking: if the lock is already held, flock exits immediately with a non-zero exit code, and cron treats that as an error (which we can alert on). Here’s the updated crontab entry:
# Crontab with flock protection
0 * * * * /usr/bin/flock -n /opt/wiki_publisher/locks/publisher.lock /usr/bin/python3 /opt/wiki_publisher/code/publisher.py --count 100
We placed all lock files in a project-specific locks/ directory under /opt/ as part of our directory standard. Each lock file corresponds to exactly one cron job. This completely eliminated the duplicate issue and gave us a clear signal when a job was being skipped due to a lingering lock. Monitoring then focused on the frequency of lock skips—if a job routinely fails to acquire the lock, we know the interval needs tuning or the script needs optimization.
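The same semantics are available from inside Python via fcntl.flock, which is handy when a script should guard itself rather than depend on the crontab entry being written correctly. A sketch (the lock path is illustrative):

```python
import fcntl

def try_lock(lock_path):
    """Attempt a non-blocking exclusive lock, mirroring `flock -n`.

    Returns an open file handle while the lock is held, or None if
    another holder already has it. The kernel releases the lock
    automatically when the holding process exits, so a crash cannot
    leave a stale lock behind."""
    fh = open(lock_path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fh
    except BlockingIOError:
        fh.close()
        return None
```

The caller keeps the returned handle alive for the duration of the run; closing it (or dying) releases the lock.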
Docker Meets UFW: The Invisible Firewall Blockade
On the same VPS, we ran a lightweight Docker container that served an internal API used by the AI publisher. After a routine security review, we decided to harden the host firewall with UFW, setting the default policy to DROP and allowing only SSH and HTTP/HTTPS. The changes were applied at 5 PM, and by 5:01 PM, the API was unreachable—not from the outside world, as intended, but from localhost. curl 127.0.0.1:8080 would hang indefinitely, yet the Docker process showed the container running and its port mapping active.
We lost four hours of productivity chasing this phantom. The breakthrough came when we inspected the rules with iptables -L -n -v and noticed that packets bound for the container were matching UFW’s ufw-user-input DROP rule before Docker’s forwarding rules could act. Docker manipulates iptables directly to expose container ports: it installs DNAT rules in the nat table and ACCEPT rules for the docker0 bridge in the filter table. With UFW’s global DROP policy in place, its chains caught the traffic first, so connections to the mapped port on the host died before Docker’s port forwarding ever saw them.
# Showing how Docker and UFW chains conflict
$ iptables -L DOCKER -n -v
Chain DOCKER (2 references)
 pkts bytes target  prot opt in       out      source     destination
    0     0 ACCEPT  tcp  --  !docker0 docker0  0.0.0.0/0  172.17.0.2   tcp dpt:8080

$ iptables -L ufw-user-input -n -v
Chain ufw-user-input (1 references)
 pkts bytes target  prot opt in   out  source     destination
    0     0 DROP    all  --  *    *    0.0.0.0/0  0.0.0.0/0
The simplest fix, given our VPS was behind a cloud firewall that already restricted inbound traffic to port 22, was to disable UFW entirely. A more nuanced approach is to configure UFW to allow traffic on the docker0 bridge interface, or to start the Docker daemon with --iptables=false and rely on UFW alone, but that adds complexity and breaks Docker’s own networking features. For a shared VPS where only SSH is publicly exposed, the overhead of managing two iptables-level tools isn’t worth the risk. We now run ufw disable in our provisioning script and document the cloud firewall as the definitive perimeter.
Cross-Project Contamination: The Shared VPS Minefield
Our VPS hosted three separate services: the Wiki Publisher, a Lead Hunter agent that scanned web pages for email patterns, and a Pattern Engine that analyzed network-wide signals. They all ran under the same ubuntu user because the initial setup followed the path of least resistance. That decision finally blew up when an overly enthusiastic maintenance script, written by an AI assistant given the task of "clean up stale crons," deleted a cron job belonging to the Lead Hunter. The script searched for lines containing # deprecated and removed them—unaware that a comment in the Lead Hunter’s crontab contained that exact phrase. The result: the Lead Hunter’s email discovery pipeline stopped for 29 hours, and we missed a batch of high-value leads.
The root cause wasn’t just the reckless script; it was the total lack of isolation between projects. With a single user account, any process from any project could read, modify, or delete files belonging to another. Environment variables stored in a shared .bashrc leaked across services. Crontabs were edited by multiple agents with no change tracking. We had effectively built a race to the bottom where the most aggressive automation could corrupt the others.
We fixed this by enforcing strict user separation. Each project now has its own system user and group, created during provisioning with useradd -r -s /bin/false <project_user>. Crontabs are edited only via crontab -u <project_user> -e, never by hand-editing the spool files as root. Environment variables live in project-specific .env files, readable only by the project’s user (chmod 600). The table below compares the isolation strategies we evaluated before settling on this approach.

| Strategy | Setup and maintenance cost | Isolation provided |
| --- | --- | --- |
| Separate system users | Low: a useradd at provisioning time | File, crontab, and environment isolation via standard Unix permissions |
| Docker containers | Medium: image builds, registry management | Process, filesystem, and network isolation |
| SELinux/AppArmor profiles | High: custom policy authoring | Mandatory access control on top of Unix permissions |
We chose user separation because our services are lightweight Python scripts with minimal dependencies and no need for custom SELinux profiles. Docker would have added maintenance overhead (image builds, registry management) disproportionate to the risk. The crucial point is that some form of isolation is non-negotiable: even on a single VPS, treating projects as separate security boundaries prevents a misconfigured cron from becoming a VPS-wide outage.
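An audit script keeps the separation honest after provisioning. A hedged sketch that assumes our own naming rule (each /opt project directory is owned by a same-named system user, which is our convention, not anything Linux enforces):

```python
import os
import pwd

def owner_of(path):
    """Login name of the user that owns *path*."""
    return pwd.getpwuid(os.stat(path).st_uid).pw_name

def misowned_projects(root="/opt"):
    """Directories under *root* whose owner does not match their name."""
    bad = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path) and owner_of(path) != name:
            bad.append(path)
    return bad
```

Run from cron, a non-empty result means some project escaped its boundary and is worth a look before it causes the next contamination incident.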
A Standard for Sanity: /opt/ and Conventions That Save Hours
After the contamination incident, we audited the filesystem layout and found chaos: one project lived in /root/code, another in /home/ubuntu/projects/lead-hunter, and logs were scattered in /var/log/custom/, /tmp/, and even a home directory. This made it impossible to write a generic provisioning script or locate lock files when debugging. We wasted 45 minutes once searching for a logrotate config that someone had placed under /home/ubuntu/.config/logrotate/ "for testing".
We settled on a rigid convention that mirrors the /opt/ standard common in enterprise Linux environments, with the addition of a snake_case naming rule to avoid shell escaping issues. Everything a project needs at runtime is contained under one root directory: code/ for source, logs/ for log files, locks/ for flock file handles, and a .env file that holds secrets (never committed to version control). The directory tree is replicated for every project on every VPS.
/opt/wiki_publisher/
├── code/
│   └── publisher.py
├── logs/
├── locks/
└── .env    # chmod 600
The .env file is sourced by a wrapper script before execution, so publisher.py never touches environment variables from the shell profile. Permissions are enforced at creation time: install -d -o wiki_publisher -g wiki_publisher -m 750 /opt/wiki_publisher/locks. This structure allows any engineer—or an AI agent—to instantly locate resources for a given project. It also simplifies monitoring: a generic df check can watch all /opt/*/logs directories, and a single cron can validate that all .env files have 600 permissions.
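The 600-permission rule is likewise scriptable. A sketch of the validation cron mentioned above (the glob pattern encodes our layout assumption, and the function name is our own):

```python
import glob
import os
import stat

def insecure_env_files(pattern="/opt/*/.env"):
    """Return (path, mode) pairs for env files whose mode is not 0600."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode != 0o600:
            bad.append((path, oct(mode)))
    return bad
```

Note that because the literal component ".env" starts with a dot, Python's glob matches the hidden file; a bare `*` component would skip it.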
Adopting a standard like this removed an entire class of "mystery configuration" bugs. When a new developer joins, the answer to "where does X live?" is always some path under /opt/<project_name>/, and the convention answers the rest.