
The Gemini Key Expired at 14:03. Here's What the Next Six Hours Taught Us About Silent Failures.

postmortem · migration · monitoring · deepseek · case-study

The Gemini API key expired at 14:03 Istanbul time on a Wednesday, and we found out about it exactly the way you would expect from a monitoring system that hasn't been stress-tested: we didn't. The next scheduled blog generation cron at 08:15 Thursday morning ran, hit a 400 response from Google's endpoint (API key expired. Please renew the API key.), swallowed the exception, returned {"status": "error", "reason": "gemini generation"}, and moved on. No alert fired. No email was sent. The log file on the VPS grew by 273 bytes and nothing visible changed.

This post is about the six hours between realizing that and having the whole LLM pipeline migrated to DeepSeek V4 Pro running on a stricter monitoring stack. It's not a glamorous migration story — there was no downtime to fix, no customer ticket. But the way we discovered it, debugged it, and shipped the replacement is exactly the kind of iteration AgentMinds is built to accelerate, and we want to show the actual sequence.

What monitoring caught, and what it missed

Our freshness_agent runs every 6 hours on the VPS and checks three things: when did the pipeline last finish, when was the last blog post committed, when did wiki-lint last run. If any of those crosses a threshold (18h, 48h, 10 days respectively), the agent raises a critical warning and sends an SMTP alert through alert_email.py with a 60-minute cooldown key.
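That threshold logic reduces to a handful of age comparisons. A minimal sketch, assuming a simple timestamp dict (the key names here are illustrative, not the real scan-state schema):

```python
import time

# Thresholds from the post: pipeline 18h, blog 48h, wiki-lint 10 days.
THRESHOLDS_HOURS = {
    "pipeline_last_finished": 18,
    "last_blog_commit": 48,
    "wiki_lint_last_run": 240,
}

def stale_checks(timestamps, now=None):
    """Return the names of checks whose last-seen timestamp exceeds its threshold."""
    now = time.time() if now is None else now
    stale = []
    for name, limit_hours in THRESHOLDS_HOURS.items():
        age_hours = (now - timestamps[name]) / 3600
        if age_hours > limit_hours:
            stale.append(name)
    return stale
```

Anything returned by `stale_checks` would trip the critical warning and the SMTP alert path.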

At 14:03 Wednesday, we manually asked: does the Gemini key still work? It didn't. But here's the thing: the blog freshness threshold is 48 hours. The Monday morning cron had succeeded (pre-expiry). The Tuesday morning cron had succeeded. It would take until Friday 08:15 — more than 48 hours after the last successful post — for freshness_agent to trigger. We had a two-day blind spot between "cron runs but silently errors" and "monitoring notices."

This is the exact category of silent failure our platform is built to expose in other people's code. Finding it in our own was uncomfortable and honestly useful. We now cache the last LLM success timestamp in the scan state and flag when a cron succeeds but produces no artifact — a meta-pattern we'll be rolling out to connected sites next.
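The succeeded-but-no-artifact check is small once the state is cached. A sketch under assumed names (the state keys here are ours, not the repo's actual schema):

```python
# Flag a cron run that exits "success" but wrote nothing new.
def cron_produced_artifact(state, run_finished_at):
    """True if an artifact landed at or after the cron's finish time."""
    return state.get("last_artifact_at", 0) >= run_finished_at

def check_run(state, run_finished_at):
    if cron_produced_artifact(state, run_finished_at):
        return "ok"
    # The run "succeeded" but produced nothing: the silent-failure case.
    return "succeeded_without_artifact"
```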

Choosing the replacement model

We considered three paths:

| Option | Latency | JSON reliability | Cost | Chosen |
|---|---|---|---|---|
| Rotate Gemini key | ~1s | Native responseSchema | Same | No |
| Claude 3.5 Sonnet | 2-4s | Excellent | 5x higher | No |
| DeepSeek V4 Pro | 120-200s | Native response_format | Lowest | Yes |

The first option was tempting because it was a 30-second fix. We rejected it because our single-provider dependency had just cost us a monitoring gap; doubling down on the same vendor seemed like confirmation bias. Claude was rejected on cost — we run LLM calls across a dozen endpoints and a six-hour pipeline, and the unit economics of the collective-intelligence model assume a cheap reasoning model.

DeepSeek V4 Pro was interesting because it's a reasoning model — similar in shape to Gemini 3.1 Pro Preview or OpenAI's o-series — but with a specific quirk we had to design around: the v3 deepseek-reasoner endpoint did not support response_format=json_object. Setting that parameter caused the API to put all output into reasoning_content and leave content empty. V3 Chat worked fine for JSON; V3 Reasoner did not. V4 (Flash and Pro) fixed this. V4 Pro correctly populates content alongside reasoning_content when json_object mode is set.
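To make the v3-reasoner failure mode loud rather than silent, the parsing side can guard on an empty content field. A minimal sketch, assuming the OpenAI-style chat-completions response shape DeepSeek serves (the helper name and the raise-instead-of-fallback choice are ours):

```python
import json

def extract_json_content(response):
    """Pull the JSON payload out of an OpenAI-style chat completion.

    Raises instead of falling back to reasoning_content, so the empty-content
    failure mode fails loudly rather than producing a degraded result.
    """
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    if not content:
        raise RuntimeError("empty content field; refusing to read reasoning_content")
    return json.loads(content)
```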

We tested both variants against the actual daily-blog prompt:

  • V4 Flash: 21 seconds, 1,530 output tokens (254 reasoning), valid JSON
  • V4 Pro: 156 seconds, 5,980 output tokens (4,261 reasoning), valid JSON, noticeably higher prose quality

We picked V4 Pro as the sole model. The 156-second latency is real — it doesn't fit inside a 30-second HTTP request. Our /api/v1/cron/daily-blog endpoint already runs under a VPS cron curl with a --max-time 300 ceiling, which we raised to 420s for the new model's long-reasoning path.
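The cron side stays a single crontab line; the schedule matches the post, but the domain and log path below are placeholders:

```shell
# Illustrative crontab entry (domain and log path are placeholders).
# --max-time 420 gives the V4 Pro long-reasoning path room to finish.
15 8 * * * curl -fsS --max-time 420 -X POST https://example.com/api/v1/cron/daily-blog >> /var/log/daily-blog.log 2>&1
```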

The one-client rule

Before the migration, four separate code paths in our repo called LLMs directly:

  • api/app.py daily-blog — direct call_gemini with manual JSON parsing
  • core/base_agent.py _call_ai — direct urllib.request to Gemini v1beta endpoint
  • sync/analyzer.py deep_analysis — another direct Gemini HTTP call
  • knowledge_engine/llm.py LLMClient — parallel Gemini implementation with its own retry logic

Four call sites, four different timeout values, four slightly different retry policies. During the migration we collapsed all of them onto a single client (wiki/gemini_client.py) with backward-compatible names (call_gemini, call_gemini_json) so imports didn't break. The client has exactly one knob that matters: the model constant at the top of the file.
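The shape of that single client, sketched with stubbed internals — the names call_gemini, call_gemini_json, and the single model constant follow the post; everything else is illustrative, not the repo's actual wiki/gemini_client.py:

```python
import json

MODEL = "deepseek-v4-pro"  # the one knob that matters

def _post(payload):
    # Placeholder for the single HTTP path: one timeout, one retry policy.
    # A real client would POST the payload to the provider here.
    return json.dumps({"echo": payload["prompt"], "model": payload["model"]})

def call_gemini(prompt):
    """Backward-compatible name; routes to whatever MODEL points at."""
    return _post({"model": MODEL, "prompt": prompt, "json": False})

def call_gemini_json(prompt):
    """Backward-compatible name for the JSON-mode path."""
    return json.loads(_post({"model": MODEL, "prompt": prompt, "json": True}))
```

Because every call site funnels through `_post`, swapping providers means editing one constant, not chasing four divergent retry policies.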

We wrote a Claude Code subagent to enforce this — if any future commit adds a direct requests.post to api.deepseek.com or imports the OpenAI SDK, the reviewer blocks the change. The rule is codified in .claude/agents/llm-call-reviewer.md and committed to the repo, so every developer and every agent working on the codebase sees the same constraint.

This matters because the temptation during a migration is to leave one fallback path "just in case." Every fallback path becomes a divergence point later. Our directive after the migration was explicit: only V4 Pro, nothing else. No Flash fallback. No Gemini legacy path. If V4 Pro returns a malformed response or times out, we raise an exception. We do not silently downgrade.

Is this brittle? Yes. That's the point. A silent downgrade is a silent failure, and silent failures are what we got hired to surface. Loud failures, even inconvenient ones, give you data you can act on.

Shipping under a broken build we didn't know about

Mid-migration, a separate bug bit us. We pushed the V4 client, the deploy went green, the next scheduled daily-blog cron ran and committed a new post. The post showed up in posts.ts on main. The frontend deploy triggered on Render. It failed with a TypeScript error we didn't expect:

    Type error: Type '({ slug: string; ... } | undefined)[]' is not assignable to type 'BlogPost[]'.
      Type '... | undefined' is not assignable to type 'BlogPost'.
    

Four consecutive frontend deploys had been failing silently for roughly 26 hours. The API deploys were green, because the API doesn't build the blog page. The web service's rootDir: web setting meant commits that only touched master-agent-system/** never triggered a rebuild. So for a day and a half, new blog posts were landing in the repo but never on the site. The blog index on agentminds.dev/blog kept showing 7 posts when the source file had 9.

The root cause was embarrassing once we found it: our daily-blog endpoint's splice logic was appending a redundant comma. When adding a new entry to the posts array, it would run:

    new_src = posts_src.rstrip()[:-2].rstrip() + ",\n" + new_entry + "\n"
    

The ,\n was inserted between the existing last entry (which already ends with ,) and the new entry. The result was },, — two commas, which TypeScript reads as an undefined array element. The fix is a small guard: only add the comma when the last entry lacks one.

    trimmed = posts_src.rstrip()[:-2].rstrip()
    if trimmed.endswith("}"):
        trimmed += ","
    new_src = trimmed + "\n" + new_entry + "\n"
    

We added a Python unit test that runs the splice against five synthetic inputs (happy path, missing trailing comma, empty array, whitespace, multiple entries) and asserts no ,, appears in the output. That test now runs on every endpoint-reviewer invocation.
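A condensed version of that regression test, using two of the synthetic inputs (the real suite covers five; the function bodies mirror the before/after snippets above):

```python
def buggy_splice(posts_src, new_entry):
    # The pre-fix version: always inserts a comma, even after an existing one.
    return posts_src.rstrip()[:-2].rstrip() + ",\n" + new_entry + "\n"

def fixed_splice(posts_src, new_entry):
    # The fixed version: only add a comma when the last entry lacks one.
    trimmed = posts_src.rstrip()[:-2].rstrip()
    if trimmed.endswith("}"):
        trimmed += ","
    return trimmed + "\n" + new_entry + "\n"

CASES = [
    '[\n  { slug: "a" },\n];',  # happy path: trailing comma present
    '[\n  { slug: "a" }\n];',   # missing trailing comma
]
```

Asserting `",," not in output` across the cases catches the regression without caring about the surrounding TypeScript formatting.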

The broader lesson: when a build failure doesn't affect the user-facing symptom you're watching, you can go days without noticing. We now have a separate agentminds-deploy skill that monitors both the API and Web Render services after any push and raises if either goes build_failed. It's wired into the Claude Code workflow we use to ship, so skipping it isn't an option.

What we changed about how we work

The Gemini expiry, the silent build failures, and the model migration all happened inside six hours. Looking back, the sequence looks clean in a retrospective — "here is what we did, here is what we learned" — but in the moment, each step exposed a gap we hadn't budgeted for.

Three things are different now:

The monitoring-the-monitor problem. We added a dead-man's switch: if freshness_agent itself fails to report within 12 hours, a separate check on the API side notices. Before this, our monitors only monitored the product. Now they monitor each other. We'll write this up as a reusable pattern in the knowledge pool.
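The dead-man's switch reduces to a single age check on the monitor's own heartbeat. A sketch with illustrative names (the real check lives on the API side):

```python
import time

DEADMAN_HOURS = 12  # freshness_agent must report at least this often

def monitor_is_alive(last_heartbeat, now=None):
    """True if the monitor has reported within the dead-man window."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= DEADMAN_HOURS * 3600

def check_deadman(last_heartbeat, now=None):
    return "ok" if monitor_is_alive(last_heartbeat, now) else "monitor_silent"
```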

The build-health vs service-health distinction. We were checking whether the API responded 200 OK. We were not checking whether recent GitHub commits successfully deployed to the web frontend. Those are different questions. The new deploy skill polls both Render services on every push and reports build status.
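The classification behind that check can be sketched as a pure function over the latest deploy status per service; the field names and status strings here mimic a deploy-status API payload and are assumptions, not Render's documented schema:

```python
# Deploy states we treat as alarm-worthy (illustrative set).
FAILED_STATES = {"build_failed", "update_failed"}

def failing_services(latest_deploys):
    """Given {service_name: latest_deploy_status}, return services to alarm on."""
    return [name for name, status in sorted(latest_deploys.items())
            if status in FAILED_STATES]
```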

Single-provider fragility. We're still on a single LLM vendor for good reasons (cost, latency, JSON reliability), but we've separated the concerns: the client interface exposes call_gemini / call_gemini_json names regardless of the underlying model, so swapping providers next time is an env-var change, not a 46-file refactor.

We didn't lose any customer-visible uptime. But we lost a day and a half of blog freshness that nothing in our system flagged, and we came very close to losing the daily-blog cron entirely, since the freshness threshold would not have caught it until much later. That's not an acceptable margin, and this post exists partly as a forcing function: writing the sequence down made us commit to the fixes instead of making a mental note and moving on.

If you run a site with scheduled background work — crons, workers, pipelines — the specific vulnerabilities we hit are probably present in your setup too. The generic fix ("monitor your monitors") is easy to agree with and hard to implement well. The specific fix is harder: decide exactly which silent-failure mode you're most exposed to, instrument for that one, and make the alert path loud enough that it survives your own cooldown logic.

Scan your site free — we'll tell you which silent-failure patterns already live in your infrastructure, and which ones other sites in the network have already solved.
