Online training · ~50 employees · Web services / DevOps

Training site that crashed 8,668 times in 3 months: 5 distinct causes found, zero downtime since

The challenge

A training site with recurring crashes: 8,668 server restarts in three months, five and a half hours of total downtime, and pages that became unreachable without warning for students in the middle of a lesson. The team had tried ad hoc fixes, but the problem kept coming back. Nobody could reproduce it consistently: it appeared and disappeared, and it affected some users but not others.
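To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the three months were about 90 days (the exact observation window is not stated) and takes the reported restart count and downtime at face value.

```python
# Back-of-the-envelope figures derived from the numbers reported above.
# ASSUMPTION: "three months" is taken as 90 days.
RESTARTS = 8_668        # server restarts reported for the period
DOWNTIME_HOURS = 5.5    # total downtime reported for the period
DAYS = 90               # assumed length of the observation window

restarts_per_day = RESTARTS / DAYS
minutes_between_restarts = 24 * 60 / restarts_per_day
seconds_down_per_restart = DOWNTIME_HOURS * 3600 / RESTARTS

print(f"{restarts_per_day:.0f} restarts per day")
print(f"one restart every {minutes_between_restarts:.0f} minutes on average")
print(f"{seconds_down_per_restart:.1f} s of downtime per restart on average")
```

Under that assumption the server restarted about 96 times a day, roughly once every 15 minutes, and each restart cost a little over two seconds on average. That is brief enough to look random, and frequent enough to interrupt a lesson.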

The solution

Systematic layer-by-layer analysis, with a documented rollback plan for every change: no modification was made permanent before its impact had been verified. Five distinct problems, each amplifying the others, were found and resolved: one component consumed all available memory on every page opened by a logged-in user; a security plugin that had been disabled kept running in the background and blocked other functions; the cache was configured in a way that meant it rarely worked; and two further problems were hidden in the server configuration and the database. After the fixes, pages load in 0.41 seconds and there have been zero crashes.

Results

💥

Memory per page: from crash to 30 MB

📉

8,668 crashes → 0 since the fix

Pages 24% faster as a side effect

🔍

5 distinct causes identified and resolved, not masked

Tech stack

  • PHP-FPM 8.3 + OPcache JIT
  • MariaDB (buffer pool, slow query log)
  • Redis (LRU eviction policy)
  • Nginx (proxy cache, gzip)
  • Custom WordPress mu-plugin
  • Plesk + Linux

It wasn’t one problem — it was five layered on top of each other

The symptom was always the same: an intermittent crash with no obvious cause. Behind it were five distinct problems, each triggering the next. One filled the memory; full memory slowed everything down; the slowdown made requests pile up in a queue; the queue triggered another crash. Fixing just one would have eased the symptom without removing the cause, and the problem would have come back.
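The feedback loop can be illustrated with simple capacity arithmetic. This is a minimal sketch, not a model of the actual server: the memory budget, the traffic rate, and the "degraded" figures are hypothetical, and only the 30 MB per page and 0.41 s per page come from the results reported here. It uses Little's law (average concurrency = arrival rate × time per request) to compare how many workers are needed with how many fit in memory.

```python
import math

def workers_that_fit(pool_memory_mb: float, memory_per_request_mb: float) -> int:
    """How many requests can run at once before the memory budget is exhausted."""
    return math.floor(pool_memory_mb / memory_per_request_mb)

def workers_needed(requests_per_second: float, seconds_per_request: float) -> int:
    """Little's law: average concurrency = arrival rate x time per request."""
    return math.ceil(requests_per_second * seconds_per_request)

POOL_MEMORY_MB = 2048      # HYPOTHETICAL memory budget for page workers
REQUESTS_PER_SECOND = 20   # HYPOTHETICAL traffic level

# Healthy case: 30 MB and 0.41 s per page (figures from the case study).
healthy_capacity = workers_that_fit(POOL_MEMORY_MB, 30)
healthy_demand = workers_needed(REQUESTS_PER_SECOND, 0.41)

# Degraded case: HYPOTHETICAL 256 MB and 2 s per page.
degraded_capacity = workers_that_fit(POOL_MEMORY_MB, 256)
degraded_demand = workers_needed(REQUESTS_PER_SECOND, 2.0)

for label, capacity, demand in [
    ("healthy", healthy_capacity, healthy_demand),
    ("degraded", degraded_capacity, degraded_demand),
]:
    status = "ok" if demand <= capacity else "requests queue up"
    print(f"{label}: need {demand} workers, {capacity} fit in memory -> {status}")
```

With these illustrative numbers, heavier pages reduce how many workers fit in memory, while slower pages raise how many are needed. The two effects pull in opposite directions, which is why easing only one of them tends not to hold.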

Diagnosis first, fix after

Every change was applied in isolation, with a rollback plan ready, and its impact was verified before moving on. No patch was applied in the hope that it would work. The result is a system that works for documented reasons, not by luck.
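The apply, verify, roll back discipline described above can be sketched as a small helper. This is an illustration under stated assumptions, not the tooling used in the project: the function `apply_change`, the `settings` dictionary, and the setting names are all hypothetical.

```python
from typing import Callable

def apply_change(name: str,
                 apply: Callable[[], object],
                 rollback: Callable[[], object],
                 verify: Callable[[], bool]) -> bool:
    """Apply one change in isolation and keep it only if verification passes."""
    try:
        apply()
        ok = bool(verify())
    except Exception:
        # A failure while applying or verifying counts as a failed change.
        ok = False
    if not ok:
        rollback()
    print(f"{name}: {'kept' if ok else 'rolled back'}")
    return ok

# HYPOTHETICAL settings store standing in for a real configuration.
settings = {"cache_enabled": False}

kept = apply_change(
    "enable cache",
    apply=lambda: settings.update(cache_enabled=True),
    rollback=lambda: settings.update(cache_enabled=False),
    verify=lambda: settings["cache_enabled"] is True,
)

reverted = apply_change(
    "set workers to zero",
    apply=lambda: settings.update(workers=0),
    rollback=lambda: settings.pop("workers", None),
    verify=lambda: settings["workers"] > 0,
)
```

Running one change at a time this way means that, if verification fails, there is exactly one change to undo and no doubt about which one caused the regression.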

The crashes were the tip of the iceberg

The most visible problem was hiding inefficiencies that slowed the site down even in normal conditions. Once the bugs were resolved, the site became 24% faster even under standard load. The cache system, which had been barely functioning due to a configuration error, now correctly serves all visitors — that contributes to the speed too.
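Why a cache that rarely hits costs speed even without crashes can be shown with a weighted average. The sketch below is illustrative only: the hit and miss timings and the hit ratios are hypothetical, not measurements from this site.

```python
def average_response_time(hit_ratio: float,
                          hit_seconds: float,
                          miss_seconds: float) -> float:
    """Expected response time given the share of requests served from cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_seconds + (1.0 - hit_ratio) * miss_seconds

HIT_SECONDS = 0.05    # HYPOTHETICAL time to serve a cached page
MISS_SECONDS = 0.50   # HYPOTHETICAL time to build a page from scratch

rarely_hits = average_response_time(0.10, HIT_SECONDS, MISS_SECONDS)
mostly_hits = average_response_time(0.90, HIT_SECONDS, MISS_SECONDS)

print(f"cache hits 10% of the time: {rarely_hits:.3f} s on average")
print(f"cache hits 90% of the time: {mostly_hits:.3f} s on average")
```

With these example values, the average drops from about 0.455 s to about 0.095 s purely because more requests are served from cache. The same server ends up doing far less work per visitor.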
