The single VPS this site runs on was still on Debian 11. It was fine. "Fine" is how a box gets to be four years behind, so I spent a Saturday moving it to Debian 12 (bookworm). I chose a clean reinstall over an in-place dist-upgrade, because I wanted the machine's state to be something I could reproduce from a file rather than something I'd accreted by hand. Notes follow, mostly so the next time I do this I don't have to remember any of it.
Reproducible, or it didn't happen
The rule I gave myself: if a piece of config only exists on the running box, it doesn't exist. Everything goes in a repo first. Before touching the server I had:
- The Caddyfile and the static site, version-controlled.
- A short provisioning script — user, firewall, packages, unattended-upgrades — that I could run on a fresh image.
- A
pg_dumpof every database, plus a tested restore on my laptop. An untested backup is a rumor.
Then it's just: spin a new bookworm instance next to the old one, run the script, restore the data, flip DNS, watch, destroy the old box a week later once I trust it. No upgrade-in-place anxiety, and the old box stays as a rollback.
The things that bit me
It went mostly to plan. Three things didn't.
Postgres major version. Bullseye shipped PG 13, bookworm ships PG 15. A plain file copy of the data directory does not survive a major version bump — you migrate via pg_dump/pg_restore (which I'd done) or pg_upgrade. I'd planned for this, but it's the classic way to lose an afternoon if you assume the data dir is portable. It isn't.
The firewall default flipped under me. I moved from a hand-rolled iptables script to nftables, which is the bookworm default, and got the ordering wrong so that my allow rules sat below a catch-all drop. SSH held because the existing connection was already established (conntrack), but a fresh connection would've been refused. I caught it by opening a second terminal before reloading rules — the oldest trick, and the only reason I'm not writing this from a recovery console.
# always have an escape hatch when editing firewall rules
ssh box 'sleep 600 && nft flush ruleset' &
# then go edit. if you lock yourself out, the timer saves you.
Caddy and a stale ACME account. Fresh box, fresh Caddy data dir, so it went to re-issue certs and briefly tripped Let's Encrypt rate limits because I'd been testing the cutover against the production endpoint instead of staging. Lesson re-learned: point at the ACME staging directory while you rehearse, switch to prod only for the real flip.
What I kept from the old box
Nothing, on purpose. Every byte that mattered came from a repo or a dump. The home directory's accumulated cruft — half-finished scripts, a tmp/ I was scared to delete — got left to die with the old instance. That's the real payoff of treating a reinstall as the migration path: it's a forcing function for throwing things away.
The box now runs Caddy, one Postgres, and the litequeue worker, under systemd, with unattended-upgrades on so I'm not four years behind again. The colophon has the full rundown of what's on it. Total downtime at the DNS flip was under a minute. Most of the Saturday was the rehearsal; the cutover was boring, which is the goal.
← back to writing