Software disenchantment @ tonsky.me

Programs can’t work for years without reboots anymore. Sometimes even days are too much to ask. Random stuff happens and nobody knows why.

What’s worse, nobody has time to stop and figure out what happened. Why bother if you can always buy your way out of it. Spin another AWS instance. Restart process. Drop and restore the whole database. Write a watchdog that will restart your broken app every 20 minutes. Include same resources multiple times, zip and ship. Move fast, don’t fix.

That is not engineering. That’s just lazy programming. Engineering is understanding performance, structure, limits of what you build, deeply. Combining poorly written stuff with more poorly written stuff goes strictly against that. To progress, we need to understand what and why are we doing.

Source: Software disenchantment @ tonsky.me

About 20 years ago, I was working as a Unix sysadmin, and sat in on a meeting about moving an internally-developed application from another data center to mine. It ran on Windows, and died, literally, every day, and required a restart of the whole machine to fix. The manager in the meeting (who, I note, I recommended not be hired, and who was fired for sexual harassment just a few months later) said, “OK, we’ll just schedule it as part of maintenance tasks to preemptively reboot the machine every night.”

I literally snorted. I asked if it were not possible to, you know, actually fix the program? Find the memory leak, or whatever the problem was? I mean, it was written by us; couldn’t we get the programmer to fix their own program? The answer was, of course, no, with the added insinuation that it was ridiculous for me to suggest that the programmer still had work to do!

About 4 years ago, I wrote a program that helped a lot of people get their jobs done much more easily and efficiently. Per Douglas Adams, “This has made a lot of people very angry and been widely regarded as a bad move.” I was forced to hand the program over to another team, where it has run, with only one tiny patch, for 4 years now. It is not a trivial program or architecture. To my knowledge, neither the clients nor the server ever crash or need to be restarted. I’m very proud of this.