A long time ago, a web server I look after (divf.eng.cam.ac.uk) would regularly get into an (irretrievably) wedged state, and a while later people would complain.
It was not obvious what was causing this, and disruptive while it was happening. So, as a quick fix, I wrote a script that ran every hour on the Xen host of which it was a guest. The script attempted to fetch http://divf.eng.cam.ac.uk/ using wget with a 5-second timeout, and if that failed, ran the xm command to send Magic SysRq syncs followed by a Magic SysRq reboot to the guest.
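Something along these lines (a minimal sketch of the script described above; the guest name, the exact wget flags, and the sleep are illustrative assumptions, not the original script):

    #!/bin/sh
    # Hourly watchdog, run from cron on the Xen host.
    GUEST=divf                       # Xen domain name of the guest (assumed)
    URL=http://divf.eng.cam.ac.uk/   # later changed to .../robots.txt

    # --max-redirect=0 is one way to get the behaviour described below:
    # succeed only on a direct fetch, and treat a redirect as a failure.
    if ! wget -q -O /dev/null --timeout=5 --tries=1 --max-redirect=0 "$URL"
    then
        xm sysrq "$GUEST" s   # Magic SysRq: sync discs
        sleep 5               # give the sync a moment to complete (assumed)
        xm sysrq "$GUEST" b   # Magic SysRq: reboot
    fi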
Fast forward many months, and the “This is Division F” page that used to be at that URL was no longer the official “This is Division F” page (which was now under the control of a more central office, on a more central web server).
So, to avoid leaving a dangling, unmaintained and increasingly out-of-date page that might cause confusion, I made http://divf.eng.cam.ac.uk a redirect to the new “official” page.
Of course, the wget command I used now always failed, because it was designed to succeed only if it fetched a page directly, and regarded a redirect as a failure.
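(By default wget happily follows redirects, up to 20 of them, so a plain fetch would have kept succeeding; the failure implies something like the --max-redirect=0 assumed in the sketch above:

    # Default behaviour: wget follows the redirect and succeeds.
    wget -q -O /dev/null http://divf.eng.cam.ac.uk/                    # exit status 0
    # With redirects forbidden, the same fetch now always fails:
    wget -q -O /dev/null --max-redirect=0 http://divf.eng.cam.ac.uk/   # non-zero exit status

)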
I just noticed that divf.eng.cam.ac.uk has been rebooting every hour for quite some time.
[Trivial fix – the script now tries to fetch http://divf.eng.cam.ac.uk/robots.txt]
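In terms of the sketch above, that is a one-line change:

    # robots.txt is served directly rather than redirected, so the
    # check once again succeeds whenever the server is actually up:
    URL=http://divf.eng.cam.ac.uk/robots.txt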
The interesting question is how to record dependencies like this so that one is reminded of them when one makes apparently entirely unrelated changes to systems months or years later. And to do so in such a way that the resulting “change management” framework doesn’t utterly stifle creativity, and indeed cause a crippling inability to do anything at all.
(Better to have a web server rebooting every hour (which no-one has noticed, because it only takes a couple of seconds to reboot) than nothing at all.)
John Sloan says
My solution to similar problems is to have restart scripts of this nature be sufficiently noisy when they do things (i.e. cause cron to generate an email). That makes a change of behaviour from occasional to clearly hourly a lot more noticeable, at the cost of a little more spam under normal conditions.
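Concretely (a sketch, relying on cron’s standard mail-on-output behaviour; the address is hypothetical):

    # In the crontab on the Xen host:
    #   MAILTO=webmaster@eng.cam.ac.uk
    # and in the watchdog, just before the reboot:
    echo "watchdog: $URL unreachable, sending SysRq sync+reboot to $GUEST"
    # cron mails any output a job produces, so every firing of the
    # watchdog now generates an email.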
Patrick Gosling says
A getting-there-from-here problem – first I have to reduce the noise from all the other cron jobs so that this kind of thing stands out. [Not that I’m saying that would be a bad thing …]