…Or, balancing infrastructure maintenance against project delivery, and hoping that you don’t fall off
It’s never a good sign when, whatever you do on a computer, all it will say is “I/O Error”. That’s what happened a couple of days ago with one of our older servers, and it soon became clear that things weren’t going to improve.
It was a Monday morning, and the first day of our system manager’s holiday. Leaving it until his return wasn’t an option: the server co-ordinated a number of important processes every night; we had to fix things. Fortunately, some work had already been done to migrate those processes to a newer server, so a replacement was available. But the work had stalled under the pressure to deliver projects with direct and tangible benefits to users – and that’s where the difficulty arises.
Preventative maintenance work such as that migration is like an insurance policy: if it isn’t done, things might go wrong, and then we’ll be in a mess. On the other hand, things might not go wrong – in which case, time spent doing the maintenance work is time that could have been spent doing other things instead. And you can’t tell which it will be: a year from now, a server might have died, or it might not. It’s like Schrödinger’s IT.
Maintenance isn’t glamorous, and people don’t necessarily even know that it’s going on. But if you’re delivering new projects, the benefits are obvious, people like them, and you’re doing what they want. Our difficulty for a long time has been making a case to allocate time for maintenance in the face of ever-increasing requirements for new things: systems, networks, servers, databases and web sites. Maintenance gets squeezed in round the edges and into the evenings – but, more often, it gets squeezed out. If it starts to look difficult, it stalls; if it needs a lot of time, it stalls; and if it needs several people to work on it at once, it may never start at all.
And that’s where we came in: an old server was still in use because we couldn’t allocate the time for maintenance…and on Monday, it failed. Three or four people had to drop everything for a day and a half to deal with it. We managed it, of course, and almost no one will have been aware that there was a problem. Some might argue that this is a valid approach to all maintenance: after all, it forces the work to be done; it may break a logjam of debate over the best methods; and it probably gets things done more quickly than they would have been otherwise. But it does that by disrupting all other work, and – in some cases – by disrupting services, too. It causes corners to be cut so that the work can be done quickly, and it creates enormous stress for those involved – not to mention long hours. No, it’s not a valid approach.
Our challenge is to find a way to make a case for that maintenance work, and to make its benefits as clear – and quantifiable – as those of the shiny deliverables from new projects. The case has to stand up beside the cases for new projects; it has to be taken seriously; and it has to be given the time and resources that it needs.