Late on Thursday the 5th, a fault in one of our UPS systems caused some service machines to lose power. A number of virtual services were affected, but the most noticeable effect was the loss of authentication services.
The problem was noticed at approximately 10pm, and full service was restored shortly after midnight.
We have taken steps to reduce the chance of this particular UPS causing problems in the short term. As a matter of priority, we will also improve the resilience of the authentication service by bringing forward existing plans.
Update 9th March
The UPS failed again early on Saturday. The authentication servers remained up this time, but unfortunately the switch that connects them to the rest of the department did not, so there were again authentication problems for a short time.
We have now moved one authentication service to a different server with a separate network path, so this problem should not recur. The failed UPS will be replaced, as this is now clearly not an isolated failure.