Friday, 14 September 2012

The perfect power backup for servers

Recently we had an issue where all our servers were unavailable on a Sunday morning. We could only assume that a power cut had taken place or that one of the circuits had tripped.

To answer the post's subject, "The perfect power backup for servers" would of course be an inline generator: effectively, when the power goes off, batteries take care of providing power for a few seconds whilst the generator kicks in. Of course, for most of us our IT systems do not warrant this sort of expense; you tend to find generators in data centres and in large companies that host their own servers.

That said, there is a lot we can do to make sure that we look after our servers correctly.

So when a power outage occurs, in my opinion you need a UPS (Uninterruptible Power Supply) that will do the following, in this order:
  1. Power your servers until the battery only has enough juice left to shut down the servers. Obviously, if the power is restored by then, great! I would say that 10 minutes or so on battery is more than adequate. If the power is off for 10 minutes you can assume it's going to be a substantial power cut.
  2. Once the battery only has enough juice to power down the servers gracefully, the UPS should issue the shutdown command to all the servers that it is connected to.
  3. The UPS should then wait a period of time and then turn off the output to the servers connected to it. Many UPSs have multiple output ports, so you can set up these outputs to turn off after different amounts of time.
  4. Once power is restored, the UPS should wait a period of time before restoring power to the output ports, to make sure the power isn't flapping. I would say that 60 seconds of continuous power is probably enough to safely say that power has been fully restored.
  5. Once power is restored, the servers should automatically boot back up again, leaving everything in a usable state.
  6. You need to be notified of what's happening every step of the way. There is no point having problems and finding out about them first thing Monday morning, or when an irate manager calls you.
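
The sequence above can be sketched as a small decision function. This is only an illustration of the ordering, not how any particular UPS firmware works: the state fields, the action names and the 300 and 180 second thresholds are all made up for the example (only the 60 seconds of stable power comes from step 4).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds for illustration; tune them for your own kit.
SHUTDOWN_RESERVE_S = 300   # runtime the servers need to shut down gracefully
OUTPUT_OFF_DELAY_S = 180   # wait after the shutdown command before cutting output
RESTORE_STABLE_S = 60      # continuous mains required before restoring output


@dataclass
class UpsState:
    on_mains: bool                  # True while input power is present
    runtime_left_s: int             # estimated battery runtime remaining
    mains_stable_s: int             # seconds of continuous mains since it returned
    output_on: bool                 # whether the UPS is powering its outlets
    since_shutdown_cmd_s: Optional[int] = None  # None until shutdown has been sent


def next_action(state: UpsState) -> str:
    """Return the action the UPS should take for the given state."""
    # Step 3: once shutdown has been requested, always finish by cutting the
    # output, even if mains came back part-way through the shutdown.
    if state.output_on and state.since_shutdown_cmd_s is not None:
        if state.since_shutdown_cmd_s >= OUTPUT_OFF_DELAY_S:
            return "cut_output"
        return "wait"
    if not state.on_mains:
        if not state.output_on:
            return "wait"                # already off, waiting for mains
        if state.runtime_left_s > SHUTDOWN_RESERVE_S:
            return "run_on_battery"      # step 1
        return "signal_shutdown"         # step 2
    if not state.output_on:
        # Step 4: only restore output after a period of stable mains.
        if state.mains_stable_s >= RESTORE_STABLE_S:
            return "restore_output"
        return "wait"
    return "normal"
```

In a real setup every change of action would also trigger a notification (step 6), and step 5 depends on the servers themselves being configured to power on when AC returns.
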
Many of our clients just have a couple of servers and one UPS. The setup for these is pretty simple and usually consists of a server linked to the UPS via USB with some control software installed. In Fusion's comms room we have around 10 servers, 2 PoE switches, 4 routers and one firewall. To support this we have 3 UPSs, each with a network management card installed, which means that we don't have to rely on a server being up to look at what's going on or to configure the UPS.

Even as I write this, more and more scenarios come to mind. Let's take a look at some of them and how we have tackled them:
  • The first thing is to make sure that your UPS is working correctly. I have seen it time and time again: a UPS with red flashing lights usually means that something is wrong. Did you know that if you let your UPS battery drain below about 30% you will almost definitely cause long-term damage to it? One of the things you need to ensure is that your UPS is never allowed to drain fully; we tend to configure ours so that they shut down the output when they reach 30%, no matter what's still running. You can even put a multimeter across the battery terminals, find the voltage and work out if the battery is faulty; search for "lead battery condition" and all will be revealed. Looking after batteries is a fine art in itself.
  • It's worth doing a run test of a UPS to check that things are functioning correctly. I have heard of people using electric heaters as the load for this! If it's not functioning correctly, replace the battery. Did you know that most batteries, depending on how they have been treated, will only last between 3 and 5 years?
  • Make sure that you test the run time with the maximum possible load going through the UPS; always work on the worst-case scenario. Be careful not to overload the UPS.
  • Are you using the right plugs on your UPS, with the right fuses in? Kettle leads quite often have either 5 or 10 amp fuses in. Some UPSs can easily pull more than 5 amps!
  • If the servers do shut down, the UPS should only power them back up when the power supply is stable again and when the UPS(s) have enough charge to start them up and then shut them back down should another power failure occur.
  • What if your servers have redundant power supplies? Well, this opens another can of worms. Do you just plug both PSUs into the same UPS? Surely this is a single point of failure? Do you plug one PSU into the UPS and one into the wall socket? Do you have multiple power sources / multiple ways coming into the room? In an ideal scenario you would have two UPSs that talk to each other, each UPS fed from a different power supply / way, with PS1 plugged into one UPS and PS2 into the other... The only thing is, UPSs are not cheap! You have also got to make sure that your switch is powered by a UPS, otherwise your servers will lose comms to the UPS and the UPS will be unable to issue the shutdown command once the runtime threshold has been hit.
  • Let's just say you do plug one PS into the UPS and the other into the wall (obviously using a surge protector). What happens if the power goes on the UPS side? Well, eventually the UPS will shut down these servers even though they have a perfectly good power supply coming from the wall socket.
  • The way you configure servers to start automatically when power is restored is to enable the BIOS setting to power on with AC connect (please bear in mind that this setting is usually disabled by default). Basically, if the servers are shut down via the UPS and they still have a good power supply from the other PSU, the servers will never switch on again because effectively they never lost power. One way around this, I guess, is using some kind of management card for the server that allows you to power it on remotely. Note: You can enable this setting on Dell servers with OpenManage installed by using the omconfig command line; search the web for more details, there is plenty of documentation on this.
  • What happens if the power goes off on both power sources and the UPS eventually shuts down the servers, but power is restored as the servers are shutting down? I guess in this scenario you are in a similar position to my last bullet point.
  • What about your comms kit? Most comms kit doesn't have support for UPS communications! It is this sort of kit that will drain your battery to its last few percent.
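
The redundant power supply scenarios above boil down to one rule, which can be written out as a tiny model. It is a simplification based on the behaviour described in the bullets (a server that has been shut down only powers itself back on if "power on with AC connect" is enabled and it actually saw AC disappear on every feed); real BIOS behaviour varies by vendor, and the function name is made up for the example.

```python
from typing import Sequence


def restarts_on_its_own(ac_power_on_enabled: bool,
                        feeds_dropped: Sequence[bool]) -> bool:
    """Will a shut-down server boot by itself once power is back?

    feeds_dropped holds one flag per PSU feed: True if that feed lost AC
    at some point after the server shut down.
    """
    if not feeds_dropped:
        raise ValueError("a server needs at least one PSU feed")
    # The server only notices 'AC connect' if it was completely without AC,
    # i.e. every one of its feeds dropped.
    lost_all_ac = all(feeds_dropped)
    return ac_power_on_enabled and lost_all_ac


# One PSU on the UPS (output cut), the other on a wall socket that stayed up:
print(restarts_on_its_own(True, [True, False]))   # False: it never lost power
# Both feeds dropped and the BIOS setting is enabled:
print(restarts_on_its_own(True, [True, True]))    # True
```

In the first case someone has to power the server on remotely, for example via a management card or by cycling the second feed.
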
So back to the problem that we had. It was quite simple really. We have 3 UPSs: one for all our comms kit, and the other two to share the server load. The one running the comms kit is a Dell UPS Tower 1920W HV, and the other two are an APC Smart-UPS 2200 RM and an APC Smart-UPS 1500 XLM; as I have said before, they have all got network management cards in. The servers were split over the two APC UPSs, and the redundant power supplies were all connected to a power block which was then connected directly to the Dell UPS. The series of events was that the power was on and off all Sunday morning between 7 and 8, culminating in at least one of the APCs transferring all its load to the Dell UPS, and around 10.7 amps going through something that was only ever designed to take a maximum of 8 amps! This blew the fuse (see below) to the UPS supplying all our comms kit and, at this point, most of the servers, and it wasn't long before there was silence in the server room... Our engineer Ryan was dispatched later that afternoon to get things up and running again. We have learnt a lot from this incident and have now put measures in place so that this will not happen again and so that we have better visibility of what's going on.
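
The arithmetic behind the blown fuse is worth checking for your own setup. Here is a minimal sketch: the 8 amp limit and the 10.7 amp total are the figures from our incident, but the split between comms kit and servers (2.5 and 4.1 amps) is invented purely to make the example add up.

```python
from typing import Sequence


def worst_case_amps(own_load_amps: float,
                    failover_loads_amps: Sequence[float]) -> float:
    """Current on a feed if every redundant PSU plugged into it takes full load."""
    return own_load_amps + sum(failover_loads_amps)


def is_overloaded(total_amps: float, rated_amps: float) -> bool:
    """True if the worst-case current exceeds what the feed is rated for."""
    return total_amps > rated_amps


RATED_AMPS = 8.0        # what the feed was designed to take (from the incident)
COMMS_KIT_AMPS = 2.5    # hypothetical normal load on the comms UPS
SERVER_GROUPS = [4.1, 4.1]  # hypothetical load that can fail over onto it

total = worst_case_amps(COMMS_KIT_AMPS, SERVER_GROUPS)
if is_overloaded(total, RATED_AMPS):
    print(f"Worst case {total:.1f} A exceeds the {RATED_AMPS:.1f} A rating")
```

The point is to size every feed for the load it could end up carrying after a failover, not just for the load it carries on a normal day.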


We still have the same three UPSs. One runs the comms kit, which it will do for about an hour until the battery is at about 30%; if we do lose power for a long period of time, the management interface will still be up as this uses very little power. The servers are still split across the other two UPSs, but the redundant power supplies now go to an APC AP7920 Switched Rack PDU. This also has a management interface, so provided that we have comms (one of our 4 links coming into the building) we can always get onto all the UPSs and the PDU. We have set up the PDU to wait 15 minutes if there is a disruption to its power source. This should mean that if the servers do start to shut down and power gets restored mid-shutdown, they should still come back up when power is restored to the UPS or 15 minutes after power is restored to the PDU. It also means we can easily reboot all the servers remotely by powering them off and on at the mains; that said, I don't like doing this as it's not good for servers and defeats the whole point of a graceful shutdown. We have also had a dedicated 20 amp way installed into the server room that is used for powering the UPSs, which should ensure that another device in the building will not knock out our server room!

To ensure we are kept up to date with what is happening during a power failure, critical errors are now sent by SMS out of hours and also emailed using an external mail provider (bear in mind there is no point using your internal SMTP server to let you know that the power has been restored if the UPSs have shut it down).
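
As an illustration of the email side, here is a rough sketch of a script that sends an alert through an external SMTP provider. In practice the network management cards can send these notifications themselves; the function names, addresses and the idea of a separate script are assumptions for the example, and the host, port and credentials would be whatever your external provider gives you.

```python
import smtplib
from email.message import EmailMessage


def build_alert(event: str, ups_name: str,
                sender: str, recipient: str) -> EmailMessage:
    """Build a short plain-text alert about a UPS event."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[UPS] {ups_name}: {event}"
    msg.set_content(f"UPS '{ups_name}' reported: {event}\n")
    return msg


def send_alert(msg: EmailMessage, host: str, port: int,
               user: str, password: str) -> None:
    """Send via an external provider, never the internal mail server,
    which may itself have been shut down by the UPS."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)


# Hypothetical example values:
alert = build_alert("on battery", "comms-ups",
                    "alerts@example.com", "oncall@example.com")
print(alert["Subject"])
```

Whatever sends the alert also needs to be on UPS-backed power and have a working route out of the building, otherwise the message never leaves.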

We have spent lots of time researching and implementing this new solution. To be honest, we don't have many power problems in our office and it will probably not happen again for another 10 years, but we have learnt lots and I personally have enjoyed making our server room as resilient as possible on a limited budget. The other way to look at this is that if we hadn't spent the time looking at our backup power solution now, it might have caused us more problems in the future. What I mean by this is that we were lucky on that fateful Sunday morning: if the fuse hadn't done its job there could have been a fire. The servers were OK, but an ungraceful shutdown could have corrupted data and caused us days of downtime recovering our systems. At the end of the day, lessons have been learnt and the world is now a better place for our increased knowledge in this area.

Are you sure that your server room is fully covered? Give us a ring on 08451221240 for a free server room assessment.
