When a long-running server needs to be rebooted, what are the most important tools? Remember, on many systems reboots can be weeks, months, or even years apart, so a reboot is not a normal occurrence for the machine.
So what would be the best tools to have on hand? Paper and pen. Take extensive notes on everything out of the ordinary that happens as the system comes up – things to fix, things to watch out for, and so on. Recording how long the boot takes is not a bad idea either. Watch for services that are not required and shut them down as needed.
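If you would rather keep the notes electronically (on a second machine, naturally, not the one being rebooted), the idea can be sketched as below. This is only a minimal illustration: the class name `RebootLog` and its methods are made up for this example, and the clock is passed in as a parameter so the timing can be checked without waiting for a real boot.

```python
import time


class RebootLog:
    """Timestamped notes taken while a server comes up (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable; by default it is a monotonic timer,
        # so wall-clock adjustments do not distort the elapsed times.
        self._clock = clock
        self._start = clock()
        self.entries = []  # list of (seconds_since_start, note)

    def note(self, text):
        """Record an observation together with the time since the log was opened."""
        elapsed = self._clock() - self._start
        self.entries.append((elapsed, text))

    def report(self):
        """Return all notes as text, one line per note, in the order taken."""
        return "\n".join(f"[{t:7.1f}s] {text}" for t, text in self.entries)
```

Create the log at the moment the reboot is issued, call `note()` for each oddity, and keep the output of `report()` with the rest of the system's records.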
When debugging the reboot process, make sure to get evidence of a completely clean startup before considering the job done. The job may look like it is done, but if a reboot exposes a failure in configuration or other problems, then it’s not done – and you won’t know unless you reboot.
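One way to gather that evidence is to save the console or boot log and scan it for anything that looks wrong. The sketch below assumes you already have the log as text; the marker words in `SUSPECT_WORDS` are guesses for illustration and will need adjusting for your system, and the function name is hypothetical. An empty result is a hint, not proof, of a clean startup.

```python
# Assumed markers of trouble; not an exhaustive or authoritative list.
SUSPECT_WORDS = ("fail", "error", "timed out", "refused")


def suspicious_lines(log_text, words=SUSPECT_WORDS):
    """Return (line_number, line) pairs whose text contains any marker.

    Matching is case-insensitive and by substring, so "FAILED" matches "fail".
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(word in lowered for word in words):
            hits.append((number, line.strip()))
    return hits
```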
Also, when you reboot, make sure that all subsystems are up and running. Often, important subsystems are deliberately not set to start automatically: the idea is that if the system crashes, they stay off-line until the reason for its demise is fully known. So don’t forget these important subsystems, and start them up after booting – whether the system is Caché or Oracle or some other.
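A checklist of required subsystems, marked by whether each one starts on its own, makes this harder to forget. Here is a small sketch under stated assumptions: the function name and the subsystem names are illustrative, and how you obtain the list of running subsystems is left to whatever your platform provides.

```python
def startup_checklist(required, running):
    """Compare required subsystems against what is actually running.

    required: dict mapping subsystem name -> True if it starts automatically,
              False if it is meant to be started by hand after a boot.
    running:  iterable of names currently running.

    Returns (manual_to_start, auto_but_down): the hand-started subsystems
    still waiting for you, and the automatic ones that failed to come up.
    """
    running = set(running)
    manual_to_start = sorted(
        name for name, auto in required.items() if not auto and name not in running
    )
    auto_but_down = sorted(
        name for name, auto in required.items() if auto and name not in running
    )
    return manual_to_start, auto_but_down
```

Anything in the first list is expected and just needs starting; anything in the second list is a problem worth writing down.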
4 thoughts on “System Reboots Require These Tools and Practices”
As usual, these are excellent tips and advice, but what about when the servers you administer are available only through the network (dedicated colos, etc.)?
Most, if not all, of the Linux servers I work with are available only through SSH.
Do you have any advice for this (other than an expensive terminal server)?
And as usual, thanks for screen.
I didn’t write GNU screen, I just extol its virtues…..
I don’t see that any of the tips I’ve given depend on whether the server is local or remote; the access is the same, and the reboot process is the same.
Last night I rebooted two servers from about 60 miles away. Most UNIX servers will have a secondary management port, either network-based or serial-based or both. The management port allows the administrator to access the system and shut down the server entirely (including power off and power on) without losing any connectivity to the port itself.
This is what I used to reboot those two servers.
Also, you don’t need an expensive terminal server; GNU screen will work just fine.
Oops! My bad, I’m really sorry. I read “author” but missed “book”; I thought you were the author of screen.
Regarding the remote reboot, there is something I don’t quite understand.
You say that you use screen (talk about it…) to log in via the serial port and watch the whole process just as if you were in front of the server.
That is if you have another *nix server configured to be able to connect to the server you are rebooting, right?
If you don’t have a spare box, you do need a terminal server, or another similar solution, to log in to the server that is being rebooted, right?
Yes and no. In my case, I connected to a serial terminal concentrator, and then to a serial-based console.
However, that is actually the exception: the servers come with management consoles such as HP’s Integrated Lights-Out (iLO) or their Management Processor (MP). HP’s iLO offers access via HTTPS, TELNET, and SSH; the MP offers only TELNET and SSH. Both come with the server, offer power off and power on capabilities, and so forth.
Sun systems have something similar; I’m sure IBM PowerPC servers do as well.
Thus, you do not need a terminal concentrator; the main purpose of such a thing was to make the serial ports available over the network – but with these systems, they are available over the network directly.
On a previous HP-9000 system with serial console only, I was able to add a hardware product from HP called HP Secure Web Console which did the same thing. (We’ll ignore the fact that HP’s Secure Web Console….. wasn’t secure.)