September 29, 2016

Top Five Reasons Servers Fail & Tips For Responding

When your server goes down, the first obstacle is discovering what went wrong. Getting your server up and running again as soon as possible is mission-critical, so we’ve put together the top five reasons servers fail, and what you can do to minimize the risks for next time. If you’re in the middle of a crisis, give us a call at 770.441.2520 and we’ll work with you to pinpoint the problem.

Disk Failure

Not even server hard drives last forever, and when one disk fails, entire systems can stop working. It’s never certain how long your drives are going to last, but failure rates spike sharply after about four years. The median lifespan of a drive is just over six years, but it’s wise to start keeping a close eye on your drives and their performance around the four-year mark.

How a drive is used also affects its lifespan. The more use a drive sees, the faster it is likely to degrade. A drive in a server used primarily for storage, for example, will likely last longer than one in a production server.

Minimize this risk by performing regular disk checks, and keep backups ready for when a drive does fail.
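
As a rough illustration of keeping an eye on drives near the four-year mark, here is a minimal Python sketch. It assumes you keep your own record of when each drive was installed; the drive names, dates, and function names are hypothetical, and the four-year threshold is simply the figure mentioned above, not a hard rule.

```python
from datetime import date
from typing import Dict, List

# Threshold taken from the article: failure rates climb after about four years.
WATCH_AGE_YEARS = 4.0


def drive_age_years(installed: date, today: date) -> float:
    """Return the drive's age in years (approximate, using 365.25-day years)."""
    return (today - installed).days / 365.25


def drives_to_watch(inventory: Dict[str, date], today: date) -> List[str]:
    """Return the names of drives at or past the watch threshold, sorted by name."""
    return sorted(
        name
        for name, installed in inventory.items()
        if drive_age_years(installed, today) >= WATCH_AGE_YEARS
    )


if __name__ == "__main__":
    # Hypothetical inventory: drive label -> install date.
    inventory = {
        "server1-disk0": date(2012, 1, 1),
        "server1-disk1": date(2015, 6, 1),
    }
    for name in drives_to_watch(inventory, date.today()):
        print(f"{name}: past the four-year mark, check it more often")
```

A list like this only tells you which drives deserve closer attention. It does not replace the disk checks themselves or the backups.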

Virus Attack

This is one of the most common causes of server failure. A security breach can lead to corruption or even loss of data in addition to system downtime. If you host websites on your server, they can be flagged by browsers and search engines as unsafe, preventing current and potential customers from even visiting your site.

Luckily, you can greatly reduce the risk of viruses by following security best practices and having a reliable business-class antivirus solution and firewall on your server. You should also enforce access controls: not every user needs to be an administrator, and not every administrator needs full system access. Monitor who has what access level and update access levels whenever there are changes in the company, such as an employee leaving.
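
To make the access review concrete, here is a small Python sketch that compares a list of server accounts against a list of current staff. Everything in it is an assumption for illustration: the account names, the "admin" and "user" role labels, and the function name are hypothetical, and in practice the two lists would come from your own directory and HR records.

```python
from typing import Dict, Set


def accounts_to_review(accounts: Dict[str, str], current_staff: Set[str]) -> Dict[str, str]:
    """Map each account that needs attention to a suggested action.

    accounts: account name -> role ("admin" or "user"; hypothetical labels).
    current_staff: account names of people still with the company.
    """
    findings = {}
    for user, role in accounts.items():
        if user not in current_staff:
            # The person has left, so the account should not stay active.
            findings[user] = "disable: no longer with the company"
        elif role == "admin":
            # Admin rights are worth re-confirming on every review.
            findings[user] = "confirm admin access is still needed"
    return findings


if __name__ == "__main__":
    accounts = {"alice": "admin", "bob": "user", "carol": "admin"}
    current_staff = {"alice", "bob"}
    for user, action in accounts_to_review(accounts, current_staff).items():
        print(f"{user}: {action}")
```

Running a review like this whenever someone joins, leaves, or changes roles keeps the access list from drifting out of date.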

Educate your employees on security as well. Let them know how to tell what URLs can be dangerous and what attachments may contain viruses.

Failed Upgrades/Updates

Upgrades and updates are double-edged swords. Bad ones can cause frustrating errors and even server failure. But if you don’t update, then you leave your server vulnerable.

How do you solve this conundrum? Always make sure you have good, up-to-date backups, but also consider waiting a little bit before updating/upgrading. Our system administrator recommends waiting about four days to a week to give time for bugs in updates to be discovered and fixed before you install them on your servers. That way you don’t have to deal with the growing pains.

For upgrades, wait a bit longer and make sure you run a full backup right before you upgrade to avoid losing any data. Upgrades are larger and incorporate more changes, meaning there’s more of a chance for something to go wrong.

To avoid waiting too long, you can automatically update small things and manually update the larger ones. A change to a browser? No big deal. A change to your CRM? Much bigger deal.
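
The waiting policy described in this section can be sketched as a simple rule in Python. This is only an illustration under stated assumptions: the three categories ("minor", "update", "upgrade") and the function name are hypothetical, the seven-day wait is the upper end of the range given above, and the fourteen-day wait for upgrades is our own placeholder for "a bit longer".

```python
from datetime import date, timedelta

UPDATE_WAIT = timedelta(days=7)    # upper end of the "four days to a week" range
UPGRADE_WAIT = timedelta(days=14)  # assumption: stands in for "a bit longer"


def plan(kind: str, released: date, today: date) -> str:
    """Suggest what to do with a release, given its kind and release date."""
    if kind == "minor":
        # Small, low-risk changes can be applied automatically.
        return "auto"
    if kind == "update":
        wait = UPDATE_WAIT
    elif kind == "upgrade":
        wait = UPGRADE_WAIT
    else:
        raise ValueError(f"unknown release kind: {kind!r}")
    if today - released < wait:
        # Give others time to find and report bugs first.
        return "wait"
    # Upgrades change more, so take a full backup right before installing.
    return "install" if kind == "update" else "backup-then-install"


if __name__ == "__main__":
    today = date(2016, 9, 29)
    print(plan("minor", date(2016, 9, 28), today))
    print(plan("update", date(2016, 9, 27), today))
    print(plan("upgrade", date(2016, 9, 1), today))
```

The exact waiting periods matter less than having a consistent rule, so adjust them to how critical each system is.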

Poor Documentation

Poor documentation is a major contributor to server failure. Even if your issue is small, you need to be able to log in and actually fix it. If something stops working, and you don’t remember the proper credentials to log in or know where the credentials are kept, then you can’t get access to troubleshoot or correct it.

Keep a proper record of all your credentials in a secure place. If you don’t feel the need to own the credentials yourself, make sure you know where they are kept. You don’t want employees other than your system administrator accessing the back end of your server on a whim, but you need to have someone available with the information should your system administrator leave or become unavailable during a crisis.

If you outsource this, make sure that the company provides you with up-to-date documentation. Should you want to switch companies or start handling this in-house, you don’t want them to be able to hold your documentation hostage.

Physical Disaster

Your server room floods. There’s an office fire. A tornado comes through your building. It’s difficult to account for acts of nature and other physical disasters. These events can cause anything from slight damage to a server to complete destruction of it.

That’s why you should always have backups running. Backups are not a set-up-and-forget thing either. You should be checking regularly to make sure your backups are running. Many companies have backups stored away somewhere without anyone checking on them, and are unpleasantly surprised when they finally need them and find that no data has been backed up for months. Not keeping good backups can cost companies months of data in the case of a server failure.
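
One simple way to check that backups are still running is to look at how old the newest backup file is. The Python sketch below assumes your backups land as files in a single folder; the folder path, the 26-hour limit (a nightly backup plus some margin), and the function names are all hypothetical. A fresh file only shows the job ran, so it is still worth testing a restore from time to time.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_HOURS = 26.0  # assumption: nightly backups, plus a little margin


def newest_backup_age_hours(backup_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Return the age in hours of the newest file in backup_dir, or None if it is empty."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(backup_dir: Path, max_age_hours: float = MAX_AGE_HOURS,
                    now: Optional[float] = None) -> bool:
    """Return True if there is no backup at all or the newest one is too old."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours


if __name__ == "__main__":
    backup_dir = Path("/backups/server1")  # hypothetical location
    if not backup_dir.is_dir():
        print("Backup folder not found")
    elif backup_is_stale(backup_dir):
        print("WARNING: no recent backup found")
    else:
        print("Newest backup is recent")
```

Scheduling a check like this to run daily, and alerting someone when it fails, catches a stalled backup job within a day instead of months later.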

You can also protect the data on your server by running additional backups off-site. That way your backups won’t be affected in the same physical event that damaged your server.

About the Author: Eric Henderson is Rocket IT’s virtual Chief Information Officer. He is also the tallest person at Rocket IT (by a fraction of an inch).

Have you found that you need the expertise of a Chief Information Officer to help you make strategic decisions on how to leverage technology to meet your unique business goals, but aren’t ready to commit to hiring a full-time executive to fill that need? Learn about our virtual CIO services.
