The potential for service outages is something that keeps many a system administrator awake at night. While Ambien may be able to combat stress-induced insomnia, a better cure for sleepless nights may come in the form of implementing best practices for recovering from a service outage. That said, outages come in many shapes and sizes, with some attributable to connectivity or other issues that may be beyond the control of even the best system administrator.

However, plenty of service and data outages can be addressed quickly, such as those caused by failed hard drives, corrupt software, malware, and end-user errors, all of which can lead to data loss or system failures. Combating those issues and getting systems operational again can be challenging unless you have the proper procedures and products in place. That includes, but is not limited to, having a good, reliable backup.

Here are 10 best practices to help you get started.

  1. Have a recovery plan. A written plan to prepare for emergencies is a must. All staff should know the plan and take ownership of both their day-to-day role and their role in an emergency. Establish different layers of redundancy as well as what to do when data loss happens. Review the plan regularly to keep it up to date.
  2. Use multiple storage methods. Using multiple methods of data storage provides flexibility and redundancy. The cloud can offer protection from on-site risks and allows for quick recovery of information. Include image-based backups as well as file and folder backups as part of the backup process. Test backups to make sure they restore correctly, and make virtual machines available on your network to serve as backups, which can bring processes back online in minutes, not hours.
  3. Capitalize on automation. Automation can provide constant protection that avoids gaps in data backups. Automated processes remove forgotten backups from the equation and can eliminate the oversights that sometimes occur with manual processes. A correctly configured automated backup will complete each step before moving on to the next, and it will warn administrators of failures.
  4. Incorporate snapshots. Frequent backups can be accomplished using snapshot technology, which can capture changes to data every few minutes, allowing administrators to recover systems from just a few minutes before a crash.
  5. Prioritize ease of use. Select backup and restore products that are easy to implement and easy to use. Simplifying the backup process goes a long way toward ensuring backups are actually completed, and the same can be said for restoration.
  6. Monitor backups. Monitoring and reporting are among the best ways to confirm that backups are functioning as intended, and they help ensure the information needed for recovery is readily available and well understood.
  7. Validate database and OS recovery tools. Some restorations are not as straightforward as others, with problems caused by corrupted boot sectors or damaged indexes. Make sure you have the proper tools to deal with those problems on hand during a recovery.
  8. Audit regularly. As part of the backup process, conduct regular system audits to identify changes to the infrastructure and confirm that all critical systems are backed up. Audits can also uncover shadow IT systems that need to be backed up.
  9. Consider replication. By replicating systems, applications, and data in real time, recovery can become a simple matter of switching over to replicated systems in case of a primary system failure.
  10. Simulate failures. Simulating a system failure can prepare administrators for an actual failure, if and when one occurs, and helps identify the unknowns that normally impact an untested restoration process.
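
As a rough illustration of how practices 2, 3, and 10 might fit together, here is a minimal Python sketch that copies a directory to a backup location, records a checksum for every file, and then performs a trial restore into a scratch directory to confirm the backup can actually be recovered. It is not tied to any particular backup product: the function names (`back_up`, `verify_restore`, `manifest`) and the plain directory-copy approach are assumptions made for the example, and a real deployment would more likely rely on a dedicated backup tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def back_up(source: Path, backup: Path) -> dict[str, str]:
    """Copy source into backup and return the manifest of the source.

    Hypothetical helper: a plain directory copy stands in for a real
    backup product here.
    """
    shutil.copytree(source, backup, dirs_exist_ok=True)
    return manifest(source)


def verify_restore(backup: Path, expected: dict[str, str]) -> list[str]:
    """Trial-restore the backup and list files that are missing or differ."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restore"
        shutil.copytree(backup, restored)  # the simulated restore
        actual = manifest(restored)
    return sorted(
        name for name, digest in expected.items() if actual.get(name) != digest
    )
```

A scheduled job could call `back_up` and then `verify_restore`, alerting an administrator whenever the returned list is not empty. That covers the automation and warning behavior described in practice 3, and running the trial restore on every cycle is a small-scale version of the failure simulation in practice 10.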

With a little bit of planning and the adoption of the proper technology, restoring systems and services can become a much less stressful chore, and operations are far more likely to be brought back up quickly and effectively.

And, perhaps best of all, it makes it possible for system administrators to sleep at night!