Tuesday, July 10, 2007

Disaster Recovery Plan

Scenario

The departmental file server is unavailable due to multiple hard drive failures. The server uses RAID 5 redundancy with a hot-spare drive for each volume, which allows a single drive to fail in each volume without losing access to the data. When multiple drives fail in the same volume, the volume becomes inaccessible and cannot be recovered by the RAID system.
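The failure rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the state labels are made up for this example, and it assumes the count refers to drives that are down at the same time in one volume, before any rebuild onto the hot spare has finished.

```python
def raid5_volume_state(failed_drives: int) -> str:
    """Classify one RAID 5 volume by the number of simultaneously failed drives.

    Hypothetical helper for illustration only; it is not part of any RAID tool.
    """
    if failed_drives < 0:
        raise ValueError("failed_drives cannot be negative")
    if failed_drives == 0:
        return "healthy"
    if failed_drives == 1:
        # RAID 5 parity still serves the data; the hot spare can be rebuilt.
        return "degraded"
    # Two or more concurrent failures exceed what single parity can cover.
    return "failed"


if __name__ == "__main__":
    for count in range(3):
        print(count, raid5_volume_state(count))
```

The "failed" state is the scenario this plan covers: the volume has to be recreated and restored from backup.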

Severity of Disaster

The departmental server provides file storage services for all persons affiliated with the department, including staff, faculty and graduate students. The server does not store any data whose inaccessibility would seriously interrupt departmental operations, and the server might be off-line for up to three days under some circumstances.

Affected Equipment

Two Dell PowerEdge 2800 servers with an attached EMC CX-300 storage array unit. The equipment is located in the server room in building x, room number y, on rack number two.

Contact Information for Support Team

A list of names with home and cell phone numbers.

Preventive Measures

The departmental IT support group will check the server logs daily for errors and physically inspect the server twice a week for warning and failure lights. Any individual hard disk failure must be addressed immediately to prevent multiple hard drives from going off-line. The server is backed up daily to tape by the departmental backup server.
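The daily log check could be partly automated. The sketch below is a minimal example, not the team's actual tooling: the keyword list, the function name and the sample log lines are all made up for illustration, and the real log format of the server or storage array may differ.

```python
import re

# Hypothetical alert keywords; adjust to what the server's logs actually contain.
ALERT_PATTERN = re.compile(
    r"\b(error|fail(ed|ure)?|degraded|offline)\b", re.IGNORECASE
)


def find_alert_lines(log_lines):
    """Return (line_number, text) pairs for lines containing an alert keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if ALERT_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Made-up sample lines, used only to exercise the function.
    sample = [
        "Jul 10 02:00:01 storage: nightly backup completed\n",
        "Jul 10 03:14:22 storage: disk 3 FAILED, hot spare engaged\n",
        "Jul 10 03:14:25 storage: volume 1 degraded\n",
    ]
    for number, text in find_alert_lines(sample):
        print(f"line {number}: {text}")
```

A script like this only narrows down what to read; it does not replace the twice-weekly physical check for warning and failure lights.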

Response Measures

  1. Loss of access to data is detected by the IT team's monitoring service or reported by users.

  2. The first available person(s) will investigate the severity of the failure
    • by remotely accessing the server and checking logs and volumes

    • by physically visiting the server room to check signals and warning/failure lights

    • these tasks might be done by two different support persons at the same time while keeping in contact by cell phone

  3. The person who first discovered or responded to the failure will contact other team members, depending on the severity of the problem.

  4. A support person will inform key departmental personnel about the failure (Chairman, Head Administrator and Computer Committee Chair).

  5. A notification email will be sent to all department members to keep them informed and up to date on the issue. Clients should also be notified of any time periods when the server will be unavailable. Throughout the recovery process, clients should be kept up to date on progress and on the expected time when the server will return to normal operation.

  6. If the team is available on-site, a meeting will be organized to discuss the situation and assign tasks.

  7. After a thorough assessment of the situation, Dell will be contacted for support assistance.

  8. The support team will troubleshoot and fix hardware problems.

  9. The support team will recreate the volume that failed and was lost.

  10. Data will be restored from backup to the affected volume.

  11. Security settings on the volumes will be double-checked and readjusted.

  12. Clients will be notified that the server storage is back online and asked to test access and report any problems.

  13. After recovery, the procedures and their success will be evaluated. The incident and all of its details will be recorded and stored with other IT documents.

  14. The incident will be reported to key departmental personnel.


Expected Timeline

The response measures and the recovery of one lost volume should be completed within two days of the failure. Hardware troubleshooting and recovery should be finished within twelve hours of the delivery of replacement parts from Dell, which should arrive within four hours of the support call.
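To see how these targets fit together, the deadlines can be worked out from the time of the failure and the time of the support call. The sketch below uses only the durations stated in this plan (four hours, twelve hours, two days); the function name, the dictionary keys and the example timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Durations taken from the plan.
PARTS_DELIVERY = timedelta(hours=4)        # parts arrive after the support call
REPAIR_AND_RECOVERY = timedelta(hours=12)  # work finished after parts arrive
OVERALL_TARGET = timedelta(days=2)         # full recovery after the failure


def recovery_deadlines(failure_time, support_call_time):
    """Compute the target times implied by the plan (hypothetical helper)."""
    parts_due = support_call_time + PARTS_DELIVERY
    repair_due = parts_due + REPAIR_AND_RECOVERY
    overall_due = failure_time + OVERALL_TARGET
    return {
        "parts_due": parts_due,
        "repair_due": repair_due,
        "overall_due": overall_due,
        # Time left between finishing repairs and the two-day target.
        "slack": overall_due - repair_due,
    }


if __name__ == "__main__":
    # Example timestamps, chosen only for illustration.
    failure = datetime(2007, 7, 10, 9, 0)
    call = datetime(2007, 7, 10, 11, 0)
    for name, value in recovery_deadlines(failure, call).items():
        print(f"{name}: {value}")
```

With these example times, repairs are due sixteen hours after the support call, which leaves thirty hours of slack before the two-day target. A late support call reduces that slack directly.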
