IT4K12 – Disaster Recovery
These are my notes from various sessions at IT4K12 in Vancouver Nov 17-18, 2016. They may be messy, and there may be mistakes, and it may not be exactly what the presenter wanted to be remembered, but it’s what resonated with me. – Todd
Gregg Ferrie, Director of Information Technology
Gregg reminded us of Murphy’s Law: anything that can go wrong, will go wrong. He does worry about what happens if we don’t have good DR, and he had piles of great examples and ideas about things to consider for a good DR and BC plan.
What kinds of disasters do we worry about? Well, on the island certainly we worry about Natural Disasters like Fire, Flood, Earthquake. But also Human Disasters: Loss of Internet, Sabotage, Theft, Extended Power Failure, Hack Attack, Terrorist Attack.
There have been a number of cases recently where organizations have had problems. There was the U of C ransomware attack. King’s College London lost a single RAID on October 17th, and on October 31st they were still not up and running. 29,000+ students, 9,000 staff and 1 backup server.
Disaster Recovery is a lot like insurance. You buy it every year, you pay premiums, you complain about the cost, but when you need it, you’re glad it’s there.
Delta Airlines had a small fire in their data centre that led to hundreds of millions of dollars lost, through thousands of cancelled flights and lost future revenue.
Questions to ask yourself:
- What would you do if your critical information systems became unavailable due to some catastrophic event? e.g. fire or flood in the data centre with complete loss.
- What would you do if the School Board Office and the data centre burned to the ground one Friday evening just before Christmas break?
- It has been said that “Failing to plan is planning to fail.”
Saanichton – old elementary school building, with small data centre, etc.
Reasons Why You Should Prepare
- Because your district auditors strongly recommend it as a good business practice
- Because the Government of BC’s CIO office requires it
- Because the architecture of the NGN could put you out of commission for up to 45 days and beyond at all sites
- Jobs might depend on it – including yours
- Because if you “fail to prepare, you are preparing to fail”
What is Disaster Recovery?
- DR is a set of policies and procedures to enable recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster
- DRP typically is the domain of the technology department
What is Business Continuity?
- Maintain a minimum level of service in the event of a disaster or catastrophe
- It is about the ability to restore the district to business as usual
- It is planning to mitigate unanticipated risk…
What are the differences?
- BC is proactive; its focus is to avoid or mitigate the impact of risk
- DR is reactive; its focus is to pick up the pieces and to restore the organization to business as usual after a disaster happens
- DR is considered a subset of BC
- Do site servers, schools, or departments have built-in redundancy? Including RAID, etc.
- Are critical spares kept locally or at the district office?
- Are offsite spares, equipment available quickly?
- Are site and district servers backed up regularly?
- Are they getting backed up to the Central Data Centre?
- Are backups regularly verified and tested?
- Do you also back up offsite (secondary site, tape or cloud)?
- To be clear, good backups are NOT Disaster Recovery or Business Continuity!
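The “regularly verified and tested” question above can be partly automated. Here is a minimal sketch, assuming you do a test restore into a scratch folder and then compare it against the live files by checksum. The function names (`sha256_of`, `verify_restore`) and the folder layout are made up for illustration and are not anything Gregg showed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under source_dir with its restored copy.

    Returns the relative paths that are missing or differ in the restore.
    An empty list means the test restore matched the source.
    """
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

Files that changed after the backup ran will show up as “differs”, so a check like this makes the most sense against a snapshot or a quiet folder. It only tells you the backup is readable and complete, which, as the last bullet says, is still not a DR plan.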
Saanich is also looking for a tertiary backup system in addition to their primary and secondary. It could be cloud-based, for data alone. Most services are warm or hot. With warm services the server is running, but the data may be a day out of date. Hot services, like e-mail, are synchronized all the time.
Data Centre Safeguards
- Is the Data Centre secure?
- Does the Data Centre have environmental controls?
- Does the Data Centre have fire suppression designed for a computing environment?
- Does it have a backup generator?
Disasters or Catastrophes
- If you only maintain data backups, how rapidly can you rebuild critical and non-essential systems?
- Do you maintain spares of servers, drives, power supplies, etc.?
- How does the district re-establish essential servers, and how quickly?
- Hence the need for a Disaster Recovery Plan
Saanich’s site already has the data copied over. They would need to change some IPs and fire up some equipment, and then they would be running. Wouldn’t it be great to have all of the central office services redundant to another site?
Disaster Recovery Planning
- Start with the basics
- Risk Analysis and Assessment is the first step
- Review and change Backup and Restore procedures if necessary
- Determine if a viable Failover site exists
- Determine if you are going to have a cold, warm, or hot Failover – if at all
- All of this is predicated on how soon each department/service is required to be operational again
- Education, for instance, might only require access to server-based files
- HR/Finance/Payroll, however, might require minimal services within 72 hours and need to be fully operational within a week
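The per-department timing in the last few bullets can be written down as a small table and checked against how long a given failover option is expected to take. This is only a rough sketch: the 72 hour and one week figures for HR/Finance/Payroll come from the notes above, while the class name, function name and the estimated recovery times in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTarget:
    """How long a department can wait for service after a disaster."""
    department: str
    minimal_service_hours: float  # deadline for minimal services
    full_service_hours: float     # deadline for being fully operational


# Figures from the notes: minimal services in 72 hours, fully operational in a week.
HR_FINANCE_PAYROLL = RecoveryTarget("HR/Finance/Payroll", 72, 7 * 24)


def breaches(target: RecoveryTarget,
             est_minimal_hours: float,
             est_full_hours: float) -> list[str]:
    """List the targets that an estimated recovery time would miss."""
    found = []
    if est_minimal_hours > target.minimal_service_hours:
        found.append(
            f"{target.department}: minimal service estimated at "
            f"{est_minimal_hours:g} h, target is {target.minimal_service_hours:g} h"
        )
    if est_full_hours > target.full_service_hours:
        found.append(
            f"{target.department}: full service estimated at "
            f"{est_full_hours:g} h, target is {target.full_service_hours:g} h"
        )
    return found
```

For example, if a cold failover were estimated (purely as an assumption) to need 96 hours before minimal services are back, `breaches(HR_FINANCE_PAYROLL, 96, 120)` would flag the 72 hour target, which is the kind of result that pushes you toward a warm or hot failover for that department.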
Gregg reviewed the basic steps for building a DR plan and also a BC plan. The big difference is that DR is really just IT, but BCP involves people and needs to include all of the Sr. Leadership team. The BCP team meets monthly, and sometimes it’s hard to keep everyone on task, but it’s important.
Gregg has some concerns about the NGN, as all schools now go directly to their Board Office Primary. If the Board Office goes down, it can take a long time, 30-45 days, to re-route the NGN. They are therefore working on an architecture that gets a failover site connected to the NGN as well. This requires a Bias Failover line. The plan is that, if the Board Office burns down and everything goes well, service can be running within about a week through the failover site.
Gregg is able to get better sleep at night as a result of the work and planning that their team has done.