What should be in my IT disaster recovery plan?
What every business needs in a disaster recovery plan: essential components, testing requirements, and common mistakes that make plans fail.
Key Takeaways
- A disaster recovery plan defines how you restore IT systems after an incident - it's your playbook for the worst day
- Two critical metrics: RTO (how fast you need systems back) and RPO (how much data you can afford to lose)
- Essential components: backup strategy, communication plan, roles and responsibilities, vendor contacts, and step-by-step procedures
- Plans that are never tested are plans that will fail - test at least annually with quarterly checks for critical systems
- Business continuity keeps operations running during disruption; disaster recovery restores systems after - you need both
Many businesses don’t have a disaster recovery plan. Of those that do, many have never tested it. And of those that have tested it, plenty discovered it didn’t actually work. Here’s how to build one that does.
What Is a Disaster Recovery Plan?
A disaster recovery plan (DRP) is a documented, step-by-step playbook for restoring your IT systems after a disruptive event. It answers a simple question: When something goes terribly wrong, what exactly do we do?
“Something terribly wrong” can mean many things:
- Ransomware encrypts all your files and servers
- A fire or flood damages your office and equipment
- A server fails and takes your core applications offline
- A cloud provider has a major outage
- A disgruntled employee deletes critical data
- A power surge fries your network equipment
- A cyberattack compromises your systems
A disaster recovery plan doesn’t prevent these events (that’s what your cybersecurity and maintenance programs do). It ensures that when they happen — and eventually, something will happen — you can recover quickly and with minimal data loss.
Disaster Recovery vs. Business Continuity
These terms are often used interchangeably, but they’re different:
Business Continuity Plan (BCP): How your business keeps operating during a disruption. This covers operations, communications, personnel, and temporary workarounds. Example: if your office is flooded, where do employees work? How do customers reach you?
Disaster Recovery Plan (DRP): How you restore your IT systems and data after a disruption. This is the technical playbook for getting servers, applications, data, and communications back online.
You need both. Business continuity keeps the business running (even in a degraded state) while disaster recovery gets your systems restored. A DRP is typically a subset of a broader BCP.
| Aspect | Business Continuity | Disaster Recovery |
|---|---|---|
| Focus | The entire business | IT systems and data |
| Goal | Keep operating | Restore systems |
| Timeframe | During the event | After the event |
| Covers | People, processes, facilities, technology | Servers, applications, data, network |
| Example | Employees work from home during a flood | Servers are restored from backup after a ransomware attack |
The Two Most Important Metrics: RTO and RPO
Before you build your plan, you need to define two numbers for every critical system in your business:
RTO (Recovery Time Objective)
How fast do you need this system back?
RTO is the maximum amount of time you can tolerate a system being down before the impact becomes unacceptable.
Examples:
- Email: RTO of 4 hours (you can survive a few hours without it, but not a day)
- Point-of-sale system: RTO of 1 hour (every minute without it costs you sales)
- Accounting system: RTO of 24 hours (not ideal, but you can manage for a day)
- Development server: RTO of 48 hours (annoying but not business-critical)
RPO (Recovery Point Objective)
How much data can you afford to lose?
RPO is the maximum amount of data loss you can tolerate, measured in time. If your RPO is 4 hours, you’re saying “we can accept losing the last 4 hours of data.”
Examples:
- Financial transactions: RPO of 15 minutes (losing more than a few minutes of sales data is unacceptable)
- Email: RPO of 1 hour (losing an hour of email is manageable)
- File shares: RPO of 24 hours (if you lose a day of file changes, you can recover)
- Archive data: RPO of 1 week (changes infrequently)
Why RTO and RPO Matter
Your RTO and RPO directly determine your backup strategy, your infrastructure requirements, and your costs.
| RPO | Backup Strategy Needed |
|---|---|
| Near zero | Real-time replication (most expensive) |
| 15 minutes | Continuous data protection or frequent snapshots |
| 1 hour | Hourly backups |
| 4 hours | Backups every 4 hours |
| 24 hours | Daily backups (most common for SMBs) |
Your RTO, in turn, determines the infrastructure you need:
| RTO | Infrastructure Needed |
|---|---|
| Minutes | Hot standby (duplicate systems running, ready to take over) |
| 1-4 hours | Warm standby (systems configured, just need to be started and data loaded) |
| 8-24 hours | Cold standby (hardware available, needs full restoration from backup) |
| 24-72 hours | Basic backup restoration (restore to new or repaired hardware) |
The faster and tighter your targets, the more it costs. That’s why it’s important to set different RTO/RPO values for different systems based on their actual business impact, rather than making everything “as fast as possible.”
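As a rough illustration, the two tables above can be written as lookup functions. This is a minimal sketch, not a sizing tool: the function names are hypothetical, the one-minute cutoff for "near zero" is an assumption, and the 4-8 hour gap between the warm and cold standby rows is folded into cold standby.

```python
from datetime import timedelta


def backup_strategy_for_rpo(rpo: timedelta) -> str:
    """Return the backup strategy row that an RPO falls under."""
    if rpo <= timedelta(minutes=1):  # assumed cutoff for "near zero"
        return "real-time replication"
    if rpo <= timedelta(minutes=15):
        return "continuous data protection or frequent snapshots"
    if rpo <= timedelta(hours=1):
        return "hourly backups"
    if rpo <= timedelta(hours=4):
        return "backups every 4 hours"
    # Simplification: anything looser than 4 hours still gets daily backups.
    return "daily backups"


def infrastructure_for_rto(rto: timedelta) -> str:
    """Return the infrastructure row that an RTO falls under."""
    if rto < timedelta(hours=1):
        return "hot standby"
    if rto <= timedelta(hours=4):
        return "warm standby"
    if rto <= timedelta(hours=24):  # assumed: the 4-8 hour gap lands here
        return "cold standby"
    return "basic backup restoration"


# Point-of-sale example from above: 1-hour RTO, 15-minute RPO on transactions.
pos_plan = (
    infrastructure_for_rto(timedelta(hours=1)),
    backup_strategy_for_rpo(timedelta(minutes=15)),
)
```

Running the same two functions over every system in your inventory makes the cost of an overly tight target visible before you commit to it.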
Essential Components of a Disaster Recovery Plan
1. Business Impact Analysis
Before you plan how to recover, you need to understand what matters most. A business impact analysis (BIA) identifies:
- Critical systems — which applications and services must be running for your business to operate?
- Dependencies — what does each system depend on? (network, internet, power, other applications)
- Impact of loss — what happens if each system is down for 1 hour? 4 hours? 24 hours? 1 week?
- Priority order — in what sequence should systems be restored?
A practical approach for small businesses:
List every system your business uses. For each one, answer: “If this system were down right now, what would happen?” Rate the impact as Critical (business stops), High (major disruption), Medium (significant inconvenience), or Low (minor impact).
Critical and High systems get the tightest (shortest) RTO/RPO targets and the most robust backup strategies. Low-impact systems can tolerate longer recovery times.
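The rating exercise above can be kept in a spreadsheet, but a small script shows the idea just as well. The sketch below is illustrative only: the `SystemEntry` type and the impact ratings assigned to each system are assumptions, while the RTO values reuse the earlier examples.

```python
from dataclasses import dataclass

# Lower number = restore sooner. The four labels come from the rating scale above.
IMPACT_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass
class SystemEntry:
    name: str
    impact: str       # one of the keys in IMPACT_RANK
    rto_hours: float


def restore_priority(systems: list[SystemEntry]) -> list[SystemEntry]:
    """Order systems by impact rating first, then by the tighter RTO."""
    return sorted(systems, key=lambda s: (IMPACT_RANK[s.impact], s.rto_hours))


# Hypothetical inventory; the impact ratings here are assumed for illustration.
inventory = [
    SystemEntry("Development server", "Low", 48),
    SystemEntry("Email", "High", 4),
    SystemEntry("Accounting", "Medium", 24),
    SystemEntry("Point-of-sale", "Critical", 1),
]

priority_names = [s.name for s in restore_priority(inventory)]
```

This ordering reflects business impact only. Technical dependencies (for example, the network must be up before any server) still have to be layered on top.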
2. Backup Strategy
Your backup strategy is the foundation of disaster recovery. Without reliable backups, your plan is just a document.
The 3-2-1 Rule
At minimum, follow the 3-2-1 backup rule:
- 3 copies of your data (original plus two backups)
- 2 different types of storage media (local plus cloud, or disk plus tape)
- 1 copy off-site (in case your office is destroyed)
Modern Best Practice: 3-2-1-1-0
Many organizations now follow an enhanced version:
- 3 copies of your data
- 2 different storage types
- 1 off-site copy
- 1 immutable (unchangeable) copy (protects against ransomware)
- 0 errors (verified through regular testing)
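One way to keep yourself honest about the 3-2-1-1-0 rule is to check your backup inventory against it. The following is a minimal sketch under stated assumptions: `DataCopy` and `check_3_2_1_1_0` are hypothetical names, and "0 errors" is interpreted here as "every backup copy was test-restored with zero errors".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCopy:
    label: str
    media: str                            # e.g. "disk", "cloud", "tape"
    offsite: bool = False
    immutable: bool = False
    is_backup: bool = True                # False for the production original
    verify_errors: Optional[int] = None   # None means never test-restored


def check_3_2_1_1_0(copies: list[DataCopy]) -> dict[str, bool]:
    """Evaluate each part of the rule; `copies` includes the original."""
    backups = [c for c in copies if c.is_backup]
    return {
        "3 copies": len(copies) >= 3,
        "2 storage types": len({c.media for c in copies}) >= 2,
        "1 off-site": any(c.offsite for c in backups),
        "1 immutable": any(c.immutable for c in backups),
        # An untested backup (None) counts as a failure, not a pass.
        "0 errors": bool(backups) and all(c.verify_errors == 0 for c in backups),
    }


# Hypothetical inventory for a single file server.
inventory = [
    DataCopy("production file server", media="disk", is_backup=False),
    DataCopy("local NAS backup", media="disk", verify_errors=0),
    DataCopy("cloud backup", media="cloud", offsite=True, immutable=True,
             verify_errors=0),
]

report = check_3_2_1_1_0(inventory)
```

Treating "never tested" as a failure is a deliberate choice in this sketch: a backup that has not been restored is not yet known to be good.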
What to Back Up
- Servers (full system images, not just data)
- Cloud data (Microsoft 365, Google Workspace — yes, cloud data needs backup too)
- Databases
- Application configurations
- Network device configurations
- User data and workstations (if applicable)
Backup Frequency
| Data Type | Recommended Frequency |
|---|---|
| Critical databases | Every 15-60 minutes |
| File servers | Daily (minimum) |
| Email / cloud apps | Daily |
| Full server images | Weekly |
| System configurations | After every change |
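A schedule only helps if the jobs actually run. The sketch below flags backups that are older than the recommended frequency for their data type. It assumes you can export each job's last successful run time from your backup tool; the job names, the `stale_backups` function, and the choice of 60 minutes for critical databases are all illustrative.

```python
from datetime import datetime, timedelta

# Maximum acceptable age per data type, taken from the table above.
MAX_BACKUP_AGE = {
    "critical database": timedelta(minutes=60),
    "file server": timedelta(days=1),
    "email / cloud apps": timedelta(days=1),
    "full server image": timedelta(weeks=1),
}


def stale_backups(jobs: list[tuple[str, str, datetime]], now: datetime) -> list[str]:
    """Return job names whose last successful backup is older than allowed.

    Each job is (name, data_type, last_success).
    """
    return [
        name
        for name, data_type, last_success in jobs
        if now - last_success > MAX_BACKUP_AGE[data_type]
    ]


# Hypothetical job history.
now = datetime(2024, 1, 8, 12, 0)
jobs = [
    ("sales-db", "critical database", datetime(2024, 1, 8, 9, 0)),      # 3 hours old
    ("files", "file server", datetime(2024, 1, 8, 1, 0)),               # 11 hours old
    ("dc01-image", "full server image", datetime(2023, 12, 30, 2, 0)),  # over 9 days old
]

overdue = stale_backups(jobs, now)
```

In practice you would probably allow some slack for job run time, but the principle stands: the age of your newest good backup is your real RPO, whatever the plan says.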
3. Roles and Responsibilities
When disaster strikes, everyone needs to know their role. Ambiguity during a crisis leads to wasted time, duplicated effort, and critical tasks falling through the cracks.
Key roles to define:
- Disaster Recovery Coordinator: the person in charge of the overall recovery effort. This should be a specific person, not “whoever is available.” Name a primary and a backup.
- IT Recovery Team: the technical people (internal or MSP) who will actually restore systems. Define who handles what: servers, network, cloud, workstations, phones.
- Communications Lead: who communicates with employees, customers, vendors, and (if needed) media? What do they say? Through what channels?
- Business Operations Lead: who manages the business side during the disruption? Workarounds, manual processes, customer handling.
- Executive Decision Maker: who has authority to approve emergency spending, activate contracts, or make strategic decisions during the crisis?
4. Communication Plan
Communication during a disaster is critical and frequently mishandled. Your plan should address:
Internal Communication
- How will you notify employees? (If email is down, you need alternatives: phone tree, text messaging, personal cell phones)
- What will you tell them? (Status, expected duration, what to do, where to go)
- How often will you update them? (Set a schedule: hourly during active incidents)
Customer Communication
- Who contacts customers? (Sales? Customer service? Executive team?)
- What do you tell them? (Honest, concise, with expected resolution time)
- Through what channels? (Phone, email, social media, website banner)
- When do you escalate? (At what point does a major customer get a personal call from leadership?)
Vendor Communication
- Who contacts your IT provider/MSP? (Don’t assume they know — have a direct escalation path)
- Who contacts your cloud vendors? (Microsoft, Google, hosting providers)
- Who contacts your ISP? (If the outage is network-related)
- Who contacts your insurance carrier? (Cyber insurance notification is often time-sensitive)
Pre-Build These Templates
Write communication templates before you need them. During a crisis, you don’t have time to draft carefully worded messages. Have templates ready for:
- Employee notification (initial and updates)
- Customer notification (initial and updates)
- Social media posts
- Vendor escalation requests
- Insurance carrier notification
5. Vendor and Contact Information
Compile a contact list that includes:
- Internal team members (personal cell phones, not just work numbers)
- IT provider / MSP (escalation paths, after-hours contacts)
- Internet service provider
- Cloud service providers
- Hardware vendors (for emergency replacement)
- Cyber insurance carrier (policy number and claims contact)
- Legal counsel
- Key customers (who need personal notification)
- Building management / facilities
Store this list in multiple locations. If it’s only on the server that just crashed, it’s useless. Keep printed copies, store it in a personal cloud account, and save it on your phone.
6. Step-by-Step Recovery Procedures
For each critical system, document the specific steps to restore it. These procedures should be detailed enough that someone unfamiliar with the system could follow them in an emergency.
For each system, document:
- What the system is and what it does
- Where the backup data is stored and how to access it
- The exact steps to restore (not “restore from backup” but specific, numbered steps)
- How to verify that the restoration was successful
- Who to contact if the procedure doesn’t work
- Any dependencies (what must be restored first?)
Recovery order matters. You can’t restore applications before you restore the server they run on. You can’t restore the server before you restore the network. Document the correct sequence:
1. Network infrastructure (firewalls, switches, routers)
2. Core servers (Active Directory, DNS, DHCP)
3. Critical applications (line-of-business software, databases)
4. Communication systems (email, phones)
5. Secondary systems (file shares, printers, non-critical apps)
6. User workstations
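If you record dependencies for each system, a valid restore sequence can be derived rather than maintained by hand. This sketch uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The dependency map is an assumption made for illustration; your own environment will differ.

```python
from graphlib import TopologicalSorter

# Each key lists what must already be restored before it can be brought back.
# The specific dependencies are assumed for illustration.
DEPENDS_ON = {
    "network infrastructure": set(),
    "core servers": {"network infrastructure"},
    "critical applications": {"core servers"},
    "communication systems": {"core servers"},
    "secondary systems": {"core servers"},
    "user workstations": {"critical applications", "communication systems"},
}


def restore_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a sequence in which every system follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = restore_order(DEPENDS_ON)
```

A side benefit: `static_order()` raises `graphlib.CycleError` when two systems depend on each other, which is exactly the kind of problem you want to find during planning rather than during a recovery.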
7. Alternative Operations Procedures
How does the business operate while systems are being restored? Document workarounds:
- If email is down — use personal email or phone for urgent communication
- If phones are down — forward to cell phones, use a temporary answering service
- If the office is inaccessible — where do people work? Do they have VPN and laptop access?
- If the CRM is down — how do you track customer interactions temporarily?
- If the payment system is down — can you take manual payments and process later?
These workarounds don’t need to be pretty. They need to keep the business running until systems are restored.
Testing Your Disaster Recovery Plan
An untested plan is not a plan. It’s a guess.
The most common disaster recovery failure is the plan that looked great on paper but fell apart in practice because nobody ever tested it. Backups were corrupt. Recovery steps were outdated. Contact information was wrong. The estimated recovery time was wildly optimistic.
Types of DR Testing
Tabletop Exercise (Quarterly)
Gather your recovery team around a conference table and walk through a scenario verbally. “The server is down. What do we do first? Who do we call? How do we restore?” This identifies gaps in the plan without actually touching any systems.
- Time required: 1-2 hours
- Disruption: None
Checklist Walkthrough (Quarterly)
Go through the plan step by step and verify each component: Are the contact numbers still correct? Are the backup locations still accessible? Are the procedures still accurate? Is the documentation still current?
- Time required: 2-4 hours
- Disruption: None
Backup Verification (Monthly)
Actually restore data from your backups and verify it’s complete and usable. Pick a random server or database and perform a test restore to a non-production environment.
- Time required: 2-4 hours
- Disruption: Minimal (test environment only)
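"Complete and usable" can be checked partly by machine. One common approach, sketched below, is to compare SHA-256 checksums of the restored files against a manifest of the originals. The function names are hypothetical, and the comparison is only meaningful if the reference manifest was captured at backup time (files edited since then will show up as changed).

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(source: dict[str, str], restored: dict[str, str]) -> dict[str, list[str]]:
    """Report files that are missing, changed, or unexpected after a restore."""
    return {
        "missing": sorted(set(source) - set(restored)),
        "changed": sorted(
            name for name in source
            if name in restored and source[name] != restored[name]
        ),
        "unexpected": sorted(set(restored) - set(source)),
    }
```

A typical use would be `diff_manifests(saved_manifest, build_manifest(Path("/restore-test")))`, where both the saved manifest and the restore path are placeholders. Matching checksums show the files are intact. They do not show that an application or database actually starts from them, so that still needs a manual check.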
Partial Recovery Test (Semi-Annually)
Restore one or two critical systems from backup to verify the full recovery procedure works. Time the process to validate your RTO estimates.
- Time required: 4-8 hours
- Disruption: Moderate (may need maintenance window)
Full Recovery Test (Annually)
Simulate a complete disaster scenario and restore all critical systems from backup. This is the only way to truly validate your plan and your RTO/RPO targets.
- Time required: 1-2 days
- Disruption: Significant (requires planning and coordination)
What to Document After Every Test
- What worked
- What didn’t work
- Actual recovery times vs. planned RTO
- Data integrity verification results
- Contact list accuracy
- Updated procedures based on findings
- Action items for improvement
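The comparison of actual recovery times against planned RTO is worth recording in a form you can sort. A minimal sketch follows; the `RestoreResult` type and the measured times are hypothetical, while the RTO values reuse the earlier examples.

```python
from dataclasses import dataclass


@dataclass
class RestoreResult:
    system: str
    rto_hours: float      # planned target
    actual_hours: float   # measured during the test


def rto_gaps(results: list[RestoreResult]) -> list[tuple[str, float]]:
    """Return (system, overrun in hours) for each missed RTO, worst first."""
    misses = [
        (r.system, round(r.actual_hours - r.rto_hours, 2))
        for r in results
        if r.actual_hours > r.rto_hours
    ]
    return sorted(misses, key=lambda miss: miss[1], reverse=True)


# Hypothetical measurements from a recovery test.
test_results = [
    RestoreResult("Email", rto_hours=4, actual_hours=3.5),
    RestoreResult("Point-of-sale", rto_hours=1, actual_hours=2.5),
    RestoreResult("Accounting", rto_hours=24, actual_hours=30),
]

gaps = rto_gaps(test_results)
```

Each entry in the result is an action item: either the recovery procedure gets faster or the RTO gets revised to something achievable.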
Common Disaster Recovery Mistakes
1. No Plan at All
The biggest mistake is not having a documented plan. “We know what to do” is not a plan. Knowledge that exists only in someone’s head disappears when that person is on vacation, leaves the company, or is unavailable during the emergency.
2. Backup Without Testing
“We have backups” is meaningless if you’ve never restored from them. We’ve seen businesses discover during an actual disaster that their backups were incomplete, corrupted, or hadn’t been running for months. Test your backups regularly.
3. Single Point of Failure in the Plan
If only one person knows the recovery procedures, or the plan is stored only in one location, or you rely on a single vendor for everything, your plan has a single point of failure. Build redundancy into the plan itself.
4. Unrealistic RTO/RPO Targets
Setting every system to a 15-minute RTO without the budget to support it means your plan is fiction. Be honest about what you can actually achieve with your current infrastructure and budget.
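A quick arithmetic check can expose an unrealistic target before a test does. The sketch below estimates the raw transfer time for a restore from data size and sustained throughput. The figures are hypothetical, and the estimate is a lower bound: it ignores hardware setup, verification, and application configuration.

```python
def estimated_restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Lower-bound transfer time in hours, using 1 GB = 1000 MB."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


def rto_is_plausible(data_gb: float, throughput_mb_per_s: float, rto_hours: float) -> bool:
    """True if the raw transfer alone fits inside the RTO."""
    return estimated_restore_hours(data_gb, throughput_mb_per_s) <= rto_hours


# Hypothetical server: 2 TB of data restored at a sustained 100 MB/s.
hours = estimated_restore_hours(2000, 100)   # about 5.6 hours
fits_15_minutes = rto_is_plausible(2000, 100, 0.25)
```

In this example the transfer alone takes more than five hours, so a 15-minute RTO could only be met with standby infrastructure, not with a restore from backup.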
5. Forgetting Cloud Services
Many businesses assume cloud services don’t need disaster recovery. They do. Microsoft 365 doesn’t guarantee your data against accidental deletion, ransomware, or malicious insiders. Back up your cloud data too.
6. Outdated Contact Information
Contact lists that haven’t been updated in two years are useless in an emergency. Review and update quarterly.
7. No Communication Plan
The technical recovery might go perfectly, but if nobody communicated with employees and customers during the outage, you’ve still failed. Communication is as important as restoration.
8. Ignoring Cyber Incidents
Many DRPs are designed around hardware failures and natural disasters but don’t address ransomware or cyberattacks. Because ransomware is a leading cause of extended downtime for SMBs, your plan must include cyber incident procedures. Those procedures differ from hardware recovery: for example, you need to contain the attack and confirm your backups are clean before you restore anything.
9. Never Updating the Plan
Your IT environment changes constantly: new applications, new servers, new cloud services, new employees. If your plan doesn’t change with it, it becomes increasingly irrelevant. Review and update at least annually, and after any major IT change.
Disaster Recovery and Your IT Provider
Your MSP or IT provider should be deeply involved in your disaster recovery planning. Here’s what to expect from them:
What They Should Provide
- Backup management — configuring, monitoring, and testing your backups
- Recovery procedures — documented, tested steps for restoring your systems
- Monitoring — 24/7 monitoring that detects problems early and triggers the DR process when needed
- Infrastructure — redundant systems, failover capabilities, and cloud recovery options
- Testing support — helping you conduct regular DR tests and drills
- Documentation — keeping technical recovery procedures current
Questions to Ask
- What is our current backup strategy, and does it meet our RTO/RPO requirements?
- When was the last time you tested a restore from our backups?
- What happens if our primary server fails at 2 AM on a Saturday?
- Do you have documented recovery procedures for our specific systems?
- How do you handle a ransomware scenario versus a hardware failure?
- What’s our actual recovery time based on your experience with similar environments?
Red Flags
- They can’t tell you when they last tested your backups
- Recovery procedures aren’t documented
- There’s no after-hours support plan
- They don’t differentiate between hardware failure and cyber incident recovery
- Your data isn’t stored off-site or in an immutable format
The Bottom Line
A disaster recovery plan is your insurance policy for your technology. You hope you’ll never need it, but when you do, it’s the difference between a bad day and a business-ending event.
The essential elements are straightforward:
- Know what matters most (business impact analysis)
- Set realistic recovery targets (RTO and RPO)
- Build a robust backup strategy (3-2-1 minimum)
- Define who does what (roles and responsibilities)
- Plan your communication (internal, customer, vendor)
- Document step-by-step procedures (specific enough for anyone to follow)
- Test regularly (tabletop quarterly, full test annually)
- Keep it updated (review at least annually and after any major IT change)
The businesses that recover quickly from disasters aren’t lucky. They’re prepared. Build and test your plan before you need it.
Need help building or testing your disaster recovery plan? centrexIT helps businesses design backup strategies, document recovery procedures, and conduct DR testing. Contact us to review your disaster readiness.