High Availability and Disaster Recovery: Mechanisms and Data Recovery

Slides on High Availability and Disaster Recovery. The PDF explores HA and DR mechanisms, including full, differential, and incremental backup types. This university-level Computer Science material provides an overview of operational continuity and data recovery, useful for self-study.

59 Pages

High Availability and Disaster Recovery
DOMAIN 3.0
MODULE 12

High Availability and Disaster Recovery Topics

  • HA and DR Concepts
  • High Availability Mechanisms
  • Disaster Recovery Mechanisms
  • Facility and Infrastructure Support

HA and DR Concepts

High Availability

A system, network, or service that is continuously operational for a desirable length of time. Availability is measured in "9s":

  • 90% = "one nine": 36.5 days down in a year
  • 99% = "two nines": 87.6 hours (~3.65 days) down in a year
  • 99.9% = "three nines": 8.76 hours down in a year
  • 99.99% = "four nines": ~53 minutes down in a year
  • 99.999% = "five nines": ~5 minutes down in a year
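The downtime figures above follow directly from the availability percentage. A minimal sketch, assuming a 365-day year (the function and constant names are illustrative, not from the slides):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the downtime per year, in hours, implied by an availability figure."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


for percent in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each extra nine divides the allowed downtime by ten, which is why the step from four nines to five nines is usually the expensive one.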

High Availability Mechanisms

High Availability Mechanisms include:

  • Fault tolerance
  • Redundant components or systems
  • Load balancing
  • Clustering
  • NIC teaming
  • Port aggregation - 99.999% uptime

Fault Tolerance

Capability of a system or network to provide uninterrupted service if one or more of its components fail.

No single point of failure:

  • Avoids losing data or connectivity
  • Failover is rapid and automatic

[Figure: several systems, each running Service 1, Service 2, and Service 3.]

Mean Time To Failure (MTTF)

One of many metrics used to evaluate the reliability of a manufactured product

  • Usually published by the manufacturer
  • Used to estimate the normal lifespan of products that are not repairable, or whose repair cost exceeds replacement cost
  • Will be shortened by improper usage
  • Typically these devices are field replaceable units (FRUs)
  • If possible, replace before actual failure

Field Replaceable Units

Examples of FRUs that should be replaced, not repaired:

  • Hard drives
  • Fans
  • Video cards
  • Motherboards
  • CPUs
  • RAM
  • Surge protectors
  • Power supplies
  • Other peripherals such as monitor, keyboard, mouse, card readers, CD/DVD drives
  • Cables (if they have reached expected lifespan and are starting to break down)

Mean Time Between Failures (MTBF)

Refers to repairable devices

  • How long the device/system is expected to function until its first failure
  • How long after the first repair before the device is expected to fail again
  • Can (hopefully) be extended by proper maintenance
  • Estimates only, but important in planning, implementation, maintenance, and future plans

Mean Time To Repair (MTTR)

How long it will take to repair a device, system, or component that is down and bring it back online

  • Assumes that the device/system can be repaired
  • Critical metric in planning data center/cloud/system configuration and future configurations

Differentiating Between MTBF and MTTR

[Figure: timeline running from a system failure, through resuming normal operations, to the next system failure, with the spans labeled Time to Repair, Time to Failure, and Time Between Failures.]
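These metrics tie back to availability. The slides do not give a formula; a commonly used steady-state estimate, shown here as a sketch with illustrative names, is the share of each failure cycle the system spends running (mean uptime divided by mean uptime plus mean repair time):

```python
def availability_percent(mean_uptime_hours: float, mttr_hours: float) -> float:
    """Estimate availability as the share of each failure cycle spent running.

    mean_uptime_hours: average running time between a repair and the next failure.
    mttr_hours: average time to repair and bring the system back online.
    """
    cycle_hours = mean_uptime_hours + mttr_hours
    return 100 * mean_uptime_hours / cycle_hours


# Hypothetical device: runs about 2,000 hours between failures, 2 hours per repair.
print(f"{availability_percent(2000, 2):.3f}%")
```

The estimate makes the trade-off visible: halving the repair time improves availability about as much as doubling the time between failures.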

Recovery Time Objective (RTO)

When restoring a system, the maximum allowable time that can elapse before the system is available again

[Figure: timeline marking the target time to be back online, which includes the repair time.]

Recovery Point Objective (RPO)

When recovering a system, the level of original functionality or data to be restored before bringing the system back online. It is usually expressed as the point in time to which data must be recovered.

  • Sometimes data is lost, so you cannot fully recover the system to its original state
  • Sometimes you sacrifice level of recovery in the interest of quickly making the system available again
  • RPO is often used in database recovery
    ◦ Identifies the last saved transaction to be restored before the database is made available again
    ◦ Any transactions after the RPO will have to be manually re-entered

[Figure: functionality/data level over time, marking how much the system had before going down, the acceptable loss, and how much to restore before bringing the system back online.]
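The database case can be sketched as a split of the transaction log at the recovery point. The log entries, timestamps, and function name below are hypothetical, chosen only to illustrate the idea:

```python
from datetime import datetime


def split_at_recovery_point(transactions, recovery_point):
    """Split (timestamp, description) pairs into restored and lost transactions."""
    restored = [tx for tx in transactions if tx[0] <= recovery_point]
    to_reenter = [tx for tx in transactions if tx[0] > recovery_point]
    return restored, to_reenter


# Hypothetical transaction log and last saved point.
transaction_log = [
    (datetime(2024, 1, 8, 9, 0), "order 1001"),
    (datetime(2024, 1, 8, 9, 30), "order 1002"),
    (datetime(2024, 1, 8, 10, 15), "order 1003"),
]
last_saved = datetime(2024, 1, 8, 9, 45)

restored, to_reenter = split_at_recovery_point(transaction_log, last_saved)
print("restored:", [name for _, name in restored])
print("re-enter manually:", [name for _, name in to_reenter])
```

Everything after the recovery point is the acceptable loss: it is not restored and has to be entered again by hand.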

High Availability Mechanisms

Load Balancing

  • Two or more systems simultaneously provide the same service
  • If one node fails, the other node(s) continue to provide service
  • Especially good resilience against denial-of-service attacks
  • All systems have their own IP address, but share a common virtual IP address
  • Clients connect to the virtual IP address
  • Systems do NOT share a common database/data files
  • Systems are typically "front end" web sites that "point" to a common back end database server
  • Can be a hardware or software solution

Load Balancing Example

[Figure: load balancing cluster. The client connects to virtual IP 192.168.1.10. All front end webserver nodes (192.168.1.20, 192.168.1.30, 192.168.1.40) are active and point to a back end database server.]
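One simple way to spread client requests across the front end nodes is round-robin selection that skips failed nodes. The sketch below assumes that policy (the slides do not name one) and uses a hypothetical `RoundRobinBalancer` class with the node addresses from the example:

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out nodes in turn, skipping any node marked as down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._rotation = cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick_node(self):
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        # One full pass over the rotation is enough to reach a healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._rotation)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["192.168.1.20", "192.168.1.30", "192.168.1.40"])
print([balancer.pick_node() for _ in range(3)])
balancer.mark_down("192.168.1.30")  # simulate a node failure
print([balancer.pick_node() for _ in range(3)])
```

Clients would still address the virtual IP only; the balancer decides which node answers, so a failed node simply stops receiving traffic.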

Multipathing

A generic term for any redundant network path. Can refer to:

  • Multiple links between servers and SAN storage (the most common usage)
  • Redundant links at the same network layer
  • Redundant links between network layers
  • Multiple links to the same ISP
  • Multiple ISPs

Multipathing Examples

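[Figure: two multipathing examples. First, a server (10.10.10.21 and 10.10.20.21) reaches SAN A (10.10.10.51, 10.10.20.51) and SAN B (10.10.10.52, 10.10.20.52) through a switch over VLAN1 (10.10.10.0/24) and VLAN2 (10.10.20.0/24). Second, access, distribution, and core layers with a redundant path to the core.]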

NIC Teaming

Network Interface Card teaming combines multiple NICs/connections to create a single "link"

  • Aggregates bandwidth
  • Increases performance
  • Provides fault tolerance

Also known as aggregation, balancing, and bonding

Clustering

Redundant hardware acting together as a unit

  • Two or more systems provide a single service
  • Systems typically share a common database/files
  • All systems have their own IP address, but also share a common virtual IP address
  • Clients connect to the virtual address

Active/Passive:

  • One system is active
  • The other system is in standby (passive) mode
  • Passive system listens to the "heartbeat" of the active system
  • If it stops hearing the heartbeat, the passive system takes control of the data/database/service

Active/Active:

  • Both systems are active
  • Each system is the primary provider for a different service (e.g. SQL and Exchange)
  • Each system acts as backup for the other
  • Either system can take over both services

Active / Passive Clustering Example

[Figure: active/passive cluster. The client connects to cluster IP 192.168.1.10. The active node (192.168.1.20) and the passive node (192.168.1.30) are both attached to shared storage.]
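The heartbeat-based takeover in an active/passive cluster can be sketched as a timeout check on the passive node. The class name, the timeout value, and the use of plain numeric timestamps are assumptions made for illustration; real cluster software adds quorum and fencing on top of this idea:

```python
class PassiveNode:
    """Standby node that takes over when the active node's heartbeat goes silent."""

    def __init__(self, heartbeat_timeout_seconds: float):
        self.timeout = heartbeat_timeout_seconds
        self.last_heartbeat = None  # time of the most recent heartbeat, if any
        self.is_active = False

    def receive_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Return True once this node has taken control of the service."""
        if self.is_active:
            return True
        silent_too_long = (
            self.last_heartbeat is not None
            and now - self.last_heartbeat > self.timeout
        )
        if silent_too_long:
            self.is_active = True  # take over the shared data and the cluster IP
        return self.is_active


standby = PassiveNode(heartbeat_timeout_seconds=3.0)
standby.receive_heartbeat(now=0.0)
print(standby.check(now=2.0))  # heartbeat is recent, stay passive
print(standby.check(now=4.0))  # silent for longer than the timeout, take over
```

Passing the current time in explicitly keeps the sketch deterministic; a running node would read a monotonic clock instead.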

Active / Active Clustering Example

[Figure: active/active cluster. The client connects to cluster IP 192.168.1.10. Node 192.168.1.20 is the active node for App 1 and the passive node for App 2; node 192.168.1.30 is the active node for App 2 and the passive node for App 1. Both nodes are attached to shared storage.]

Redundant Switches Example

[Figure: access, distribution, and core switch layers, highlighting potential single points of failure, alongside a design labeled "Too Much Redundancy!"]

Port Aggregation

Logical aggregation of Ethernet switch ports

  • Used to increase the bandwidth of a "single" link
  • Commonly used in uplinks / trunk links
  • Also referred to as EtherChannel
  • Two common methods:
    ◦ Cisco proprietary PAgP
    ◦ Vendor-neutral LACP (IEEE 802.3ad / 802.1AX)

Port Aggregation Example

[Figure: physical view, multiple ports defined as part of an EtherChannel group; logical view, the different subsystems running on the switch see only one large link.]

Redundant Routers

You can cluster routers

  • First Hop Redundancy Protocols (FHRP)
    ◦ A class of mechanisms that allow default gateway redundancy
  • Virtual Router Redundancy Protocol (VRRP)
    ◦ Standards-based FHRP
    ◦ Active and Standby routers are organized into a Standby group
    ◦ They share a virtual IP and virtual MAC
    ◦ Active router is configured with a higher priority so it is preferred
    ◦ The standby router has a lower priority, but can take over for the active at any time
  • Cisco has proprietary FHRPs as well:
    ◦ HSRP: very similar to VRRP
    ◦ GLBP: like HSRP, but also allows Active-Active load balancing

[Figure: active and standby routers between the Internet and a client, sharing a VRRP virtual IP and virtual MAC. The client's default gateway is the virtual IP and virtual MAC.]
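The priority rule can be illustrated with a small election sketch: among the routers that are still up, the one with the highest priority answers for the virtual IP. This is a simplified model with hypothetical names and priority values, not the protocol itself; it ignores advertisement timers, preemption settings, and tie-breaking:

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    priority: int
    alive: bool = True


def elect_active(routers):
    """Return the live router with the highest priority, or None if all are down."""
    candidates = [router for router in routers if router.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda router: router.priority)


# Hypothetical standby group sharing one virtual IP and virtual MAC.
group = [Router("router-a", priority=120), Router("router-b", priority=100)]
print(elect_active(group).name)  # the higher-priority router is preferred

group[0].alive = False  # the active router fails
print(elect_active(group).name)  # the standby takes over the virtual IP
```

Because clients only know the virtual IP and virtual MAC, the change of active router needs no reconfiguration on their side.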

Redundant Routes

  • The routing protocol must decide the best path
  • More redundancy = more fault tolerant = more expensive

Redundant Firewalls

[Figure: redundant firewalls on the path to the ISP.]

Multiple ISPs / Diverse Paths

  • Configure ISP 2 as a standby link
  • Or load balance between the two

[Figure: a network connected to both ISP 1 and ISP 2.]

Disaster Recovery Mechanisms

Backups

Protect against hardware failures, data loss, corruption, disasters (manmade or natural), etc.

Backup Types:

  • Full, differential, and incremental
  • Snapshots
  • OS and configurations (network devices)

Backup Destinations:

  • Local disks
  • Network shares
  • Cloud

Windows Server Backup Tool

[Screenshot: Windows Server Backup (wbadmin) console, Local Backup view. The Actions pane offers "Backup Schedule...", "Backup Once...", and "Recover...". The message list shows a successful backup on 3/8/2018 at 10:55 PM; no next backup is scheduled.]

Full Backup

The most basic and complete type of backup

  • Backs up all selected data to another set of media (cloud, network share, local disk, tape)
  • Provides a foundation for the other backup types
  • Clears the file's archive bit
  • Longest backup time
  • Shortest restore time

[Figure: full backup on Sunday.]

Differential Backup

  • Copies all data changed since the last full backup
  • Does not change the archive bit
  • Can be thought of as a "running backup"
  • Typically gets larger over time until the next full backup
  • Backup takes longer each day as the week goes by
  • Restore takes less time, as you only need the full plus the latest differential

[Figure: full backup on Sunday, followed by differential backups from Monday through Friday that grow each day.]

Incremental Backup

  • Copies only the data that has changed since the last full or incremental backup
  • Restore requires the full backup plus all subsequent incremental backups
  • Clears the archive bit
  • Fastest type of backup
  • Longest restore
    ◦ You will have to restore the full
    ◦ Plus every incremental in order
  • Takes up less storage space

[Figure: full backup on Sunday, followed by incremental backups from Monday through Friday.]
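The difference between the three backup types comes down to which files are selected and whether the archive bit is cleared afterwards. A minimal sketch, modeling each file as a name mapped to its archive bit (the file names and the `run_backup` helper are illustrative):

```python
def run_backup(files, kind):
    """Return the names backed up and clear archive bits where the type requires it.

    files: dict mapping file name -> archive bit (True means changed since the
    bit was last cleared).
    kind: "full", "differential", or "incremental".
    """
    if kind not in ("full", "differential", "incremental"):
        raise ValueError(f"unknown backup type: {kind}")
    if kind == "full":
        selected = sorted(files)  # everything, regardless of the archive bit
    else:
        selected = sorted(name for name, changed in files.items() if changed)
    if kind in ("full", "incremental"):
        for name in selected:
            files[name] = False  # clear the bit; a differential leaves it set
    return selected


files = {"a.txt": True, "b.txt": True, "c.txt": True}
print("Sunday full:", run_backup(files, "full"))

files["a.txt"] = True  # a.txt is modified on Monday
print("Monday differential:", run_backup(files, "differential"))

files["b.txt"] = True  # b.txt is modified on Tuesday
print("Tuesday differential:", run_backup(files, "differential"))
```

Because the differential leaves the bit set, Tuesday's run picks up Monday's change again, which is why differentials grow through the week. Replacing both runs with "incremental" would capture only a.txt on Monday and only b.txt on Tuesday, at the cost of needing every incremental for a restore.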

Snapshot

  • An image of the state of a VM at a point in time
  • Used to restore a virtual machine to a previous state
  • Typically requires the base image plus the snapshot(s)
  • As with backups, you can revert to the snapshot of your choice
  • Should not be your only backup solution

[Figure: snapshot (SP) tree built on a Windows operating system baseline, with IE base and Firefox base snapshots, later snapshots SP1 and SP2, and a "You Are Here" marker on the current state.]
