Control Definition
Information processing facilities must be implemented with enough redundancy to meet the organization's availability requirements. That means identifying what availability each service actually needs, designing duplicated components, systems, or sites to deliver it, and verifying that failover to the redundant elements works.
Control Objective
To keep information processing facilities operating through component, system, or site failures, at the availability levels the business has committed to.
What This Really Means
Redundancy is the practice of buying away single points of failure — and the control's most important word is "requirements". You are not asked to duplicate everything; you are asked to duplicate enough that the availability promises you have made (in customer SLAs, in your business impact analysis, in regulatory commitments) survive the failure of any one part. The design conversation therefore starts with a number, not with hardware.
Redundancy comes in three layers, each markedly more expensive than the one before. Component level (RAID arrays, dual power supplies, dual network cards, UPS) survives the failure of a part inside one machine. System level (clustered servers, load-balanced application tiers, replicated databases) survives the failure of a whole machine or instance. Site level (multi-availability-zone and multi-region cloud deployments, or a secondary data center) survives the loss of an entire facility. A sensible architecture tiers its services and spends accordingly: the revenue-bearing platform may justify multi-region, while the internal wiki gets a nightly backup and a documented tolerance for a day of downtime.
Two disciplines make the difference between real redundancy and a diagram. First, independence: redundant elements must not share a failure domain. Two "redundant" network links that enter the building through the same conduit, two VMs that land on the same physical host, two power feeds from the same substation — these fail together, and finding such shared dependencies is the actual engineering work of this control. Second, failover testing: redundancy that has never been exercised is theoretical. Pull the component, drain the node, evacuate the zone — on a schedule, with the results written down.
Keep the boundary with A.8.13 sharp, because auditors will. Redundancy keeps the service up when things break; backup gets the data back after it is lost or corrupted. Replication is not backup — it copies your deletions and your ransomware faithfully to every replica. The heart of the control at audit time: documented availability requirements per service, an architecture demonstrably matched to them, and evidence that failover has been tested rather than assumed.
Why It Matters
Availability is one third of the CIA triad, and it is the third that customers, regulators, and revenue notice first. Confidentiality failures surface in disclosure letters months later; availability failures surface on status pages within minutes. An organization that has committed to uptime in contracts — explicitly, or implicitly by being the system its users depend on — has already made redundancy promises; this control checks whether the architecture and the spend actually honor them.
The expensive failures here are rarely exotic. They are the single database instance behind a "highly available" application tier, the failover cluster that was never failed over until a real outage exposed a hardcoded IP, the two ISP links that shared a duct a backhoe found, and the DR environment three patch cycles behind production that could not take load when finally asked. Each of these is findable in advance — by mapping failure domains and by testing — which is exactly what the control requires.
Insufficient or untested redundancy exposes the organization to:
- SLA breach and contractual penalties – committed availability percentages turn into service credits, penalty clauses, and renewal-time leverage for customers
- Revenue and operations stoppage – for digital businesses, platform downtime is a direct revenue meter running backwards, plus the recovery cost on top
- Hidden single points of failure – shared conduits, shared hosts, shared upstream providers make paper redundancy fail in pairs
- Failover that fails when it matters – untested redundancy regularly collapses on small dependencies during real incidents, prolonging the outage it was meant to prevent
- Regulatory exposure in critical sectors – financial and infrastructure regulators treat resilience as a supervised obligation, not an internal preference
Regional Compliance Context
Availability is a supervised outcome in Indian financial services: RBI master directions expect regulated entities to define recovery objectives for critical systems and prove them through periodic DR drills, and SEBI's CSCRF sets comparable resilience expectations for market intermediaries — so for BFSI workloads, site-level redundancy and its test records are inspection material, not just ISO evidence. Data-residency rules can also shape the design: where Saudi PDPL, the UAE federal PDPL, or sectoral Indian rules constrain cross-border data transfers, the failover region must satisfy the same residency conditions as the primary. A multi-region architecture that fails over into a non-compliant jurisdiction trades an availability problem for a legal one.
Implementation Guidance
Establish Availability Requirements per Service
Extract the real numbers from customer SLAs, the business impact analysis, and conversations with service owners: target uptime, maximum tolerable downtime, and recovery time objectives. Classify services into availability tiers (for example: critical / important / standard) and get the tiering signed off by the business — it is the basis for every spend decision that follows.
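A target uptime is easier to discuss with service owners once it is expressed as a downtime budget. The following is a minimal sketch of that arithmetic; the tier names and percentages are hypothetical placeholders, not recommended values.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Downtime allowed within the period for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * period_days * MINUTES_PER_DAY


# Hypothetical tiers and targets, for illustration only.
TIER_TARGETS = {"critical": 99.95, "important": 99.5, "standard": 99.0}

for tier, target in TIER_TARGETS.items():
    budget = downtime_budget_minutes(target)
    print(f"{tier}: {target}% allows about {budget:.1f} minutes per 30 days")
```

A 99.9% commitment, for instance, leaves roughly 43 minutes of downtime in a 30-day month, which is a more concrete basis for sign-off than the percentage alone.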
Map Single Points of Failure Across Each Critical Path
Walk the full path of each top-tier service: power, network entry points, hardware, hypervisors, software instances, data stores, supporting services like DNS and identity, sites, and third parties (single ISP, single cloud region, single SaaS dependency). Record every single point of failure in the risk register with an owner and a decision — eliminate, mitigate, or accept.
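The walk-through can be captured in a simple inventory before it goes into the risk register. This sketch assumes each layer of the critical path is recorded with the independent instances that serve it; the `Layer` type, the layer names, and the instance names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One step on a service's critical path and the independent instances serving it."""
    name: str
    instances: list[str]


def single_points_of_failure(path: list[Layer]) -> list[str]:
    """Return the names of layers that would stop the service if one instance failed."""
    return [layer.name for layer in path if len(layer.instances) < 2]


# Hypothetical critical path for one top-tier service.
payment_path = [
    Layer("power", ["feed-a", "feed-b"]),
    Layer("internet", ["isp-1"]),
    Layer("application", ["app-1", "app-2"]),
    Layer("database", ["db-primary"]),
    Layer("dns", ["ns-1", "ns-2"]),
]

print(single_points_of_failure(payment_path))
```

Counting instances only finds the obvious gaps. Two instances that share a failure domain still need the independence check described under the next headings.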
Select the Redundancy Level Each Tier Justifies
Match mechanism to tier: component redundancy (RAID, dual PSU/NIC, UPS) as a baseline for important hardware; system redundancy (clustering, load balancing, database replicas) for services that cannot wait for a rebuild; site redundancy (multi-AZ, multi-region, or a secondary facility) only where the availability requirement genuinely demands it. Document why each tier gets what it gets — proportionality is a feature, not a confession.
Engineer Independence Between Redundant Elements
Verify that redundant elements share no failure domain: separate availability zones, anti-affinity rules so instances never share a host, diverse network paths from different providers entering at different points, independent power feeds. Ask providers for diversity confirmation in writing where it matters — assumed independence is the classic way redundancy fails in pairs.
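Independence checks become repeatable once each redundant element is recorded with the failure domains it depends on. This is a minimal sketch under that assumption; the attribute names (`provider`, `building_entry`, `substation`) and their values are hypothetical.

```python
def shared_failure_domains(a: dict[str, str], b: dict[str, str]) -> dict[str, str]:
    """Return every failure domain in which both elements have the same value."""
    return {domain: a[domain] for domain in a.keys() & b.keys() if a[domain] == b[domain]}


# Hypothetical inventory of two "redundant" network links.
link_a = {"provider": "isp-1", "building_entry": "north-duct", "substation": "sub-7"}
link_b = {"provider": "isp-2", "building_entry": "north-duct", "substation": "sub-9"}

overlap = shared_failure_domains(link_a, link_b)
if overlap:
    print(f"Not independent, shared: {overlap}")
```

Here the two links use different providers but enter through the same duct, so the pair is flagged. The check is only as good as the inventory behind it, which is why written confirmation from providers matters.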
Implement Health Checks and Failover Mechanisms
Automate where the tier justifies it: load-balancer health checks, cluster quorum and automatic failover, DNS-based traffic steering, database replica promotion. Where failover is manual, write the runbook — trigger conditions, decision authority, exact steps, verification — and keep it current. Either way, alert the moment the system is running on its redundant path, because redundancy silently consumed is redundancy you no longer have.
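The logic behind a health check and a "running on the redundant path" alert can be sketched without any network code. The `Pool` class, the three-failure threshold, and the status names below are assumptions for illustration; production load balancers and cluster managers implement their own variants.

```python
class Pool:
    """Tracks probe results and marks a backend down after consecutive failures."""

    def __init__(self, backends: list[str], threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = {name: 0 for name in backends}

    def record(self, name: str, ok: bool) -> None:
        """Record one health probe; a success resets the failure count."""
        self.failures[name] = 0 if ok else self.failures[name] + 1

    def healthy(self) -> list[str]:
        """Backends that have not reached the consecutive-failure threshold."""
        return [n for n, count in self.failures.items() if count < self.threshold]

    def status(self) -> str:
        """'ok' when all backends are up, 'degraded' when some are, 'down' when none."""
        up = len(self.healthy())
        if up == 0:
            return "down"
        if up < len(self.failures):
            return "degraded"
        return "ok"


pool = Pool(["app-1", "app-2"])
for _ in range(3):
    pool.record("app-1", ok=False)  # three failed probes in a row

print(pool.status(), pool.healthy())
```

The "degraded" state is the one worth alerting on: the service is still up, but only because the spare is now carrying it.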
Test Failover on a Schedule and Record the Results
Exercise the redundancy deliberately: pull a component, drain a node, evacuate an availability zone, switch to the DR site. Start in maintenance windows with low-risk services and grow toward production-realistic drills. Measure achieved recovery against the tier's targets, record date, scope, result, and issues, and feed fixes back into the architecture. Coordinate larger exercises with continuity testing under A.5.30.
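A test record needs only a handful of fields to be useful as evidence. The sketch below assumes a hypothetical `FailoverTest` structure holding date, scope, target, measured recovery, and issues; the service name, date, and timings are made-up sample values.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class FailoverTest:
    """One failover exercise, recorded against the tier's recovery target."""
    service: str
    performed_on: date
    scope: str
    rto_minutes: float
    measured_recovery_minutes: float
    issues: list[str] = field(default_factory=list)

    @property
    def met_target(self) -> bool:
        """True when measured recovery was within the recovery time objective."""
        return self.measured_recovery_minutes <= self.rto_minutes


# Sample values for illustration only.
drill = FailoverTest(
    service="payments-db",
    performed_on=date(2024, 1, 15),
    scope="replica promotion in maintenance window",
    rto_minutes=15,
    measured_recovery_minutes=22,
    issues=["application config held a hardcoded primary IP"],
)

print(f"{drill.service}: target met = {drill.met_target}, issues = {len(drill.issues)}")
```

A missed target with a recorded issue, as in this sample, is still good evidence, provided the fix is tracked and the drill is repeated.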
Monitor, Review, and Re-Verify After Change
Track availability per service against its target and review misses. Size redundant capacity with A.8.6 in mind: after a failure the surviving elements must absorb the full load, which is what N+1 sizing exists to guarantee. Re-run the single-point-of-failure analysis after major architecture changes, and review the redundancy tiering annually against current commitments, because contracts change faster than infrastructure.
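The capacity side of this can be checked with simple arithmetic. This sketch assumes capacity and load are expressed in the same unit (for example requests per second) and tests the worst single failure, the loss of the largest node; the function name and figures are hypothetical.

```python
def survives_single_failure(node_capacities: list[float], peak_load: float) -> bool:
    """True if the remaining nodes can carry peak load after the largest node is lost."""
    if len(node_capacities) < 2:
        return False  # a single node has nothing to fail over to
    return sum(node_capacities) - max(node_capacities) >= peak_load


# Two nodes at 75% utilisation each look healthy but cannot absorb a failure.
print(survives_single_failure([100, 100], peak_load=150))

# Three nodes carrying a somewhat higher load can.
print(survives_single_failure([100, 100, 100], peak_load=180))
```

The first case is the common trap: utilisation dashboards show headroom on every node, yet the pair as a whole has none.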
Audit Evidence
During your ISO 27001 certification audit, auditors will expect to see the following evidence of compliance with A.8.14:
Documentation
- Documented availability requirements and service tiering, traceable to SLAs or the business impact analysis
- Architecture diagrams marking redundant components, failure domains, and site-level arrangements
- Failover and DR test records with dates, scope, measured recovery times, and corrective actions
- Availability monitoring reports comparing achieved uptime to committed targets
- Failover runbooks for manual procedures, with version history showing they are maintained
Interviews
- Infrastructure or platform lead on how redundancy levels were chosen and how failure domains were verified as independent
- Service or business owner on what availability was committed to customers and whether the tiering reflects it
- On-call engineer on what actually happens when a node or zone fails — and whether practice matches the runbook
Observations
- Cloud console or cluster configuration showing multi-AZ placement, replicas, and anti-affinity rules in effect
- Load-balancer health checks and the alerting that fires when a service is running on its redundant path
- Artifacts of a recent failover exercise — drill logs, chaos test output, or a live demonstration on a low-risk service
Practitioner Insights

I make a habit of cross-reading customer contracts against architecture diagrams, and the mismatch is a classic management-level failure: sales has committed 99.9% availability while the production database runs as a single instance in a single zone. Certification auditors do the same cross-reading, and so do customers' due-diligence teams. Either the architecture rises to the commitment or the commitment comes down to the architecture — and if leadership consciously accepts the gap, that acceptance belongs in the risk register with a signature, not in a corridor conversation.

In the cloud, redundancy is mostly configuration you have to deliberately switch on and then pay for — multi-AZ flags, a minimum of two instances behind a load balancer, a replica in a second zone. The implementation mistake I see most is buying the redundancy and never pulling the plug: the first real failover then trips over some small dependency nobody noticed, like a hardcoded IP, a single NAT gateway, or a license server that lived in the dead zone. Kill an instance in a maintenance window, watch what actually happens, and keep a one-page record of it. That single exercise is worth more than any diagram.
Common Challenges & Solutions
Challenge
Nobody has defined what availability each service actually requires, so redundancy decisions are guesswork and budget arguments.
Solution
Run a lightweight business impact analysis: for each service, ask the owner what an hour, a day, and a week of downtime costs, and what has been promised externally. Convert the answers into three or four availability tiers with explicit targets, get management sign-off, and let the tiering drive both the architecture and the spend conversation.
Challenge
Redundant elements secretly share a failure domain — both links in one conduit, both VMs on one host, both feeds from one substation — and fail together.
Solution
Treat independence as a verification exercise, not an assumption. Map the physical and logical path of every redundant pair, apply anti-affinity and zone-separation rules in virtualized and cloud environments, source network diversity from genuinely different providers and entry points, and ask suppliers to confirm diversity in writing for the paths that matter most.
Challenge
Failover has never been tested, so the first test is a real outage — and it fails.
Solution
Schedule failover exercises like any other control activity: quarterly or semi-annual drills for top-tier services, starting with low-risk components in maintenance windows and maturing toward zone-evacuation or DR-switchover exercises. Document each test's measured recovery time and defects, and fix the defects before scaling the next drill up.
Challenge
Site-level redundancy for everything is unaffordable, and cost pressure threatens to delete redundancy where it is genuinely needed.
Solution
Spend by tier. Reserve multi-region or secondary-site arrangements for the services whose availability requirements prove the need; for lower tiers, accept measured downtime with a documented risk acceptance and rely on tested restores under A.8.13 instead. An explicit, signed decision to tolerate downtime on the wiki is good governance; an implicit single point of failure on the payment platform is the finding.
Challenge
Redundancy decays as the estate changes — new services launch single-instance, and yesterday's resilient architecture quietly grows new single points of failure.
Solution
Gate it through change management (A.8.32): every new service declares its availability tier and the redundancy that tier mandates before go-live, using reference architectures with redundancy defaults built in. Re-scan for single points of failure after major changes and at least annually, and alert on configurations that drift below tier requirements.
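Drift alerting can start as a comparison between each service's declared tier and its deployed configuration. The sketch below assumes a hypothetical inventory format and hypothetical tier minimums; in practice the minimums come from the signed-off tiering and the inventory from the cloud or cluster API.

```python
# Hypothetical minimums per tier; real values come from the signed-off tiering.
TIER_MINIMUMS = {
    "critical": {"instances": 2, "zones": 2},
    "important": {"instances": 2, "zones": 1},
    "standard": {"instances": 1, "zones": 1},
}


def drift_findings(services: list[dict]) -> list[str]:
    """List every setting that has fallen below what the service's tier mandates."""
    findings = []
    for svc in services:
        for key, minimum in TIER_MINIMUMS[svc["tier"]].items():
            if svc[key] < minimum:
                findings.append(
                    f"{svc['name']}: {key}={svc[key]}, tier '{svc['tier']}' requires >= {minimum}"
                )
    return findings


# Hypothetical inventory snapshot.
inventory = [
    {"name": "payments", "tier": "critical", "instances": 2, "zones": 1},
    {"name": "wiki", "tier": "standard", "instances": 1, "zones": 1},
]

for finding in drift_findings(inventory):
    print(finding)
```

The same check can run as a go-live gate for new services and on a schedule for existing ones, so that a service launched single-zone is caught before an outage reveals it.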