Phrases like 'five nines' and '24x7x365' have become well worn over recent years. But the truth remains that not having a plan in place to ensure system availability can prove more costly than creating and implementing one.
For small and large enterprises alike, business continuance plans have become a necessity. The plan also has to be a sound one, because no company can afford a poorly executed plan: if a disaster recovery plan is ever needed, it had better work right the first time.
The stakes are higher today than ever before. According to the Fibre Channel Industry Association, system downtime costs can run into the millions of dollars per hour, ranging from $14,500/hour in lost automated teller machine (ATM) fees for a banking institution to as much as $6.45 million/hour from disrupted operations at a stock brokerage firm.
To improve business continuance, Storage Area Networks (SANs) are being incorporated into enterprise systems utilizing a combination of redundant components, connections, software, and configurations, to minimize or eliminate single points of failure.
By reducing or eliminating single points of failure in enterprise environments, SANs help improve the overall availability of business applications. This high availability is achieved not through a single product, but through a comprehensive, fault-tolerant system design that includes all the components in the SAN and supports 24/7 uptime requirements. Delivering a high availability environment through a SAN requires establishing availability objectives, creating fault tolerance, and implementing an intelligent SAN infrastructure and fabric management.
With the increasing importance of the Internet and global e-business applications, more and more companies are implementing computing infrastructures specifically designed for at least 99.999 percent (the "five nines") availability, or the equivalent of less than 5.3 minutes of downtime a year.
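The arithmetic behind the "five nines" figure is easy to check. The following is a minimal sketch, not a sizing tool: the function name and the choice of a 365.25-day average year are assumptions made for illustration.

```python
# Downtime budget implied by an availability percentage.
# Assumption: an average year of 365.25 days (leap days included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the allowed downtime, in minutes per year, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the budget for two through five nines.
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {annual_downtime_minutes(pct):.2f} minutes/year")
```

Each additional nine cuts the yearly downtime budget by a factor of ten, which is why the step from 99.99 to 99.999 percent usually requires design changes rather than incremental tuning.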
Availability is a function of the frequency of outages (from unplanned failures or scheduled maintenance and upgrades) and the time to recover from those outages. Companies must identify specific availability requirements and predict potential failures in order to create a high availability solution that meets the needs of the organization. Objectives vary widely both among and within companies - some can tolerate no disruption, while others may be only minimally affected by short outages.
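One common way to express this relationship is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The article does not name a formula, so treat the sketch below as an illustrative assumption; the function name and sample figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical device: one outage every 10,000 hours on average.
    slow_repair = availability(10_000, 4.0)   # four-hour recovery
    fast_repair = availability(10_000, 0.5)   # thirty-minute recovery
    print(f"4 h recovery:   {slow_repair:.6f}")
    print(f"0.5 h recovery: {fast_repair:.6f}")
```

The comparison shows why faster recovery matters as much as fewer failures: with the outage frequency held constant, shortening the repair time alone raises availability.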
To address this uptime issue, many companies are now implementing networked fabrics of Fibre Channel devices to ensure a high-performance storage environment. These flexible SANs incorporate fault tolerance through redundancy, mirroring, hot-plugging capabilities, and no single points of failure. They also speed recovery through simplified fault monitoring, diagnostics, and non-disruptive server/storage maintenance and repair. The use of intelligent routing and rerouting, coupled with dynamic failover protection, minimizes human intervention during failover events.
Achieving True Fault Tolerance
One of the most effective ways to increase system availability is to implement fully redundant SANs consisting of alternate devices, data paths, and configurations. Particularly important is ensuring dual paths through separate components. This is especially true when physical location and disaster tolerance are a concern, because a single device cannot adequately address these issues.
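The benefit of dual paths through separate components can be estimated with basic reliability arithmetic. The sketch below assumes component failures are independent, which shared power, shared firmware bugs, or a shared site can violate; the function names and the 99.9 percent component figures are hypothetical.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a chain in which every component must work."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant alternatives in which at least one must work."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical path: HBA -> switch -> storage port, each 99.9% available.
    single_path = series(0.999, 0.999, 0.999)
    # Two such paths through separate components, assumed to fail independently.
    dual_path = parallel(single_path, single_path)
    print(f"single path: {single_path:.6f}")
    print(f"dual path:   {dual_path:.6f}")
```

A chain is always less available than its weakest link, while a second independent path multiplies the probabilities of failure together, which is where most of the gain from redundancy comes from.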
For better availability, the focus shifts from servers to applications. Mission-critical applications should reside on highly available servers and storage devices so data can be accessed even during a failure. Sophisticated software enables application or host failover by moving workload to a secondary server, and clustering technology transfers workload to multiple active servers without disrupting data flow.
To further improve availability, servers should include redundant hardware components with dual power supplies, network connections and mirrored system disks. Servers should have multiple connections to alternate storage devices through Fibre Channel switches and a minimum of two independent connections to the SAN. In addition, these servers should feature dual-active or hot-standby configurations with automatic failover capabilities.
Another likely failure point for system availability is the path between the server and storage, including Host Bus Adapters (HBAs), cabling, fabrics or storage connections. Dual-redundant HBA configurations help ensure path availability and boost performance through the additional SAN connectivity.
For true fault tolerance, multiple paths must be connected to alternate locations within the SAN: to different switches in a multi-switch fabric, to different blades within a core fabric switch, or even to different switch modules of an integrated fabric. To provide full redundancy, some companies choose a dual-SAN configuration. Server-based path-failover software typically allows a dual-active configuration for dividing workload between multiple HBAs. The software monitors the "health" of available storage products, servers, and physical paths, and automatically reroutes data traffic to an alternate path when failures occur.
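The behavior described above can be sketched in a few lines. This is a simplified illustration, not any vendor's multipathing driver: the class names, path labels, and round-robin policy are all assumptions, and real software would detect failures itself rather than being told about them.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str              # hypothetical label, e.g. "hba0->switchA"
    healthy: bool = True


class PathSelector:
    """Dual-active path selection: round-robin over healthy paths only."""

    def __init__(self, paths):
        self.paths = list(paths)
        if not self.paths:
            raise ValueError("at least one path is required")
        self._next = 0     # index at which the next search starts

    def set_health(self, name: str, healthy: bool) -> None:
        """Record the result of a health check for the named path."""
        for path in self.paths:
            if path.name == name:
                path.healthy = healthy
                return
        raise KeyError(name)

    def select(self) -> Path:
        """Return the next healthy path, skipping any that have failed."""
        for offset in range(len(self.paths)):
            idx = (self._next + offset) % len(self.paths)
            if self.paths[idx].healthy:
                self._next = (idx + 1) % len(self.paths)
                return self.paths[idx]
        raise RuntimeError("no healthy path to storage")


if __name__ == "__main__":
    selector = PathSelector([Path("hba0->switchA"), Path("hba1->switchB")])
    print([selector.select().name for _ in range(4)])   # load is shared
    selector.set_health("hba0->switchA", False)          # simulated failure
    print([selector.select().name for _ in range(4)])   # traffic rerouted
```

While both paths are healthy the selector alternates between them, which is the workload-sharing half of a dual-active configuration; once one path is marked failed, every request goes to the survivor without the caller changing anything.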
Many of today's storage devices feature multiple connections to the SAN, a critical requirement for fault tolerance in storage solutions. This guards against failures from a damaged cable, controller, or SAN component such as a Gigabit Interface Converter (GBIC) optical module. Mirroring storage subsystems on a peer-to-peer basis across the fabric also creates highly available storage connections for fault tolerance. Combining mirroring with switch-based routing algorithms (to route traffic around path breaks within the SAN fabric) creates a resilient, self-healing environment for the most demanding enterprise storage requirements. The mirrored subsystems provide an alternate access point to data regardless of path conditions.
Putting it All Together in the SAN Infrastructure
The SAN infrastructure itself is one of the storage network's most critical components for ensuring system availability. Fibre Channel switches are extremely reliable, especially when they feature hot-pluggable, redundant power supplies and cooling, plus hot-pluggable GBIC modules that enable single-port optics replacement without impacting other working devices.
The industry's top switches have "five nines" availability and include redundant components that further increase system uptime. Integrated fabric products go a step further by combining highly reliable switch modules, redundant-path architectures, and path-failure rerouting within the fabric. The integrated fabric can be incorporated into a larger SAN either as a core connectivity point or as an edge solution to address higher port-count requirements.
As scalability requirements grow, companies can incorporate higher-end 2 Gbps core products that improve scalability, management, and multiprotocol support. (See Fig. 1.) Core switches also protect investments by supporting future Fibre Channel technologies and alternative edge technologies. In addition, they add capabilities like frame filtering, which extends centralized zoning down to the logical unit number (LUN) of the storage, improves storage resource sharing, and provides advanced performance monitoring and improved security.
Networking in the Fabric
Networks go beyond redundancy to provide a more resilient infrastructure than is possible with single-point products. With an infrastructure of switches, administrators can grow their network to meet high port count needs. Networking a fabric of switches increases availability, design flexibility and "pay-as-you-grow" scalability.
One of the easiest ways to increase availability in the SAN is to use a meshed-tree networking topology. In this topology, devices are connected to edge switches, which are then connected to central interconnecting switches, which are in turn connected to other parts of the SAN or other devices. The network can be scaled to provide higher bandwidth and redundant connectivity.
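A topology like this can be checked for single points of failure by removing each switch in turn and testing whether a server can still reach its storage. The sketch below does this with a breadth-first search; the node names and the example layout (two edge switches per side, two core switches) are hypothetical.

```python
from collections import deque

# Hypothetical meshed tree: server and storage are each dual-attached to edge
# switches, and every edge switch links to both core switches.
MESHED_LINKS = [
    ("server", "edge1"), ("server", "edge2"),
    ("edge1", "core1"), ("edge1", "core2"),
    ("edge2", "core1"), ("edge2", "core2"),
    ("core1", "edge3"), ("core1", "edge4"),
    ("core2", "edge3"), ("core2", "edge4"),
    ("edge3", "storage"), ("edge4", "storage"),
]


def build_graph(links):
    """Turn a list of bidirectional links into an adjacency map."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, ignoring the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, server, storage):
    """List the intermediate nodes whose loss cuts the server off from storage."""
    graph = build_graph(links)
    return sorted(
        node for node in graph
        if node not in (server, storage)
        and not reachable(graph, server, storage, removed=node)
    )


if __name__ == "__main__":
    print(single_points_of_failure(MESHED_LINKS, "server", "storage"))
    # Drop the server's second attachment to see the effect of single attachment.
    single_attached = [link for link in MESHED_LINKS if link != ("server", "edge2")]
    print(single_points_of_failure(single_attached, "server", "storage"))
```

In the fully meshed layout no single switch appears in the result, while the single-attached variant reports the one edge switch the server depends on.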
Although high-availability devices within an enterprise SAN contribute to the overall availability of the entire system, they do not guarantee it. True high availability can only be achieved through an end-to-end system composed of highly available components and devices, as well as fault-tolerant capabilities. Dual attachment of servers and storage devices to a single fabric enables workload sharing while avoiding system disruption from any single failure. Because the switches run their own independent firmware and don't share memory, the risk of a single switch impacting the entire network is reduced.
To help prevent localized failures from impacting the entire fabric, SAN sections can be isolated through the use of zoning, in which defined zones limit access between devices within the SAN fabric. This is especially effective as companies build larger SANs with heterogeneous operating systems and storage systems. Companies can specify different availability criteria at the connection, node, and network level to address the potential impact of certain types of outages. Zoning limits the types of device interactions that might cause failures. Hardware zoning provides the most secure method, especially when hardware is available across the enterprise fabric. Software zoning provides a more flexible but less secure approach.
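At its core, zoning is a membership rule: two devices may interact only if they share at least one zone. The sketch below models only that rule, not how a switch enforces it in hardware or software; the zone names and device labels are hypothetical.

```python
# Hypothetical zone set: each zone name maps to the devices allowed to interact.
ZONES = {
    "unix_zone": {"unix_server", "array_a"},
    "windows_zone": {"windows_server", "array_b"},
    "backup_zone": {"unix_server", "windows_server", "tape_library"},
}


def can_communicate(zones, device_a: str, device_b: str) -> bool:
    """Two devices may interact only if at least one zone contains both."""
    return any(
        device_a in members and device_b in members
        for members in zones.values()
    )


if __name__ == "__main__":
    print(can_communicate(ZONES, "unix_server", "array_a"))       # same zone
    print(can_communicate(ZONES, "windows_server", "array_a"))    # no shared zone
    print(can_communicate(ZONES, "windows_server", "tape_library"))
```

Because a device can belong to several zones, a shared resource such as the tape library in this example can be reached by both servers while each server's primary array remains isolated from the other.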
The Key to High System Availability
Achieving higher availability through redundancy and fault tolerance begins with a thorough understanding of specific system uptime requirements and a solution designed to address them. Complete system outages can only be avoided by eliminating all potential single points of failure through redundancy of components, devices, connections, and paths. Multiple connectivity paths, clustering techniques, and dual fabrics all contribute to a fault-tolerant solution, and by physically separating devices, administrators can protect against localized physical disasters. Additionally, networks of fabric switches and core switches are less vulnerable to localized disasters that might impact the fault tolerance of the entire system. Together, these measures help organizations become more efficient and reliable and achieve true high system availability.
About the Author: Derek Granath is director of product marketing at Brocade, and responsible for product management of the Brocade switching platforms. During his career at Brocade, Granath has been instrumental in defining and launching several new product lines for the entry-level and enterprise market segments. Granath holds a BS in Electrical Engineering from Stanford University and an MBA from Santa Clara University.