Businesses and organizations usually entrust their IT services to large data centers. The data centers, in turn, are responsible for providing the highest level of reliability and availability to business owners. Field studies put the annual market cost of downtime and data loss at over $100 billion. Even a few minutes of service downtime can damage a business's reputation and, in some cases, lead to bankruptcy. Many events can threaten the continuous service and reliability of a storage system, including unmanaged events such as human errors, processor and board failures, power failures, and network failures, and managed events such as updates, upgrades, backups, and recovery. Storage designers should be aware of the possibility of these incidents and ensure minimum downtime during both managed and unmanaged events.
SAN storage systems are expected to provide continuous 24/7 service, and storage manufacturers should report the availability their products provide. Storage availability depends on the availability of all hardware and software components in the storage stack and the data center. The table below shows the conventional metrics used to report storage system availability:
Availability % | Unavailability % | Downtime per Year | Downtime per Week |
---|---|---|---|
99% | 1% | Less than 4 days | Less than 2 hours |
99.9% | 0.1% | Less than 9 hours | About 10 minutes |
99.99% | 0.01% | Less than 1 hour | About 1 minute |
99.999% | 0.001% | Less than 6 minutes | About 6 seconds |
99.9999% | 0.0001% | About 32 seconds | About 0.6 seconds |
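The downtime figures follow directly from the unavailability fraction multiplied by the length of the period. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function and constant names are illustrative and do not come from any storage product.

```python
# Illustrative helper: converts an availability percentage into allowed downtime.
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(availability_percent, period_seconds):
    """Return the downtime (in seconds) implied by an availability level."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * period_seconds


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
        per_week = downtime_seconds(pct, SECONDS_PER_WEEK)
        print(f"{pct}%: {per_year:.1f} s/year, {per_week:.2f} s/week")
```

For example, 99.999% availability ("five nines") works out to roughly 5.3 minutes of downtime per year.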
A data storage system should provide continuous service by tolerating hardware and software failures through appropriate redundancy and dependability mechanisms.
The availability of data storage systems can be improved by mechanisms that decrease the component failure rate, limit the effect of failures at the system level, shorten the repair/recovery time, and remove single points of failure (SPFs). Disaster recovery mechanisms such as remote backups and mirrors also contribute.
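The effect of these mechanisms can be illustrated with the standard steady-state availability model. This is a sketch under simplifying assumptions: components fail independently, availability is taken as MTBF / (MTBF + MTTR), and all numbers in the usage note are hypothetical. A chain of components that are all required behaves as a series system, while redundant components behave as a parallel system.

```python
from math import prod


def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability of one component from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(availabilities):
    """All components are required (each is a single point of failure)."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """The system is up while at least one redundant component is up (independent failures assumed)."""
    return 1.0 - prod(1.0 - a for a in availabilities)
```

With a hypothetical component at 99% availability, two of them in series give about 98.01%, while two in a redundant pair give about 99.99%. Lowering the MTTR or raising the MTBF in `steady_state_availability` shows the benefit of faster recovery and lower failure rates in the same way.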
The disk subsystem of a SAN storage system is composed of different components, each of which can be an SPF. The failure of any one component therefore results in the failure of the entire system, so fault tolerance mechanisms should be applied to all components of the disk subsystem. In the following, we note some major fault tolerance mechanisms:
With LUN masking, the access of other hosts to a virtual space is restricted, which prevents those hosts from unintentionally modifying or removing data.
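Conceptually, LUN masking amounts to an access table that maps each LUN to the set of host initiators allowed to see it, with everything else denied by default. The sketch below is a toy model of that idea only; the class name, method names, and identifiers are hypothetical and do not correspond to any vendor's management interface.

```python
class LunMaskingTable:
    """Toy model of LUN masking: a LUN is visible only to explicitly granted initiators."""

    def __init__(self):
        # Maps a LUN identifier to the set of initiator names allowed to access it.
        self._allowed = {}

    def grant(self, lun_id, initiator):
        """Allow the given initiator (host port) to access the LUN."""
        self._allowed.setdefault(lun_id, set()).add(initiator)

    def revoke(self, lun_id, initiator):
        """Remove the initiator's access to the LUN, if it had any."""
        self._allowed.get(lun_id, set()).discard(initiator)

    def is_visible(self, lun_id, initiator):
        """Default deny: a host that was never granted access does not see the LUN."""
        return initiator in self._allowed.get(lun_id, set())

    def visible_luns(self, initiator):
        """Return the sorted list of LUNs this initiator is allowed to see."""
        return sorted(lun for lun, hosts in self._allowed.items() if initiator in hosts)
```

In this model, granting LUN 0 only to a hypothetical initiator `host-a` means that `is_visible(0, "host-b")` is false, so `host-b` cannot modify or remove the data on that LUN by accident.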