How To Protect Yourself From Hard Disk Crashes & Failures

By Angsuman Chakraborty, Gaea News Network
Monday, January 15, 2007

Most modern hard disks have S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) built in which, if enabled, allows you to query the hard drive about its health and performance. Let’s look at some of the critical attributes and how you can determine the health of your hard disk.

Mechanical failures, which are usually predictable, account for 60 percent of drive failures. The purpose of S.M.A.R.T. is to warn a user or system administrator of an impending mechanical drive failure while time remains to take preventive action, such as copying the data to a replacement device or taking regular backups. Approximately 30 percent of failures can be predicted by S.M.A.R.T.

Note: Most modern drives support S.M.A.R.T. However, drives connected via SCSI or hardware RAID will not work. Drives connected via SATA (Serial ATA) are supported, as are drives configured as software RAID (dynamic disks) via Windows Disk Management.

Each drive manufacturer defines a set of attributes and selects threshold values that those attributes should not fall below under normal operation. Attribute values can range from 1 to 253 (1 representing the worst case and 253 representing the best). Depending on the manufacturer, a value of 100 or 200 will often be chosen as the “normal” value.
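To make the value-versus-threshold idea concrete, here is a minimal Python sketch. The class name, its fields and the sample numbers are hypothetical and chosen only for illustration; they do not come from any particular drive or tool. It follows the wording above and flags an attribute only when its value drops below the threshold; some tools may be stricter and also flag a value that is equal to the threshold.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One S.M.A.R.T. attribute as a normalized value plus its threshold."""
    name: str
    value: int       # normalized value, 1 (worst) to 253 (best)
    threshold: int   # manufacturer-chosen lower limit

    def margin(self) -> int:
        # How far the current value sits above the threshold.
        return self.value - self.threshold

    def below_threshold(self) -> bool:
        # Follows the article's wording: the value "should not go below".
        return self.value < self.threshold


# Hypothetical readings for illustration only.
healthy = SmartAttribute("Reallocated Sectors Count", value=100, threshold=36)
worn = SmartAttribute("Reallocated Sectors Count", value=30, threshold=36)

for attr in (healthy, worn):
    status = "BELOW THRESHOLD" if attr.below_threshold() else "ok"
    print(f"{attr.name}: value={attr.value} threshold={attr.threshold} "
          f"margin={attr.margin()} -> {status}")
```

A shrinking margin over time is usually more informative than a single reading, so logging these values periodically is a reasonable design choice.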

S.M.A.R.T. is supported by the majority of hard disk manufacturers including, but not limited to, Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor and Western Digital.

They do not necessarily agree on precise attribute definitions and measurement units; therefore the following list of critical attributes should be regarded as a general reference only.

Overview of critical S.M.A.R.T. attributes and their description

ID | Hex | Attribute name | Description
01 | 01 | Read Error Rate | Indicates the rate of hardware read errors that occurred when reading data from a disk surface. Lower values indicate a problem with either the disk surface or the read/write heads.
05 | 05 | Reallocated Sectors Count | Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks the sector as “reallocated” and transfers the data to a special reserved area (spare area). This process is also known as remapping, and “reallocated” sectors are called remaps. This is why, on modern hard disks, you cannot see “bad blocks” while testing the surface: all bad blocks are hidden in reallocated sectors. However, the more sectors that are reallocated, the more read/write speed will decrease.
06 | 06 | Read Channel Margin | Margin of a channel while reading data. The function of this attribute is not specified.
196 | C4 | Reallocation Event Count | Count of remap operations. The raw value of this attribute shows the total number of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
197 | C5 | Current Pending Sector Count | Number of “unstable” sectors (waiting to be remapped). If an unstable sector is subsequently written or read successfully, this value is decreased and the sector is not remapped. Read errors on the sector will not remap it; it will only be remapped on a failed write attempt. This can be problematic to test because cached writes will not remap the sector; only direct I/O writes to the disk will.
198 | C6 | Uncorrectable Sector Count | The total number of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
220 | DC | Disk Shift | Distance the disk has shifted relative to the spindle (usually due to shock). The unit of measure is unknown.
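The table can also be expressed as a small lookup in Python. This is only a sketch: the IDs and names are copied from the table above, while the rule that any non-zero raw count on the sector-related attributes deserves attention is my own simplifying assumption, not a manufacturer rule. The function names are hypothetical.

```python
from typing import Dict, List

# IDs and names copied from the table above.
CRITICAL_ATTRIBUTES = {
    1: "Read Error Rate",
    5: "Reallocated Sectors Count",
    6: "Read Channel Margin",
    196: "Reallocation Event Count",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
    220: "Disk Shift",
}

# Assumption for illustration: for these sector-related counters,
# any non-zero raw value is worth a closer look.
SECTOR_COUNTERS = {5, 196, 197, 198}


def hex_id(attr_id: int) -> str:
    """Return the two-digit hexadecimal form of an attribute ID."""
    return f"{attr_id:02X}"


def sector_warnings(raw_values: Dict[int, int]) -> List[str]:
    """List the sector-related attributes whose raw count is non-zero."""
    warnings = []
    for attr_id in sorted(SECTOR_COUNTERS):
        raw = raw_values.get(attr_id, 0)  # missing attributes count as zero
        if raw > 0:
            name = CRITICAL_ATTRIBUTES[attr_id]
            warnings.append(f"{attr_id} ({hex_id(attr_id)}) {name}: raw value {raw}")
    return warnings


# Hypothetical raw readings keyed by attribute ID.
for line in sector_warnings({5: 0, 197: 3}):
    print(line)
```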

There are several free and commercial tools available to determine the health of your hard disk.

I prefer HDD Health from Panterasoft. It provides a detailed listing of detected S.M.A.R.T. attributes, including ones it couldn’t decipher. You can use the table above to understand what these parameters mean for your hard disks. It can also notify you by email, network message, popup and sound before an impending hard disk failure.

A safe corporate strategy is to manage the hard disks across all machines with a S.M.A.R.T.-aware tool, so that you are centrally notified of impending failures and can prepare for contingencies. Monitoring the server machines is of critical importance. For Linux servers I would recommend smartmontools and utilities based on it, such as those that provide a web view.
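As a rough sketch of how such monitoring could be scripted on Linux, the Python below parses the attribute table printed by smartctl -A (smartctl ships with smartmontools). The column layout, the sample text and the device path /dev/sda are assumptions for illustration; the output on your system may differ, and running smartctl typically requires root privileges.

```python
import subprocess

# Illustrative sample in the style of `smartctl -A` output; not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""


def parse_attribute_table(text):
    """Parse attribute rows (assumed 10-column layout) into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip headers and other text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Skip rows whose value/worst/threshold columns are not plain numbers.
        if not all(f.isdigit() for f in fields[3:6]):
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "threshold": int(fields[5]),
            "raw": " ".join(fields[9:]),
        })
    return rows


def read_attributes(device):
    """Run smartctl on a device (hypothetical path such as /dev/sda)."""
    # smartctl uses its exit status as a bitmask, so a non-zero status
    # is not treated as an error here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return parse_attribute_table(result.stdout)


rows = parse_attribute_table(SAMPLE)
for row in rows:
    print(row["id"], row["name"], row["value"], row["threshold"], row["raw"])
```

On a real server you would call read_attributes("/dev/sda") from a scheduled job and forward anything suspicious to your central notification system.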
