Analyzing a Faulty Hard Disk using Smartctl

From Thomas-Krenn-Wiki

Under Linux, smartctl reads the SMART (Self-Monitoring, Analysis and Reporting Technology) information from a hard disk. This example shows how to analyze a defective hard disk. The disk in this example can no longer read several sectors, so it is defective and has to be replaced.

Displaying SMART Information

The smartctl -a /dev/DEVICENAME command displays all SMART information for the specified hard disk. The disk in this example shows elevated raw values for several error-related SMART attributes, even though the overall-health self-assessment still reports PASSED.

root@ubuntu-10-10:~# smartctl -a /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG series
Device Model:     SAMSUNG HD502HI
Serial Number:    S1VZJ9CS712490
Firmware Version: 1AG01118
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Feb  9 15:30:42 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:          (6312) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 106) minutes.
Conveyance self-test routine
recommended polling time:      (  12) minutes.
SCT capabilities:            (0x003f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       2376
  3 Spin_Up_Time            0x0007   091   091   011    Pre-fail  Always       -       3620
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       405
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       717
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       405
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       2375
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       2375
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   084   074   000    Old_age   Always       -       16 (Lifetime Min/Max 16/16)
194 Temperature_Celsius     0x0022   084   071   000    Old_age   Always       -       16 (Lifetime Min/Max 16/16)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       3558
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       81
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@ubuntu-10-10:~#

Analysis

In this example, the following attributes are relevant for a detailed analysis.

  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       2376
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       2375
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       2375
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       3558
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       81
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       1

The RAW_VALUE of the Current_Pending_Sector attribute indicates how many of the hard disk’s sectors can no longer be read and are waiting to be re-mapped (81 in this example).[1] Detailed information about the other attributes is available in the ATA S.M.A.R.T. Attributes section of the Wikipedia article about SMART.[2]
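The manual selection of attributes above can be approximated with a small filter. This is only a sketch: the function name nonzero_error_attrs and the list of watched IDs are assumptions taken from the attributes discussed in this example, not an official list. Raw values are vendor-specific, so a non-zero value is a hint to look closer rather than proof of a defect.

```shell
# Print NAME=RAW_VALUE for selected SMART attributes whose raw value is non-zero.
# Reads `smartctl -A` output on stdin.
# The watched IDs (1, 13, 187, 195, 197, 199) are an illustrative choice
# based on this example, not an official list.
nonzero_error_attrs() {
    awk -v ids="1 13 187 195 197 199" '
        BEGIN {
            n = split(ids, a, " ")
            for (i = 1; i <= n; i++) watch[a[i]] = 1
        }
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 2 is the attribute name, column 10 the raw value.
        $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 in watch) && $10 + 0 > 0 {
            print $2 "=" $10
        }
    '
}
```

It could be used as root like this: smartctl -A /dev/sdb | nonzero_error_attrs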

SMART Tests

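SMART supports several self-tests. According to the capabilities listed in the output above, this disk supports short, extended, conveyance and selective self-tests. The details are described in the smartctl man page.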

Short Test

We will start a short test in this example.

root@ubuntu-10-10:~# smartctl -t short /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Wed Feb  9 15:35:31 2011

Use smartctl -X to abort test.
root@ubuntu-10-10:~#

Displaying the Test Results

The command smartctl -l selftest /dev/sdb displays the test results. In this example, the test ended with a read failure; the LBA_of_first_error column gives the address of the first sector that could not be read.

root@ubuntu-10-10:~# smartctl -l selftest /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%       717         555027747

root@ubuntu-10-10:~#
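When the LBA is needed in a script, it can be pulled from the self-test log instead of being copied by hand. The following is a minimal sketch; the function name latest_selftest_lba is hypothetical, and it assumes the log layout shown above, where the most recent entry is numbered "# 1" and the LBA is the last column.

```shell
# Print the LBA_of_first_error of the most recent self-test log entry ("# 1").
# Reads `smartctl -l selftest` output on stdin.
# Prints nothing if the log has no entries; the log shows "-" in this
# column when a test finished without a read error.
latest_selftest_lba() {
    awk '$1 == "#" && $2 == "1" { print $NF; exit }'
}
```

It could be used as root like this: smartctl -l selftest /dev/sdb | latest_selftest_lba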

Forcing the Re-mapping of a Defective Sector

When a defective sector is written to, the hard disk attempts to re-map it to a spare sector. The original content of the sector is lost in the process. Details are on the Bad Block HOWTO page.[3]

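The following commands show the re-mapping process for one sector (the steps follow the Bad Block HOWTO page). The file system block that is tested with debugfs and overwritten with dd is derived from the LBA reported by the self-test, the start sector of the partition and the file system block size. Afterwards, the Current_Pending_Sector counter has dropped from 81 to 80.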

root@ubuntu-10-10:~# fdisk -lu /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x20d1585d

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        2048      206847      102400    7  HPFS/NTFS
Partition 1 does not end on cylinder boundary.
/dev/sdb2          206848    97863097    48828125    7  HPFS/NTFS
/dev/sdb3        97868041   976768064   439450012    5  Extended
/dev/sdb5        97868043   964703249   433417603+  83  Linux
/dev/sdb6       964703313   976768064     6032376   82  Linux swap / Solaris
root@ubuntu-10-10:~# tune2fs -l /dev/sdb5 | grep Block
Block count:              108354400
Block size:               4096
Blocks per group:         32768
root@ubuntu-10-10:~# debugfs 
debugfs 1.41.12 (17-May-2010)
debugfs:  open /dev/sdb5
debugfs:  testb 57144963
Block 57144963 not in use
debugfs:  quit
root@ubuntu-10-10:~# dd if=/dev/zero of=/dev/sdb5 bs=4096 count=1 seek=57144963
1+0 records in
1+0 records out
4096 bytes (4,1 kB) copied, 0,000379164 s, 10,8 MB/s
root@ubuntu-10-10:~# sync
root@ubuntu-10-10:~# smartctl -A /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       80
[...]
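The block number 57144963 used with debugfs and dd above follows from the values in the session: the LBA reported by the self-test (555027747), the start sector of /dev/sdb5 (97868043), the 512-byte sector size and the 4096-byte file system block size. The Bad Block HOWTO [3] gives the calculation as (LBA - partition start) * sector size / block size, rounded down. A sketch of that arithmetic, with the hypothetical function name lba_to_fs_block:

```shell
# lba_to_fs_block LBA PART_START FS_BLOCK_SIZE [SECTOR_SIZE]
# Convert an absolute disk LBA into a file system block number within
# the partition that starts at sector PART_START.
# SECTOR_SIZE defaults to 512 bytes, as reported by fdisk in this example.
lba_to_fs_block() {
    lba=$1
    start=$2
    bs=$3
    ss=${4:-512}
    # Shell integer division truncates, which gives the rounded-down block.
    echo $(( ($lba - $start) * $ss / $bs ))
}
```

With the values from this example, lba_to_fs_block 555027747 97868043 4096 yields 57144963. The calculation is only meaningful if the LBA lies inside the partition, that is, between its start and end sectors as listed by fdisk -lu.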

A second short test is then started. It reports the next defective sector (LBA 555027784), which is re-mapped in the same way; the Current_Pending_Sector counter drops to 79.

root@ubuntu-10-10:~# smartctl -t short /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Wed Feb  9 15:47:41 2011

Use smartctl -X to abort test.
root@ubuntu-10-10:~# smartctl -l selftest /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%       717         555027784
# 2  Short offline       Completed: read failure       20%       717         555027747

root@ubuntu-10-10:~# debugfs 
debugfs 1.41.12 (17-May-2010)
debugfs:  open /dev/sdb5
debugfs:  testb 57144967
Block 57144967 not in use
debugfs:  quit
root@ubuntu-10-10:~# dd if=/dev/zero of=/dev/sdb5 bs=4096 count=1 seek=57144967
1+0 records in
1+0 records out
4096 bytes (4,1 kB) copied, 0,000374713 s, 10,9 MB/s
root@ubuntu-10-10:~# smartctl -A /dev/sdb
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       79
[...]
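In both runs above, debugfs testb confirmed that the block was not in use before dd overwrote it. If this check is scripted, the overwrite should only happen when that exact message appears. The sketch below assumes the "Block N not in use" wording shown in the sessions above; the function name block_is_free is hypothetical.

```shell
# Succeed (exit status 0) only if the debugfs "testb" output on stdin
# reports the block as not in use, using the message format shown above.
# Any other output, including an empty one, is treated as "do not overwrite".
block_is_free() {
    grep -q '^Block [0-9][0-9]* not in use$'
}
```

It could be used as root like this, with the device and block number from this example: debugfs -R "testb 57144967" /dev/sdb5 | block_is_free && echo "block is free". If the block is in use, the Bad Block HOWTO [3] describes how to find the affected file before anything is overwritten.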

References

  1. 9133: S.M.A.R.T. Attribute: Current Pending Sector Count (Acronis Knowledge Base)
  2. S.M.A.R.T. (en.wikipedia.org)
  3. Bad block HOWTO for smartmontools
