On Linux there is a tool called smartctl which talks to the SMART (Self-Monitoring, Analysis and Reporting Technology) firmware on disks. Most current drives come with SMART. It's a feature built into the drive that keeps diagnostics on the drive so that you can tell when it is going bad. It can also run drive tests independently of the computer's CPU.

First install smartctl, which is part of the smartmontools package (this example is for Debian/Ubuntu):

apt-get install smartmontools

Find out the device name of your drive (sda, sdb and so on):

cat /proc/partitions

1) Reading the Attribute Values

Run this next command. Close to the top of the output you will see a list of attributes and their values.

smartctl -a /dev/sda

Here is the list of attributes that I look for.

* I look for ATA errors. More than 1 ATA error and I want to replace the drive. Some will argue that not every ATA error is a reason to replace the drive, and they are correct. You can scroll down the output of smartctl -a to see when the last ATA error took place (at what power-on hour), then check whether that hour is recent compared to the current Power_On_Hours attribute. The "ATA Error Count" line only shows up if the drive has logged errors, so if you don't have any you will not see ATA errors in your output.
* More than 50 reallocated sectors (Reallocated_Sector_Ct) and I want to replace the drive.
* If Offline_Uncorrectable and Current_Pending_Sector are growing at the same time, that's usually a bad sign (so replace the drive).
* Personally, if a drive has any Offline_Uncorrectable or Current_Pending_Sector count at all, I consider it a bad drive.
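
The rules of thumb in the list above can be sketched as a small filter. This is only a sketch: check_smart_attrs is a made-up helper name, the thresholds are my own rules from the list (not values defined by SMART), and it assumes the usual 10-column attribute table where the raw value is the 10th field.

```shell
#!/bin/sh
# Sketch only: check_smart_attrs is a hypothetical helper name.
# It reads `smartctl -A` output on stdin and applies the rules of thumb
# from the list above (more than 50 reallocated sectors, or any pending
# or offline-uncorrectable sectors).
check_smart_attrs() {
    awk '
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        $2 == "Reallocated_Sector_Ct"  && $10 + 0 > 50 { print "REPLACE: " $2 " = " $10; bad = 1 }
        $2 == "Current_Pending_Sector" && $10 + 0 > 0  { print "REPLACE: " $2 " = " $10; bad = 1 }
        $2 == "Offline_Uncorrectable"  && $10 + 0 > 0  { print "REPLACE: " $2 " = " $10; bad = 1 }
        END { if (!bad) print "OK" }
    '
}
```

Usage would look something like: smartctl -A /dev/sda | check_smart_attrs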

NOTE: these attributes accumulate on the drive over time as it is used.
NOTE: you can run smartctl -a /dev/sda without unmounting the drive. The drive can stay in use.
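
For the ATA error point in the list above, the "is the last error recent?" check can be sketched like this. last_ata_error_age is a made-up helper name, and the sketch assumes the error log lines read "Error N occurred at disk power-on lifetime: H hours", which is how smartctl prints them for the ATA drives I have looked at. Other versions or drive types may differ.

```shell
#!/bin/sh
# Sketch only: last_ata_error_age is a hypothetical helper name.
# It reads `smartctl -a` output on stdin and prints how many power-on
# hours ago the most recent logged ATA error happened.
last_ata_error_age() {
    awk '
        # Current drive age: raw value (10th field) of Power_On_Hours
        $2 == "Power_On_Hours" { now = $10 + 0 }
        # Assumed log line: "Error N occurred at disk power-on lifetime: H hours ..."
        /^Error [0-9]+ occurred at disk power-on lifetime:/ {
            h = $8 + 0
            if (!seen || h > last) last = h
            seen = 1
        }
        END {
            if (!seen) print "no ATA errors logged"
            else print now - last
        }
    '
}
```

Usage would look something like: smartctl -a /dev/sda | last_ata_error_age. A small number means the error is recent, and a large number means it happened long ago in the drive's life.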

I use these commands to look at all sd and hd drives (drive names can start with sd or hd; on other systems drives can use other prefixes):

### sd and hd drives ###
for dev in /dev/sd[a-z] /dev/hd[a-z]; do
  [ -b "$dev" ] || continue  # skip names with no matching block device
  smartctl -a "$dev" | awk -v name="${dev#/dev/}" '
    /^Device Model:|^Serial Number:/ {out = out ":" $NF}  # last word of model / serial
    /^User Capacity:/ {out = out "(" $3 " bytes)"}
    # the raw value is the 10th column of the attribute table
    /Reallocated_Sector_Ct/ {out = out "(RlSec: " $10 ")"}
    /Power_On_Hours/ {out = out "(PowHrs: " $10 ")"}
    /Current_Pending_Sector/ {out = out "(CurrPenSec: " $10 ")"}
    /Offline_Uncorrectable/ {out = out "(OffUncor: " $10 ")"}
    /^ATA Error Count:/ {out = out "(ATAerrs: " $4 ")"}
    END {print "- " name out}'
done

NOTE: if you have a Marvell-based unit (for example one running an ARM processor) you might need to add "-d marvell" to your smartctl command to get results from smartctl to show up.

NOTE: the output may vary and these commands might not work, as older or future versions of smartctl might lay out the output differently.

2) Using Self Tests

You can run online self tests like the "short" test, which takes up to 5 minutes per drive. Each drive runs its own test using its own SMART firmware, so you can launch them all in parallel.

Run smartctl -a /dev/sda and note how long it says a short test will take (the recommended polling time). It is usually between 1 and 5 minutes.

smartctl -t short /dev/sda

NOTE: the command gives control back to the user immediately, as the test runs in the background on the drive itself.

NOTE: you can run these tests while the system is running. Only tests started in captive mode (smartctl -C) keep the drive busy for the length of the test.

Then, after a minute, run "smartctl -a /dev/sda" and you will see the progress, or the results if it's done. Look at the Power_On_Hours attribute first to see the current lifetime hours of the drive. Then make sure the self-test results you're reading show the same hour, as that tells you the results are from the test you just ran.

If a drive is bad you will see its errors.

NOTE: if you repeat "smartctl -a /dev/sda" while the test is running you will see the progress counting down toward 0%. It runs in the reverse direction because the number is the percentage of the test remaining, not the percentage completed.
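
If you want to pull just that progress number out of the output, something like this sketch works. selftest_remaining is a made-up helper name, and it assumes the status line reads "NN% of test remaining", which is what smartctl prints on my drives.

```shell
#!/bin/sh
# Sketch only: selftest_remaining is a hypothetical helper name.
# It reads `smartctl -c` (or `smartctl -a`) output on stdin and prints the
# percentage of the self-test still remaining. It prints nothing when no
# test is in progress.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}
```

Usage would look something like: smartctl -c /dev/sda | selftest_remaining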

NOTE: the results of the self-tests all begin with a pound sign (#).
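
Since every result line starts with a pound sign, you can filter for the ones that did not finish cleanly. This is a sketch under assumptions: failed_selftests is a made-up helper name, and it treats any result other than "Completed without error" (or a test still in progress) as worth a look.

```shell
#!/bin/sh
# Sketch only: failed_selftests is a hypothetical helper name.
# It reads `smartctl -l selftest` (or `smartctl -a`) output on stdin and
# prints the self-test log entries that did not complete cleanly.
failed_selftests() {
    awk '/^# *[0-9]+ / && !/Completed without error/ && !/in progress/ { print }'
}
```

Usage would look something like: smartctl -l selftest /dev/sda | failed_selftests. No output means every logged test completed without error.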

Run tests on all drives that cat /proc/partitions detects (this is not dangerous as they are just simple short tests that are run online):

# Run SMARTCTL short test
(for i in $(awk '$4 ~ /^(sd|hd)[a-z]+$/ {print $4}' /proc/partitions); do
echo "- $i"
smartctl -t short "/dev/$i"
done)

See the results of the short tests of all the drives (notice I'm just grepping for Power_On hours and the pound sign):

# Check SMARTCTL test results
(for i in $(awk '$4 ~ /^(sd|hd)[a-z]+$/ {print $4}' /proc/partitions); do
echo "- $i"
smartctl -a "/dev/$i" | grep -E 'Power_On|^# ' | head -n 3
done)

NOTE: if you have a Marvell-based unit (for example one running an ARM processor) you might need to add "-d marvell" to your smartctl command to get smartctl to do anything.

NOTE: the output may vary and these commands might not work, as older or future versions of smartctl might lay out the output differently.

3) Dmesg

Dmesg can be full of great information about a drive. If you plug in a drive and it shows up incorrectly in "cat /proc/partitions", I would immediately go to dmesg. It will tell you things like whether the controller is communicating properly with the drive. It will also show ATA errors that never get recorded in the drive's SMART log because the system couldn't reach the drive. Run dmesg on a system that has good drives and learn the output. You will see clear differences in the dmesg output when a drive or a controller is not working properly.
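
To cut dmesg down to the disk-related lines, a filter along these lines can help. It is only a sketch: disk_kernel_msgs is a made-up helper name, and the patterns (ataN: prefixes, [sdX] tags, "I/O error") are my assumptions about the common kernel message formats, so adjust them to what your kernel actually prints.

```shell
#!/bin/sh
# Sketch only: disk_kernel_msgs is a hypothetical helper name.
# It reads kernel log text on stdin and keeps lines that look disk related.
# The patterns are assumptions about common message formats:
#   ata1: / ata1.00:   messages from the ATA layer
#   [sda]              messages from the SCSI disk driver
#   I/O error          block layer read/write failures
disk_kernel_msgs() {
    grep -Ei '(^|[^a-z])ata[0-9]+(\.[0-9]+)?: |\[sd[a-z]+]|I/O error'
}
```

Usage would look something like: dmesg | disk_kernel_msgs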

Remember, sometimes it's not a bad drive but a bad slot, controller or motherboard, so it's sometimes wise to test a drive in multiple systems. If a drive is not showing up on the system, don't immediately assume it's a bad drive, as it could be a bad controller. However, if the drive loads up and you can see SMART information, then you can rely on the SMART info to judge whether the drive is bad, since the attributes and self-tests are recorded by the drive itself and not by the controller. Remember: you can rely on the drive's SMART tests.

What to do with a bad drive?

If it's in a healthy RAID then simply replace the drive; the array will rebuild the data onto the new drive from the remaining drives. If it's a lone drive in its own volume, get a new drive that is the same size or bigger and try to clone the bad drive to the new drive. Use GNU ddrescue (package name gddrescue) when you're cloning. Its command name is ddrescue. You don't want the other, older tool, whose command name is dd_rescue. Here are some articles on it: ddrescue article 1, ddrescue article 2, ddrescue article 3
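
As a rough idea of what the clone looks like, here is a sketch that only prints the two ddrescue commands I would run, so nothing gets overwritten by accident. ddrescue_plan is a made-up helper name, and /dev/sdX, /dev/sdY and rescue.map are placeholders for the bad drive, the new drive and the map file. The two-pass approach (quick copy first, retries afterwards) is one common way to use GNU ddrescue, not the only one.

```shell
#!/bin/sh
# Sketch only: ddrescue_plan is a hypothetical helper name.
# It prints the commands and runs nothing.
# Arguments: source (bad) drive, destination (new) drive, optional map file.
ddrescue_plan() {
    src=$1
    dst=$2
    map=${3:-rescue.map}
    # Pass 1: copy the readable areas first, skipping the slow scraping phase (-n).
    # -f is needed because the output is a device, not a regular file.
    printf 'ddrescue -f -n %s %s %s\n' "$src" "$dst" "$map"
    # Pass 2: go back for the bad areas with direct disk access (-d) and 3 retry passes (-r3).
    printf 'ddrescue -f -d -r3 %s %s %s\n' "$src" "$dst" "$map"
}

ddrescue_plan /dev/sdX /dev/sdY
```

Keep the map file on a third, healthy disk. It records which areas have already been copied, so the second pass (or a restart after an interruption) only works on what is still missing. Double check which device is the source and which is the destination before running the printed commands.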
