Disk errors

I was getting these errors from one of my new hard disks:

Feb  3 00:16:07 orac kernel: [78407.504324] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  3 00:16:07 orac kernel: [78407.504610] ata3.01: BMDMA stat 0x64
Feb  3 00:16:07 orac kernel: [78407.504881] ata3.01: failed command: READ DMA
Feb  3 00:16:07 orac kernel: [78407.505162] ata3.01: cmd c8/00:08:98:0f:c1/00:00:00:00:00/f0 tag 0 dma 4096 in
Feb  3 00:16:07 orac kernel: [78407.505163]          res 51/40:08:98:0f:c1/00:00:00:00:00/f0 Emask 0x9 (media error)
Feb  3 00:16:07 orac kernel: [78407.505722] ata3.01: status: { DRDY ERR }
Feb  3 00:16:07 orac kernel: [78407.506002] ata3.01: error: { UNC }
Feb  3 00:16:08 orac kernel: [78407.781740] ata3.00: configured for UDMA/133
Feb  3 00:16:08 orac kernel: [78407.801565] ata3.01: configured for UDMA/133
Feb  3 00:16:08 orac kernel: [78407.801578] ata3: EH complete
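The cmd and res fields in that log are raw ATA register values. Here is a minimal Python sketch that decodes them. It assumes the field order libata uses when printing them: command, feature, sector count, LBA low/mid/high, then the five 48-bit "HOB" bytes, then the device byte. In the res field, the first two bytes are the status and error registers. The bit tables cover only a subset of the ATA status and error bits, and the helper names are my own.

```python
# Decode the raw ATA registers from a libata error report.
# Assumed field order (as libata prints it):
#   command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# In a "res" field the first two bytes are the status and error registers instead.

STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def parse_taskfile(text):
    """Turn 'c8/00:08:98:0f:c1/00:00:00:00:00/f0' into a list of 12 byte values."""
    fields = text.replace("/", ":").split(":")
    if len(fields) != 12:
        raise ValueError("expected 12 taskfile bytes, got %d" % len(fields))
    return [int(f, 16) for f in fields]


def bit_names(value, table):
    """Names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def lba28(tf):
    """28-bit LBA: low nibble of the device byte, then LBA high, mid, low."""
    return ((tf[11] & 0x0F) << 24) | (tf[5] << 16) | (tf[4] << 8) | tf[3]


cmd = parse_taskfile("c8/00:08:98:0f:c1/00:00:00:00:00/f0")
res = parse_taskfile("51/40:08:98:0f:c1/00:00:00:00:00/f0")

print("command 0x%02x" % cmd[0])               # 0xc8 is READ DMA
print("sectors", cmd[2], "=", cmd[2] * 512, "bytes")
print("device ", (cmd[11] >> 4) & 1)           # DEV bit: 1 = second device on the channel
print("LBA    ", lba28(cmd))
print("status ", bit_names(res[0], STATUS_BITS))
print("error  ", bit_names(res[1], ERROR_BITS))
```

Decoded this way, the failed read covered 8 sectors (the 4096 bytes in the log) starting at LBA 12652440 on device 1 of the channel (the .01 in ata3.01). The error register has only UNC set, meaning the drive reported an uncorrectable data error. The status byte 0x51 also has bit 0x10 (DSC) set, but this kernel simply doesn't list it.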

So I searched for a solution. I found the thread “[ubuntu] Hard Drive Error : ata3.00: status: { DRDY ERR }”, and in it hobong says:

It’s Kernel Bug on ata ACPI. I put “options libata noacpi=1” on /etc/modprobe.d/options and the ERROR is gone.

This is supplemented by a later comment from thatmattbone:

I think in 9.10, any file ending in “.conf” in /etc/modprobe.d is parsed. I created a new file, /etc/modprobe.d/options.conf and put the “options libata noacpi=1” in there.

So I created /etc/modprobe.d/options.conf with the content “options libata noacpi=1” and then I rebooted.

On reboot the disk was recognised as containing errors and fsck was forced. I had the opportunity to cancel, but I let it run. While it was running, a whole heap of the same original errors came through. I’m not sure whether that was because the /etc/modprobe.d/options.conf file hadn’t done the trick, or because it was too early in the boot process and /etc/modprobe.d/options.conf hadn’t been processed yet.
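One way to tell whether the option was actually applied is to ask the running kernel. The sketch below parses options lines as they appear in /etc/modprobe.d/*.conf, then reads the live value back from sysfs. It assumes libata exposes noacpi as a readable parameter under /sys/module/libata/parameters/, which I believe is the usual case. The function names are hypothetical. If libata is built into the kernel, options in /etc/modprobe.d are never applied, and the equivalent setting goes on the kernel command line as libata.noacpi=1. If libata is loaded from the initramfs, the initramfs may hold an older copy of the modprobe configuration.

```python
import os


def parse_options(text):
    """Collect 'options <module> <param>=<value> ...' lines into a nested dict.

    Simplification: ignores line continuations and every command except 'options'.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # blank lines and comments
            continue
        words = line.split()
        if words[0] != "options" or len(words) < 2:
            continue
        params = result.setdefault(words[1], {})
        for word in words[2:]:
            key, _, value = word.partition("=")   # a bare flag gets an empty value
            params[key] = value
    return result


def live_param(module, param, sysfs="/sys/module"):
    """Value the running kernel reports for a module parameter, or None if unreadable."""
    path = os.path.join(sysfs, module, "parameters", param)
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


config = parse_options("# from /etc/modprobe.d/options.conf\noptions libata noacpi=1\n")
print(config)                          # {'libata': {'noacpi': '1'}}
print(live_param("libata", "noacpi"))  # 1 if the option took effect, 0 if not, None if unreadable
```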

Anyway, I needed to try and fix this problem, so I ran lshw -C disk to see what I could see and found the following:

root@orac:~# lshw -C disk

  *-disk:0
       description: ATA Disk
       product: ST32000644NS
       vendor: Seagate
       physical id: 0.0.0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: SN12
       serial: 9WM67R7A
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=9302d195-5ffc-41f2-949f-2899017a4dc0
  *-disk:1
       description: ATA Disk
       product: SAMSUNG HD204UI
       physical id: 0.1.0
       bus info: scsi@2:0.1.0
       logical name: /dev/sdb
       version: 1AQ1
       serial: S2K4J1CBA13712
       size: 1863GiB (2TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 signature=91cd6331

As you can see, my new disk, sdb, was reported with different capabilities from my old disk (partitioned:dos rather than gpt), and my old disk seemed to be working fine, so I figured I’d have a look into that.

It turns out that fdisk creates MBR (DOS-style) partition tables, but there’s a newer scheme known as the GUID Partition Table, or just GPT.
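The difference lshw picked up lives in the first two sectors of the disk. Here is a minimal sketch of how to tell the two apart, assuming 512-byte logical sectors. An MBR disk has the boot signature 0x55 0xAA at byte 510. A GPT disk also has that signature, but on a "protective" MBR whose partition entry has type 0xEE, and its second sector holds a GPT header starting with the signature EFI PART. For simplicity the sketch checks only the first MBR entry, and it builds synthetic images rather than reading a real disk.

```python
SECTOR = 512  # assumption: 512-byte logical sectors


def partition_scheme(first_two_sectors):
    """Classify a disk from its first 1024 bytes as 'gpt', 'mbr' or 'none'."""
    mbr = first_two_sectors[:SECTOR]
    header = first_two_sectors[SECTOR:2 * SECTOR]
    if mbr[510:512] != b"\x55\xaa":
        return "none"                      # no MBR boot signature at all
    # The first MBR partition entry starts at offset 446; its type byte is at +4.
    if mbr[446 + 4] == 0xEE and header[:8] == b"EFI PART":
        return "gpt"                       # protective MBR plus a GPT header in LBA 1
    return "mbr"


def fake_disk(part_type, gpt_header):
    """Build a synthetic 1024-byte disk start for testing."""
    mbr = bytearray(SECTOR)
    mbr[446 + 4] = part_type
    mbr[510:512] = b"\x55\xaa"
    lba1 = bytearray(SECTOR)
    if gpt_header:
        lba1[:8] = b"EFI PART"
    return bytes(mbr + lba1)


print(partition_scheme(fake_disk(0xEE, True)))    # gpt
print(partition_scheme(fake_disk(0x83, False)))   # mbr
```

To try it on a real disk, you would pass it the first 1024 bytes read from, say, /dev/sdb, which needs root.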

There are tools for working with GPT on Linux, notably GPT fdisk, which provides the command-line tool gdisk. The gdisk utility wasn’t available on my system, but I was able to install it with apt-get:

root@orac:~# apt-get install gdisk

Then I ran gdisk on my broken disk and it reported MBR only:

root@orac:~# gdisk /dev/sdb
GPT fdisk (gdisk) version 0.5.1

Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present


***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
THIS OPERATON IS POTENTIALLY DESTRUCTIVE! Exit by typing 'q' if
you don't want to convert your MBR partitions to GPT format!
***************************************************************

Warning! Secondary partition table overlaps the last partition by 33 blocks
You will need to delete this partition or resize it in another utility.

Command (? for help): q

You will also notice that last warning, about something dodgy with the secondary partition table overlapping the last partition. Maybe these issues were related to the errors I was getting? I doubt it, but who knows.
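That warning comes down to arithmetic. GPT keeps a backup copy of its header and partition entries at the very end of the disk, so the last sectors an MBR partition was free to use are off limits under GPT. Here is a sketch of the standard layout, assuming 512-byte sectors and the 128 entries of 128 bytes each that gdisk reports ("Partition table holds up to 128 entries"):

```python
def gpt_usable_range(total_sectors, sector_size=512, entries=128, entry_size=128):
    """First and last LBAs a partition may use on a disk with a standard GPT."""
    entry_sectors = -(-entries * entry_size // sector_size)  # ceiling division: 32 here
    # Front: LBA 0 protective MBR, LBA 1 primary header, then the entry array.
    first_usable = 2 + entry_sectors
    # Back: the backup entry array, then the backup header in the very last LBA.
    last_usable = total_sectors - 2 - entry_sectors
    return first_usable, last_usable


total = 3907029168                      # sectors on /dev/sdb, per gdisk
first, last = gpt_usable_range(total)
print(first, last)                      # 34 3907029134
# Assumption: the old MBR partition ended on the final sector, LBA total - 1.
print("overlap:", (total - 1) - last)   # overlap: 33
```

The result matches the "last usable sector is 3907029134" that gdisk reports for these disks. A 33-block overlap is consistent with the old MBR partition running right to the disk's final sector. That is my inference, since the original MBR table isn't shown.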

Anyway, I decided to put a new GPT partition table on my new disk and reformat the whole thing in the hope that I could get it to work.

I ran gdisk on my good disk to see what types of partitions it had:

root@orac:~# gdisk /dev/sda
GPT fdisk (gdisk) version 0.5.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): ?
b       back up GPT data to a file
c       change a partition's name
d       delete a partition
i       show detailed information on a partition
l       list known partition types
n       add a new partition
o       create a new empty GUID partition table (GPT)
p       print the partition table
q       quit without saving changes
r       recovery and transformation options (experts only)
s       sort partitions
t       change a partition's type code
v       verify disk
w       write table to disk and exit
x       extra functionality (experts only)
?       print this menu

Command (? for help): p
Disk /dev/sda: 3907029168 sectors, 1.8 TiB
Disk identifier (GUID): 9302D195-5FFC-41F2-949F-2899017A4DC0
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Total free space is 1756 sectors (878.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34      3891402377   1.8 TiB     EF00
   2      3891402378      3907027378   7.5 GiB     8200

Command (? for help): q
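The "Total free space" line is simply the part of the usable range that the two partitions leave uncovered. Here is a small sketch that recomputes it from the table, assuming the partitions don't overlap and that gdisk's start and end sectors are inclusive:

```python
def free_sectors(first_usable, last_usable, partitions):
    """Usable sectors not covered by any partition; (start, end) pairs are inclusive."""
    used = sum(end - start + 1 for start, end in partitions)
    return (last_usable - first_usable + 1) - used


def size_gib(start, end, sector_size=512):
    """Size of an inclusive sector range in GiB."""
    return (end - start + 1) * sector_size / 2**30


sda = [(34, 3891402377), (3891402378, 3907027378)]   # from the table above
free = free_sectors(34, 3907029134, sda)
print(free, "sectors =", free * 512 / 1024, "KiB")   # 1756 sectors = 878.0 KiB
print(round(size_gib(*sda[1]), 1), "GiB")            # 7.5 GiB, the swap partition
```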

Note that the first partition was using code EF00. The following list of type codes (from gdisk’s l command) shows that EF00 is “EFI System”, but I’m not sure what that means.

0700 Linux/Windows data   0c01 Microsoft Reserved   2700 Windows RE
4200 Windows LDM data     4201 Windows LDM metadat  8200 Linux swap
8301 Linux Reserved       8e00 Linux LVM            a500 FreeBSD disklabel
a501 FreeBSD boot         a502 FreeBSD swap         a503 FreeBSD UFS
a504 FreeBSD ZFS          a505 FreeBSD Vinum/RAID   a800 Apple UFS
a901 NetBSD swap          a902 NetBSD FFS           a903 NetBSD LFS
a903 NetBSD RAID          a904 NetBSD concatenated  a905 NetBSD encrypted
ab00 Apple boot           af00 Apple HFS/HFS+       af01 Apple RAID
af02 Apple RAID offline   af03 Apple label          af04 AppleTV recovery
be00 Solaris boot         bf00 Solaris root         bf01 Solaris /usr & Mac
bf02 Solaris swap         bf03 Solaris backup       bf04 Solaris /var
bf05 Solaris /home        bf05 Solaris EFI_ALTSCTR  bf06 Solaris Reserved 1
bf07 Solaris Reserved 2   bf08 Solaris Reserved 3   bf09 Solaris Reserved 4
bf0a Solaris Reserved 5   c001 HP-UX data           c002 HP-UX service
ef00 EFI System           ef01 MBR partition schem  ef02 BIOS boot partition
fd00 Linux RAID
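The four-digit codes are gdisk’s shorthand. What is actually stored on disk for each partition is a 128-bit partition type GUID. EF00 stands for the EFI System Partition type, which UEFI firmware uses to find boot loaders and which normally holds a small FAT file system. In the list above, the entry meant for plain Linux data is 0700. Here is a lookup sketch with a few of the well-known GUIDs; the table and function name are mine, not gdisk’s.

```python
# A few gdisk type codes and the GPT partition type GUIDs they abbreviate.
TYPE_GUIDS = {
    "0700": "EBD0A0A2-B9E5-4433-87C0-68B6B72699C7",  # Linux/Windows data
    "8200": "0657FD6D-A4AB-43C4-84E5-0933C84B4F4F",  # Linux swap
    "8e00": "E6D6D379-F507-44C2-A23C-238F2A3DF928",  # Linux LVM
    "ef00": "C12A7328-F81F-11D2-BA4B-00A0C93EC93B",  # EFI System
    "ef02": "21686148-6449-6E6F-744E-656564454649",  # BIOS boot partition
    "fd00": "A19D880F-05FC-4D3B-A006-743F0F84911E",  # Linux RAID
}


def type_guid(code):
    """GUID for a gdisk hex code such as 'EF00'; case-insensitive, None if not listed here."""
    return TYPE_GUIDS.get(code.lower())


for code in ("EF00", "8200"):
    print(code, type_guid(code))
```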

In any event, I decided to create my new partition as an EFI System partition too, so I did that:

root@orac:~# gdisk /dev/sdb
GPT fdisk (gdisk) version 0.5.1

Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present


***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
THIS OPERATON IS POTENTIALLY DESTRUCTIVE! Exit by typing 'q' if
you don't want to convert your MBR partitions to GPT format!
***************************************************************

Warning! Secondary partition table overlaps the last partition by 33 blocks
You will need to delete this partition or resize it in another utility.

Command (? for help): ?
b       back up GPT data to a file
c       change a partition's name
d       delete a partition
i       show detailed information on a partition
l       list known partition types
n       add a new partition
o       create a new empty GUID partition table (GPT)
p       print the partition table
q       quit without saving changes
r       recovery and transformation options (experts only)
s       sort partitions
t       change a partition's type code
v       verify disk
w       write table to disk and exit
x       extra functionality (experts only)
?       print this menu

Command (? for help): o
This option deletes all partitions and creates a new protective MBR.
Proceed? (Y/N): y

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-3907029134, default = 34) or {+-}size{KMGT}:
Last sector (34-3907029134, default = 3907029134) or {+-}size{KMGT}:
Current type is 'Unused entry'
Hex code (L to show codes, 0 to enter raw code): EF00
Changed system type of partition to 'EFI System'

Command (? for help): l
0700 Linux/Windows data   0c01 Microsoft Reserved   2700 Windows RE
4200 Windows LDM data     4201 Windows LDM metadat  8200 Linux swap
8301 Linux Reserved       8e00 Linux LVM            a500 FreeBSD disklabel
a501 FreeBSD boot         a502 FreeBSD swap         a503 FreeBSD UFS
a504 FreeBSD ZFS          a505 FreeBSD Vinum/RAID   a800 Apple UFS
a901 NetBSD swap          a902 NetBSD FFS           a903 NetBSD LFS
a903 NetBSD RAID          a904 NetBSD concatenated  a905 NetBSD encrypted
ab00 Apple boot           af00 Apple HFS/HFS+       af01 Apple RAID
af02 Apple RAID offline   af03 Apple label          af04 AppleTV recovery
be00 Solaris boot         bf00 Solaris root         bf01 Solaris /usr & Mac
bf02 Solaris swap         bf03 Solaris backup       bf04 Solaris /var
bf05 Solaris /home        bf05 Solaris EFI_ALTSCTR  bf06 Solaris Reserved 1
bf07 Solaris Reserved 2   bf08 Solaris Reserved 3   bf09 Solaris Reserved 4
bf0a Solaris Reserved 5   c001 HP-UX data           c002 HP-UX service
ef00 EFI System           ef01 MBR partition schem  ef02 BIOS boot partition
fd00 Linux RAID

Command (? for help): p
Disk /dev/sdb: 3907029168 sectors, 1.8 TiB
Disk identifier (GUID): 71584326-3AD4-0BD9-A98A-9173A1FCF308
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34      3907029134   1.8 TiB     EF00  EFI System

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
MBR PARTITIONS!! THIS PROGRAM IS BETA QUALITY AT BEST. IF YOU LOSE ALL YOUR
DATA, YOU HAVE ONLY YOURSELF TO BLAME IF YOU ANSWER 'Y' BELOW!

Do you want to proceed, possibly destroying your data? (Y/N) y
OK; writing new GPT partition table.
The operation has completed successfully.

Then I created my new ext4 file system on my new GPT partition:

root@orac:~# mkfs -t ext4 /dev/sdb1
mke2fs 1.41.11 (14-Mar-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
122101760 inodes, 488378637 blocks
24418931 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
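The numbers mke2fs prints follow from the partition size. Here is a quick consistency check that takes the block size, blocks per group and inodes per group from the mke2fs output and the partition bounds from gdisk. It reproduces the figures; it doesn't show how mke2fs chose those parameters.

```python
SECTOR = 512
BLOCK = 4096                 # "Block size=4096" in the mke2fs output
BLOCKS_PER_GROUP = 32768     # "32768 blocks per group"
INODES_PER_GROUP = 8192      # "8192 inodes per group"

start, end = 34, 3907029134  # /dev/sdb1, from gdisk's "p" output
sectors = end - start + 1
blocks = sectors * SECTOR // BLOCK             # a partial block at the end is dropped
groups = -(-blocks // BLOCKS_PER_GROUP)        # round up: the last group may be short
inodes = groups * INODES_PER_GROUP

print(blocks, groups, inodes)   # 488378637 14905 122101760
```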

I also reduced the percentage of blocks reserved for root to 1%:

root@orac:~# tune2fs -m 1 /dev/sdb1
tune2fs 1.41.11 (14-Mar-2010)
Setting reserved blocks percentage to 1% (4883786 blocks)
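Both reserved-block figures are the block count times the percentage, with the fraction dropped. Here is a sketch that assumes truncation toward zero, which matches both numbers shown:

```python
BLOCK_SIZE = 4096
BLOCKS = 488378637          # from the mke2fs output


def reserved_blocks(blocks, percent):
    """Blocks reserved for root at a given percentage, fraction dropped."""
    return int(blocks * percent / 100.0)


print(reserved_blocks(BLOCKS, 5))   # 24418931, what mke2fs reported
print(reserved_blocks(BLOCKS, 1))   # 4883786, what tune2fs reported
freed = reserved_blocks(BLOCKS, 5) - reserved_blocks(BLOCKS, 1)
print(round(freed * BLOCK_SIZE / 2**30, 1), "GiB returned to ordinary use")
```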

I would have liked to set it to 0%, but that’s what I did last time, and I decided to avoid it in case it had somehow contributed to the errors I was getting (I doubt it, but better safe than sorry).

So then I put the following line in my /etc/fstab file:

/dev/sdb1 /mnt/airgap ext4  defaults  0 2
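An fstab entry has six whitespace-separated fields: device, mount point, file system type, mount options, a dump flag and an fsck pass number. The final 2 means the file system is checked at boot after the root file system (which uses 1); 0 would skip the check. Here is a minimal parser sketch. It treats missing trailing fields as 0, which is how fstab handles them, and it does not handle escaped spaces such as \040.

```python
FIELDS = ("device", "mount_point", "fs_type", "options", "dump", "fsck_pass")


def parse_fstab_line(line):
    """Split one /etc/fstab entry into named fields (escaped spaces like \\040 not handled)."""
    parts = line.split()
    if not 4 <= len(parts) <= 6:
        raise ValueError("expected 4 to 6 fields, got %d" % len(parts))
    parts += ["0"] * (6 - len(parts))      # missing dump/pass fields default to 0
    entry = dict(zip(FIELDS, parts))
    entry["dump"] = int(entry["dump"])
    entry["fsck_pass"] = int(entry["fsck_pass"])
    return entry


entry = parse_fstab_line("/dev/sdb1 /mnt/airgap ext4  defaults  0 2")
print(entry["mount_point"], entry["fs_type"], entry["fsck_pass"])   # /mnt/airgap ext4 2
```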

And then I was good to mount my new file system:

root@orac:~# mount /mnt/airgap

I’m in the process of copying about 1.6TB of data onto my newly minted disk, and it seems to be running OK at the moment. I guess it will be about a day or so before I know for sure if any of the above has helped.