Wednesday, July 13, 2011

Unacknowledged problems with Intel SATA?

For a long time I've had to deal with ongoing 'flakiness' problems with the hard drives in our Dell laptops. Recently, it got so bad that I decided I had to do something about it! The only clue I had was the large number of 'device timeout' errors in the Windows XP Event Viewer. Multiple co-workers had previously told me that these messages were false/cosmetic. They are easy to check for: apply a filter to the System event log for either source = 'atapi' or source = 'iastor'.
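If you would rather check a saved copy of the log than click through the Event Viewer filter on every machine, the same source filter can be sketched in Python. This is only a rough illustration: it assumes the System log was exported to comma-separated text with no header row, and the column order in `COLUMNS` is my assumption, so adjust it to whatever your export actually contains. The names `disk_events` and `count_by_source` and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: column order of the exported System log (no header row).
# Change this list if your export is laid out differently.
COLUMNS = ["Date", "Time", "Source", "Type", "Category", "Event", "User", "Computer"]

# The two event sources mentioned above, compared case-insensitively.
DISK_SOURCES = {"atapi", "iastor"}


def disk_events(csv_text, sources=DISK_SOURCES):
    """Return the rows whose Source is one of the disk-related sources."""
    events = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < len(COLUMNS):
            continue  # skip blank or truncated lines
        record = dict(zip(COLUMNS, row))
        if record["Source"].strip().lower() in sources:
            events.append(record)
    return events


def count_by_source(events):
    """Count matching events per (lower-cased) source name."""
    return Counter(e["Source"].strip().lower() for e in events)


# Made-up sample rows, only to show the shape of the input.
SAMPLE = """6/1/2011,2:14:07 AM,iaStor,Error,None,9,N/A,LAPTOP01
6/1/2011,8:30:12 AM,Service Control Manager,Information,None,7036,N/A,LAPTOP01
6/9/2011,1:55:43 AM,atapi,Error,None,9,N/A,LAPTOP01
"""

if __name__ == "__main__":
    hits = disk_events(SAMPLE)
    for source, n in sorted(count_by_source(hits).items()):
        print(source, n)
```

Matching on the lower-cased source name means 'iaStor' and 'iastor' are treated as the same source, which is the only design decision here worth mentioning.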


The first thing I figured out was that some of the errors were indeed cosmetic, but only on our newer E-series laptops (the E6400 and E6410). They had only ever been set up with the older/generic version of the Intel SATA driver, 'iastor' v8.8. Once I upgraded these machines to the newer v9.6, the errors went away.

I also figured out a relatively reliable way of testing for the error. Without explicitly testing for it, I was seeing the 'timeout' errors every 1-4 weeks under normal usage. If I left a machine logged in and running overnight, however, it almost always reported new errors in the middle of the night.
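To put a number on the "middle of the night" pattern, you could take the timestamps of the timeout events and see what share of them fall in an overnight window. The sketch below assumes the timestamps have already been parsed into `datetime` objects and that "overnight" means midnight to 6 AM; both the window and the sample times are my own placeholders, not measurements from these machines.

```python
from datetime import datetime, time


def is_overnight(ts, start=time(0, 0), end=time(6, 0)):
    """True if the timestamp falls in the [start, end) overnight window."""
    return start <= ts.time() < end


def overnight_share(timestamps):
    """Fraction of events logged during the overnight window (0.0 if none)."""
    timestamps = list(timestamps)
    if not timestamps:
        return 0.0
    hits = sum(1 for ts in timestamps if is_overnight(ts))
    return hits / len(timestamps)


# Placeholder timestamps, only to show how the function is called.
SAMPLE_TIMES = [
    datetime(2011, 6, 1, 2, 14),
    datetime(2011, 6, 1, 3, 40),
    datetime(2011, 6, 2, 14, 5),
    datetime(2011, 6, 9, 1, 55),
]

if __name__ == "__main__":
    print(overnight_share(SAMPLE_TIMES))
```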


Most importantly, I was now able to demonstrate that these errors were NOT cosmetic on any of our Latitude D630 laptops. I spent about a MONTH swapping HDs between different machines, reimaging them, comparing the latest Intel SATA driver against the older/default one, and even re-installing in AHCI vs ATA mode. The results were conclusive: the timeout errors came from the motherboard. The problem also only occurred on the machines I had purchased between January 2008 and October 2008, which really seemed to indicate a bad batch.

Unfortunately, by this point, most of my D630s were out of warranty, most by just a few months. Dell was willing to replace the motherboards on all the in-warranty machines, of course, but I had a real argument with them about the rest. I pointed out that the problem seemed to be a manufacturing defect and that it had taken me a LONG time to prove. They ended up granting me an additional 60-day 'grace' period on the warranties. So 7 of the 13 affected laptops are going to be repaired.


All of this got me thinking: Why do any of these machines show a SATA 'timeout' at all? I heard that the chipset that shipped with the latest 'Sandy Bridge' processors had been recalled because of an obvious defect in the SATA-2 controller. But is there a larger problem with Intel's chipsets? Why does the error occur on E-series laptops if you use an older revision of the Intel driver? Is it really a cosmetic error, or is it a universal problem which the newer version of the driver simply ignores?