ali

Posts Tagged ‘hardware’

Why ECC is necessary

In Uncategorized on February 4, 2010 at 11:16 pm

After I spent a few hours trying to debug why a certain program kept crashing, it turned out that the memory on this particular system is bad.

It’s bad in such a way that Linux boots up, init runs, most of our startup succeeds, yet the first program with a large memory footprint fails.

Here are some successive runs of md5sum on the same file, each returning a different checksum:

5b9e3287070f2f117adf77da90a0a5f7 flash-3002-1.sh.gz
2ed0de72c3bcbb7c882588781fbff676 flash-3002-1.sh.gz
292ea7da110c63d0014f2d1f804ed3c8 flash-3002-1.sh.gz
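The same check can be scripted. Below is a minimal sketch in Python, not what was actually run on this system: it hashes one file several times and collects the distinct digests. The function names `md5_of_file` and `distinct_digests` and the default of three runs are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, runs=3):
    """Hash the same file `runs` times and return the set of digests seen.

    On healthy hardware the set has exactly one element. More than one
    element means the bytes changed somewhere between storage and the
    hash computation.
    """
    return {md5_of_file(path) for _ in range(runs)}
```

Calling `distinct_digests("flash-3002-1.sh.gz")` on a healthy machine should return a set with a single digest. A larger set only says that something is corrupting data. It does not say whether the fault is in RAM, the disk, or the bus, which is where the kernel log comes in.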

Meanwhile, dmesg shows:

ERROR DDR0 ECC: 3 Single bit corrections, 1 Double bit errors
DDR0 ECC:       Failing dimm:   0
DDR0 ECC:       Failing rank:   1
DDR0 ECC:       Failing bank:   1
DDR0 ECC:       Failing row:    0x3123
DDR0 ECC:       Failing column: 0xee0
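If reports like this need to be collected from many machines, the lines can be parsed into a structure. The following is a rough sketch that assumes the log lines look exactly like the ones above. That format is specific to this platform's memory controller driver and will differ elsewhere. The name `parse_ecc_report` and the dictionary keys are hypothetical. Under the common SECDED scheme, single-bit errors are corrected while double-bit errors are only detected, so the double-bit count is the one worth alerting on.

```python
import re

# Patterns written against the log lines shown above (an assumption:
# other drivers and platforms format ECC reports differently).
SUMMARY_RE = re.compile(
    r"(DDR\d+) ECC: (\d+) Single bit corrections, (\d+) Double bit errors"
)
FIELD_RE = re.compile(r"(DDR\d+) ECC:\s+Failing (\w+):\s+(\S+)")


def parse_number(text):
    """Parse a decimal value such as '1' or a hex value such as '0x3123'."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text)


def parse_ecc_report(lines):
    """Collect error counts and the failing location from dmesg lines."""
    report = {"location": {}}
    for line in lines:
        summary = SUMMARY_RE.search(line)
        if summary:
            report["controller"] = summary.group(1)
            report["single_bit_corrections"] = int(summary.group(2))
            report["double_bit_errors"] = int(summary.group(3))
            continue
        field = FIELD_RE.search(line)
        if field:
            # e.g. "Failing row:    0x3123" is stored as location["row"]
            report["location"][field.group(2)] = parse_number(field.group(3))
    return report
```

Fed the six lines above, this yields one controller (`DDR0`), three single-bit corrections, one double-bit error, and a location dictionary keyed by `dimm`, `rank`, `bank`, `row`, and `column`.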