Why ECC is necessary

February 4, 2010

After spending a few hours trying to debug why a certain program would crash, it turns out that the memory on this particular system is bad.

It’s bad in such a way that Linux boots up, init runs, and most of our startup succeeds, yet the first program with a large memory footprint fails.

Here are some successive runs of md5sum:


Meanwhile, dmesg shows:

ERROR DDR0 ECC: 3 Single bit corrections, 1 Double bit errors
DDR0 ECC:       Failing dimm:   0
DDR0 ECC:       Failing rank:   1
DDR0 ECC:       Failing bank:   1
DDR0 ECC:       Failing row:    0x3123
DDR0 ECC:       Failing column: 0xee0
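
The repeated md5sum check is easy to script. What follows is only a minimal sketch in Python, not the commands used on this system: the helper names `file_digest` and `digests_agree` are made up for illustration, and it assumes that hashing the same unchanged file several times should always produce the same digest. If the digests disagree, something between the disk and the CPU is corrupting data, and memory is a prime suspect.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def digests_agree(path, runs=5):
    """Hash the same file `runs` times (hypothetical helper).

    Returns (all_equal, digests). On healthy hardware all_equal is True.
    """
    digests = [file_digest(path) for _ in range(runs)]
    return len(set(digests)) == 1, digests
```

Calling `digests_agree` on a large file exercises more memory and is more likely to hit a bad cell. A run where every digest matches does not prove the memory is good; it only means this particular pass did not touch a failing location.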
