Continuing the thread about bit error rates...

In some stress testing on a prototype about a year ago, we detected about one soft error every day or two.  (We were using an early private-license YAFFS2, with 256 MB Samsung NAND chips with 2 kB pages, if I recall correctly.)  We had a low-level test program running.  It wrote one page, read the entire NAND, wrote another page, read the entire NAND, and so on, as the only task, and therefore at roughly the maximum data throughput our system was capable of, about 1 or 2 MB/second.  A quick calculation: at 86400 seconds per day, that comes to about 100 GB of data read per day, and perhaps 1 MB of data written per day.  We found this of concern, but not alarming, as our typical usage pattern on a rugged hand-held with a commercial OS would be MUCH lower.  I don't recall that we identified read disturb or write disturb as the culprit.
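For anyone checking the arithmetic, here is the back-of-envelope calculation, assuming a sustained read rate of 1.2 MB/s (the real figure was somewhere between 1 and 2 MB/s):

#include <stdio.h>

int main(void)
{
    /* Assumed sustained read throughput from the stress test (MB/s). */
    const double read_mb_per_sec = 1.2;
    const double seconds_per_day = 86400.0;

    double read_mb_per_day = read_mb_per_sec * seconds_per_day;

    printf("Read volume: ~%.0f MB/day (~%.0f GB/day)\n",
           read_mb_per_day, read_mb_per_day / 1024.0);

    /* One 2 kB page written per full pass over a 256 MB array works
     * out to roughly 1 MB/day of writes, as stated above. */
    return 0;
}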

As far as I can recall (and as far as the engineer who did the testing can recall), we did NOT see two errors on any one page.  That is, each error that we saw happened on a different page, and did NOT seem to indicate that we had merely discovered a weak page.

For a page where ECC detected (and corrected) an error, the best option appeared to be to force a garbage collection of the data in that block and then erase the block.  Block retirement seemed more appropriate in cases where a write failed (the chip did not correctly change 1s to 0s when programming a page), where an erase failed (the chip could not restore 0s to 1s), or where verification of the data just written failed.  This approach would match the recommendations from Toshiba.
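Purely to make the policy concrete, here is a sketch of the decision logic.  The helper names (gc_and_erase_block, retire_block) are hypothetical, not actual YAFFS2 entry points:

/* Sketch of the error-handling policy described above. */

enum nand_event {
    EVENT_ECC_CORRECTED,   /* soft error fixed by ECC on a read        */
    EVENT_WRITE_FAILED,    /* program operation could not set 1s to 0s */
    EVENT_ERASE_FAILED,    /* erase could not restore 0s to 1s         */
    EVENT_VERIFY_FAILED    /* read-back after write did not match      */
};

void gc_and_erase_block(int block);   /* hypothetical helpers */
void retire_block(int block);

void handle_nand_event(int block, enum nand_event ev)
{
    switch (ev) {
    case EVENT_ECC_CORRECTED:
        /* Data is still intact: copy it elsewhere, then reclaim the block. */
        gc_and_erase_block(block);
        break;
    case EVENT_WRITE_FAILED:
    case EVENT_ERASE_FAILED:
    case EVENT_VERIFY_FAILED:
        /* The block itself is suspect: mark it bad and stop using it. */
        retire_block(block);
        break;
    }
}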


New thought: perhaps YAFFS could use two distinct ways of marking a block bad, one for cases where it detected an ECC error and another for cases where a write operation failed.  If these were distinct from the manufacturer's initial bad block marking, one could determine what proportion of the bad blocks in a system was caused by each effect.
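If it helps to picture it, one way such a scheme might look is below.  The marker values, the idea of writing a "reason" code into the spare area, and the tally structure are purely illustrative; none of this is part of YAFFS2 today:

/* Illustrative only: distinct "reason" codes for runtime block
 * retirement.  A factory-bad block keeps whatever marker the
 * manufacturer wrote; a later survey of the spare areas could then
 * count how many blocks died from each cause. */

#define BAD_BLOCK_ECC_ERROR   0x59  /* retired after an ECC error           */
#define BAD_BLOCK_WRITE_FAIL  0x5A  /* retired after a failed program/erase */

struct bad_block_stats {
    int factory;
    int ecc;
    int write_fail;
};

void tally(unsigned char marker, struct bad_block_stats *s)
{
    if (marker == BAD_BLOCK_ECC_ERROR)
        s->ecc++;
    else if (marker == BAD_BLOCK_WRITE_FAIL)
        s->write_fail++;
    else
        s->factory++;   /* anything else assumed to be a factory marking */
}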


I will also note that a NAND vendor who paid us a visit at about that same time said that we should expect WORSE soft error behaviour with succeeding generations of NAND flash chips.  The geometries would get smaller and smaller, the chip dies would get larger and larger, and the amount of time for production testing of each chip would not increase, or at least not increase as fast as the total storage of a chip, so the testing per page would only go down in subsequent generations.  Taken together, these points seemed to say that we would see both (1) increased rates of ECC errors, and (2) an increase in the number of marginal blocks not marked bad by the chip vendor.


As an aside, it seems to me that the ECC strategy used by YAFFS2 at present is inefficient in its use of check bits.  It provides SEC-DED (Single Error Correction - Double Error Detection) within each 256 byte portion of a 2 kB page, independently of the other portions of the page, at a cost of roughly 24 bytes of check data per page (3 check bytes per 256 data bytes).  Unfortunately, I am not enough of a mathematician to devise an ECC scheme with double-bit correction.  Do we have any such person on this list?  Perhaps this has all moved off into MTD, with which I am completely unfamiliar.  The present ECC scheme also may work well with existing hardware support, which a new scheme would not be able to use.
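To put numbers on the overhead (arithmetic only; the Hamming code itself is omitted), the layout works out as follows:

#include <stdio.h>

/* Check-bit overhead of the current scheme: 3 ECC bytes per 256-byte
 * portion of the data area, computed independently for each portion. */
int main(void)
{
    const int page_data_bytes = 2048;   /* 2 kB page                   */
    const int portion_bytes   = 256;    /* ECC granularity             */
    const int ecc_per_portion = 3;      /* SEC-DED Hamming check bytes */

    int portions  = page_data_bytes / portion_bytes;   /* 8  */
    int ecc_bytes = portions * ecc_per_portion;        /* 24 */

    printf("%d portions/page, %d ECC bytes/page (%.1f%% overhead)\n",
           portions, ecc_bytes, 100.0 * ecc_bytes / page_data_bytes);
    return 0;
}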


Another obvious alternative strategy for preventing data loss due to the accumulation of multiple bit errors would be to periodically read the entire data array, checking for ECC errors.  You'd want to calculate the impact that such reading would itself have on the rate of appearance of errors, as well as the impact on system and NAND performance.  For a standard file system, it might suffice to perform one additional data chunk read for every N read requests, incrementing the "scrub" page each time.  This would ensure a complete read scrub at a fixed percentage overhead.  One could also perform a read scrub every M write operations, if desired.
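A minimal sketch of what such piggy-backed scrubbing might look like, assuming hypothetical hooks (read_chunk_and_check_ecc, total_chunks, on_read, on_write) rather than anything in the present code:

/* Illustrative read-scrub scheduler: every Nth ordinary read (and,
 * optionally, every Mth write) also reads one extra chunk and checks
 * its ECC, walking the whole array at a fixed percentage overhead. */

#define SCRUB_EVERY_N_READS   100
#define SCRUB_EVERY_M_WRITES  50

extern int  total_chunks;                       /* chunks in the array */
extern void read_chunk_and_check_ecc(int chunk);

static int scrub_chunk;   /* next chunk to scrub */
static int read_count;
static int write_count;

static void scrub_one_chunk(void)
{
    read_chunk_and_check_ecc(scrub_chunk);
    scrub_chunk = (scrub_chunk + 1) % total_chunks;
}

void on_read(void)                    /* call on every ordinary read  */
{
    if (++read_count >= SCRUB_EVERY_N_READS) {
        read_count = 0;
        scrub_one_chunk();
    }
}

void on_write(void)                   /* call on every ordinary write */
{
    if (++write_count >= SCRUB_EVERY_M_WRITES) {
        write_count = 0;
        scrub_one_chunk();
    }
}

With N = 100 that is a 1% read overhead, and a 256 MB part (131072 chunks of 2 kB) would be completely scrubbed after roughly 13 million ordinary reads.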

Regards,

William

--
wjw1961@gmail.com
William J. Watson