Continuing the thread about bit error rates...
In some stress testing on a prototype about a year ago, we detected
about one soft error every day or two. (We were using an early
privately licensed YAFFS2, with 256 MB Samsung 2kB page NAND chips, if I
recall correctly.) We had a low-level test program running.
It wrote one page, read the entire NAND, wrote another page, read the
entire NAND, etc., as the only task, and therefore at about the maximum
data throughput of which our system was capable, about 1 or 2
MB/second. A quick calculation: we have 86400 seconds per day, so
we had about 100
GB/day of data read, and perhaps 1 MB/day of data written. We
found this of concern, but not alarming, as our typical usage pattern
on a rugged hand-held with a commercial OS would be MUCH lower. I
don't recall that we identified read disturb or write disturb as the
culprit.
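
For anyone who wants to re-run the back-of-envelope numbers above,
here is a minimal sketch.  The 1.2 MB/s figure is simply an assumed
value inside the 1-2 MB/s range we measured, and "one page written
per full-array pass" is an idealisation of the test loop:

#include <stdio.h>

int main(void)
{
    const double mb_per_s     = 1.2;      /* assumed sustained read rate, within 1-2 MB/s */
    const double secs_per_day = 86400.0;
    const double array_mb     = 256.0;    /* 256 MB part */
    const double page_kb      = 2.0;      /* 2 kB pages  */

    double read_mb_day    = mb_per_s * secs_per_day;     /* data read per day          */
    double passes_per_day = read_mb_day / array_mb;      /* full-array read passes     */
    double written_kb_day = passes_per_day * page_kb;    /* one page written per pass  */

    printf("read    : %.0f GB/day\n", read_mb_day / 1024.0);
    printf("passes  : %.0f full-array reads/day\n", passes_per_day);
    printf("written : %.2f MB/day\n", written_kb_day / 1024.0);
    return 0;
}

That comes out to roughly 100 GB read and under a megabyte written
per day, which matches what we saw.
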
As far as I can recall (and as far as the engineer who did the testing
can recall), we did NOT see two errors on any one page. That is,
each error that we saw happened on a different page, and did NOT seem
to indicate that we had merely discovered a weak page.
The best option appeared to be to force a garbage collection of the
data in that block and erase the block. Block retirement seemed
more appropriate in cases where a write failed (indicating that a write
operation did not correctly change 1s to 0s on a page write), where an
erase failed (the chip could not restore 0s to 1s), or where a
verification of the data written failed. This approach would
match the recommendations from Toshiba.
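
In pseudo-C, the policy I have in mind looks roughly like the
following.  Every name here is invented purely for illustration and
none of it is existing YAFFS2 code:

#include <stdio.h>

/* Illustrative policy sketch only -- the names are made up, not YAFFS2 API. */
enum nand_event {
    READ_ECC_CORRECTED,   /* soft error corrected by ECC on a read         */
    WRITE_FAILED,         /* program could not change the needed 1s to 0s  */
    ERASE_FAILED,         /* erase could not restore the block to all 1s   */
    VERIFY_FAILED         /* read-back after a write did not match         */
};

static void garbage_collect_block(int b) { printf("GC data out of block %d\n", b); }
static void erase_block(int b)           { printf("erase block %d\n", b); }
static void mark_block_bad(int b)        { printf("retire block %d\n", b); }

static void handle_nand_event(int block, enum nand_event ev)
{
    if (ev == READ_ECC_CORRECTED) {
        /* Data is still recoverable: move it, erase, and reuse the block. */
        garbage_collect_block(block);
        erase_block(block);
    } else {
        /* Write/erase/verify failures point at a genuinely bad block. */
        mark_block_bad(block);
    }
}

int main(void)
{
    handle_nand_event(17, READ_ECC_CORRECTED);
    handle_nand_event(42, WRITE_FAILED);
    return 0;
}
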
New thought: perhaps YAFFS could use two distinct ways of marking a
block bad, one for cases where it detected an ECC error and another for
cases where a write operation failed. If these were distinct from
the manufacturer's initial bad block marking, one could determine
which effect caused which proportion of bad blocks in the system.
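
Something as simple as reserving distinct marker values for the
bad-block byte in the spare area would do.  The values below are made
up purely to illustrate the idea:

/* Hypothetical marker values; not an existing YAFFS2 convention. */
enum bad_block_reason {
    BLOCK_GOOD        = 0xFF,  /* erased spare byte: block usable            */
    BLOCK_BAD_FACTORY = 0x00,  /* left alone: manufacturer's initial marking */
    BLOCK_BAD_ECC     = 0x0F,  /* retired by the file system after ECC error */
    BLOCK_BAD_WRITE   = 0xF0   /* retired after a failed write/erase/verify  */
};

Scanning the spare areas on a returned unit would then tell us what
fraction of its retired blocks came from each cause.
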
I will also note that a NAND vendor who paid us a visit at about that
same time said that we should expect WORSE soft error behaviour with
succeeding generations of NAND flash chips. The geometries would
get smaller and smaller, the chip dies would get larger and larger, and
the amount of time for production testing of each chip would not
increase, or at least, not increase as fast as the total storage of a
chip. Thus, the testing per page would only go down in subsequent
generations of chips. These two statements seemed to say that we
would see both (1) increased rates of ECC errors, and (2) an increase
in the number of marginal blocks not marked bad by the chip vendor.
As an aside, it seems to me that the ECC strategy used by YAFFS2 at
present is inefficient in the use of check bits. It provides
SEC-DED (Single Error Correction - Double Error Detection) within each
256 byte portion of a 2kB page, independent of the other portions of a
page, at a cost of roughly 24 bytes of check data (3 check bytes per
256 data bytes). Unfortunately, I am not enough of a
mathematician to be able to devise an ECC scheme with double bit
correction. Do we have any such person on this list?
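
To put a number on "inefficient": the textbook Hamming bound says a
SEC-DED code over m data bits needs only the smallest r check bits
with 2^r >= m + r + 1, plus one overall parity bit for double-error
detection.  A small sketch to compute that bound (this is only the
information-theoretic minimum, not a proposal for a concrete scheme):

#include <stdio.h>

/* Minimum check bits for a Hamming SEC-DED code over m data bits:
 * the smallest r with 2^r >= m + r + 1, plus one overall parity bit
 * for double-error detection.  (A theoretical bound only; it ignores
 * the layout conveniences of hardware-friendly schemes.)              */
static int secded_check_bits(long m)
{
    int r = 1;
    while ((1L << r) < m + r + 1)
        r++;
    return r + 1;
}

int main(void)
{
    printf("SEC-DED over  256 bytes: %d check bits\n", secded_check_bits(256L * 8));
    printf("SEC-DED over 2048 bytes: %d check bits\n", secded_check_bits(2048L * 8));
    return 0;
}

This prints 13 and 16 check bits respectively, so by that bound a
single SEC-DED codeword over a whole 2kB page would need only about
2 bytes against the roughly 24 used now, although it could then
correct only one bit per page rather than one per 256-byte portion.
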
Perhaps this has all moved off into the MTD layer, with which I am
completely unfamiliar. The present ECC scheme also may work well
with existing hardware support, which a new scheme would not be able to
use.
Another obvious alternative strategy for preventing data loss due to
accumulation of multiple bit errors would be to periodically read the
entire data array, checking for ECC errors. You'd want to
calculate the impact that such reading would have on the rate of
appearance of errors, as well as the impact on system and NAND
performance. For a standard file system, it might suffice to
perform one additional data chunk read for every N read requests,
incrementing the "scrub" page each time. This would ensure a
complete read scrub at a fixed percentage overhead. One could
also perform a read scrub every M write operations, if desired.
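
A minimal sketch of the "1-in-N" scrub counter follows.  The names
are invented, and scrub_read_chunk() is a placeholder rather than any
existing routine:

#include <stdint.h>

#define SCRUB_INTERVAL 64          /* one extra read per 64 ordinary reads: ~1.6% overhead */

static uint32_t reads_since_scrub;
static uint32_t next_scrub_chunk;

/* Placeholder: would issue an ECC-checked read of 'chunk' and schedule
 * garbage collection of the block if a correctable error turned up.    */
static void scrub_read_chunk(uint32_t chunk)
{
    (void)chunk;
}

/* Call on every ordinary chunk read; walks the scrub pointer through the
 * whole array so a complete pass finishes at a fixed fractional cost.    */
void on_chunk_read(uint32_t total_chunks)
{
    if (++reads_since_scrub < SCRUB_INTERVAL)
        return;

    reads_since_scrub = 0;
    scrub_read_chunk(next_scrub_chunk);
    next_scrub_chunk = (next_scrub_chunk + 1) % total_chunks;
}
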
Regards,
William
--
wjw1961@gmail.com
William J. Watson