Continuing the thread about bit error rates...
In some stress testing on a prototype about a year ago, we detected about
one soft error every day or two. (We were using an early, privately
licensed YAFFS2 with 256 MB Samsung NAND parts having 2kB pages, if I
recall correctly.) We
had a low-level test program running. It wrote one page, read the entire
NAND, wrote another page, read the entire NAND, etc., as the only task, and
therefore at about the maximum data throughput of which our system was
capable, about 1 or 2 MB/second. A quick calculation: we have 86400 seconds
per day, so we had about 100 GB/day of data read, and perhaps 1 MB/day of
data written. We found this of concern, but not alarming, as our typical
usage pattern on a rugged hand-held with a commercial OS would be MUCH
lower. I don't recall that we identified read disturb or write disturb as
the culprit.
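
To make that quick calculation explicit, here is a rough back-of-the-envelope
in C; the 1.5 MB/s throughput and 1.5 days per error are just midpoints of the
ranges above, not measured values:

    /* Back-of-the-envelope check of the figures quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double throughput_mb_s = 1.5;     /* midpoint of the 1-2 MB/s above */
        double seconds_per_day = 86400.0;
        double days_per_error  = 1.5;     /* "one error every day or two"   */

        double read_gb_per_day = throughput_mb_s * seconds_per_day / 1024.0;
        double bits_read_per_error = read_gb_per_day * days_per_error
                                     * 1024.0 * 1024.0 * 1024.0 * 8.0;

        printf("Data read per day       : about %.0f GB\n", read_gb_per_day);
        printf("Observed soft-error rate: roughly 1 bit in %.1e bits read\n",
               bits_read_per_error);
        return 0;
    }

That works out to roughly one flipped bit per 10^12 bits read, which is the
kind of figure one could compare against a vendor's quoted raw bit error rate.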
As far as I can recall (and as far as the engineer who did the testing can
recall), we did NOT see two errors on any one page. That is, each error
that we saw happened on a different page, and did NOT seem to indicate that
we had merely discovered a weak page.
The best option appeared to be to force a garbage collection of the data in
the affected block and then erase that block. Block retirement seemed more
appropriate in cases where a write failed (the program operation did not
correctly change 1s to 0s on a page), where an erase failed (the chip could
not restore 0s to 1s), or where a verification of the data just written
failed. This approach would match the recommendations from Toshiba.
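
For concreteness, here is a rough sketch of that policy in C; the helper
names (gc_block, erase_block, retire_block) are made up for illustration and
are not actual YAFFS2 calls:

    /* Made-up helper names for illustration; not actual YAFFS2 calls. */
    void gc_block(int block);      /* copy live data out of the block     */
    void erase_block(int block);   /* erase and return block to free pool */
    void retire_block(int block);  /* mark the block bad permanently      */

    enum nand_event {
        EVENT_ECC_CORRECTED,   /* soft error fixed by ECC during a read */
        EVENT_WRITE_FAILED,    /* program op did not set all 1s to 0s   */
        EVENT_ERASE_FAILED,    /* erase op did not restore 0s to 1s     */
        EVENT_VERIFY_FAILED    /* read-back after write did not match   */
    };

    void handle_block_event(int block, enum nand_event ev)
    {
        switch (ev) {
        case EVENT_ECC_CORRECTED:
            /* Data is still recoverable: move it, erase the block, and
             * keep the block in service rather than retiring it. */
            gc_block(block);
            erase_block(block);
            break;

        case EVENT_WRITE_FAILED:
        case EVENT_ERASE_FAILED:
        case EVENT_VERIFY_FAILED:
            /* The hardware itself misbehaved: salvage what we can and
             * mark the block bad for good, per the vendor guidance. */
            gc_block(block);
            retire_block(block);
            break;
        }
    }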
New thought: perhaps YAFFS could use two distinct ways of marking a block
bad, one for cases where it detected an ECC error and another for cases
where a write operation failed. If these were distinct from the
manufacturer's initial bad block marking, one could determine which effect
caused which proportion of bad blocks in the system.
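
One purely illustrative encoding would be distinct marker bytes written to
the block's spare area, for example:

    /* Illustrative marker values only; the factory marking convention
     * (a non-0xFF byte, typically 0x00, in the spare area of a bad
     * block) is the chip vendor's, and the other two values here are
     * arbitrary choices for this example. */
    #define BLOCK_GOOD             0xFF   /* erased spare-area byte            */
    #define BLOCK_BAD_FACTORY      0x00   /* marked bad by the manufacturer    */
    #define BLOCK_BAD_ECC          0xE0   /* retired after ECC (read) errors   */
    #define BLOCK_BAD_WRITE_ERASE  0xD0   /* retired after write/erase failure */

Since NAND programming can only clear bits, any of these values could be
written over the erased 0xFF byte later, without an extra erase.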
I will also note that a NAND vendor who paid us a visit at about that same
time said that we should expect WORSE soft error behaviour with succeeding
generations of NAND flash chips. The geometries would get smaller and
smaller, the chip dies would get larger and larger, and the amount of time
for production testing of each chip would not increase, or at least, not
increase as fast as the total storage of a chip. Thus, the testing per page
would only go down in subsequent generations of chips. Taken together, these
two trends seemed to imply that we would see both (1) increased rates of ECC
errors, and
(2) an increase in the number of marginal blocks not marked bad by the chip
vendor.
As an aside, it seems to me that the ECC strategy used by YAFFS2 at present
is inefficient in the use of check bits. It provides SEC-DED (Single Error
Correction - Double Error Detection) within each 256 byte portion of a 2kB
page, independent of the other portions of a page, at a cost of roughly 24
bytes of check data (3 check bytes per 256 data bytes). Unfortunately, I am
not enough of a mathematician to be able to devise an ECC scheme with double
bit correction. Do we have any such person on this list? Perhaps this has
all moved off into the MTD, with which I am completely unfamiliar. The
present ECC scheme may also work well with existing hardware support, which
a new scheme would not be able to use.
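
To put numbers on the "inefficient" remark, here is a small calculation of
the check-bit overhead; the "bound" figures are just the extended Hamming
bound (2^r >= data bits + r + 1, plus one overall parity bit for double
error detection), not a concrete code proposal:

    /* Evaluates the extended-Hamming bound for SEC-DED check bits and
     * compares it with the 3-bytes-per-256-bytes layout described above.
     * This is only a bound calculation, not a code construction. */
    #include <stdio.h>

    static int sec_ded_check_bits(int data_bits)
    {
        int r = 1;
        while ((1 << r) < data_bits + r + 1)
            r++;
        return r + 1;   /* +1 overall parity bit for double error detection */
    }

    int main(void)
    {
        int page_bytes   = 2048;
        int region_bytes = 256;
        int regions      = page_bytes / region_bytes;    /* 8 regions      */
        int used_bytes   = regions * 3;                  /* current scheme */

        int per_region = sec_ded_check_bits(region_bytes * 8);
        int per_page   = sec_ded_check_bits(page_bytes * 8);

        printf("check bytes used per 2kB page     : %d (%.2f%% of data)\n",
               used_bytes, 100.0 * used_bytes / page_bytes);
        printf("SEC-DED bound, per 256-byte region: %d bits x %d = %d bits\n",
               per_region, regions, per_region * regions);
        printf("SEC-DED bound, whole 2kB codeword : %d bits\n", per_page);
        return 0;
    }

By that bound the current layout spends 192 check bits per page where about
104 bits (per region) or 16 bits (one whole-page codeword) would do for
SEC-DED, although a single whole-page codeword could of course correct only
one bit per page rather than one per 256-byte region.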
Another obvious alternative strategy for preventing data loss due to
accumulation of multiple bit errors would be to periodically read the entire
data array, checking for ECC errors. You'd want to calculate the impact
that such reading would have on the rate of appearance of errors, as well as
the impact on system and NAND performance. For a standard file system, it
might suffice to perform one additional data chunk read for every N read
requests, incrementing the "scrub" page each time. This would ensure a
complete read scrub at a fixed percentage overhead. One could also perform
a read scrub every M write operations, if desired.
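
A minimal sketch of the "one scrub read per N reads" bookkeeping, with an
illustrative N and assumed helpers (read_chunk, total_chunks) rather than
existing YAFFS2 code:

    /* Assumed helpers for this sketch; not existing YAFFS2 code. */
    int read_chunk(unsigned chunk, unsigned char *buf); /* normal ECC path */
    unsigned total_chunks(void);

    #define SCRUB_EVERY_N_READS 100   /* roughly 1% extra read traffic */

    static unsigned reads_since_scrub;
    static unsigned next_scrub_chunk;

    /* Call from the normal read path after each ordinary chunk read. */
    void maybe_scrub_one_chunk(void)
    {
        unsigned char buf[2048];

        if (++reads_since_scrub < SCRUB_EVERY_N_READS)
            return;
        reads_since_scrub = 0;

        /* One extra read; a correctable ECC error reported here feeds the
         * same garbage-collect-and-erase handling described earlier. */
        read_chunk(next_scrub_chunk, buf);
        next_scrub_chunk = (next_scrub_chunk + 1) % total_chunks();
    }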
Regards,
William
--
wjw1961@gmail.com
William J. Watson