[Yaffs] YAFFS behavior to soft ECC error confirmed incorrect

Author: Michael Arm
Date:  
To: yaffs
CC: manningc2
Subject: [Yaffs] YAFFS behavior to soft ECC error confirmed incorrect
Greetings all,

There was some discussion a while back as to the behavior of YAFFS2
upon encountering a "soft" (correctable) ECC error. There was also a
call for information from any NAND flash vendors on the list to
clarify the characteristics of the failure modes in NAND flash. I
have just recently been in contact with someone qualified to describe
these failure modes, and have come away with the understanding that
the YAFFS2 policy of retiring blocks on soft-ECC error is incorrect.

Here's why. NAND flash in general has the characteristic that each
read carries a slight chance of what is called a "read disturb".
This is where a single bit flips just from being read, and stays
flipped until the block is erased. The average number of reads before
such a disturb occurs is on the order of 100k. However, when this
error occurs, the correct behavior is to erase the block and re-use
it. Then the block will be good (statistically speaking) for around
another 100k reads. The fact that a soft-ECC error (or even a
multiple-bit error) occurred says nothing about the likelihood of the
sector going bad; it only says that the 200,000(**) sided die rolled
on that read and came up 0.
(**) Figuratively speaking. Maybe it's actually got 202,546 sides, I
don't know.
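
To make the die analogy concrete, here is a minimal sketch in Python.
It assumes each read independently causes a disturb with probability
1/mean_reads, which is my own simplification and not vendor data; the
function names and the 100k default are illustrative only. Under that
assumption, the chance of surviving the next k reads does not depend
on how many reads the block has already survived.

```python
def p_survive(reads: int, mean_reads: float = 100_000) -> float:
    """Probability that `reads` consecutive reads cause no disturb.

    Assumption: every read is an independent roll with disturb
    probability 1 / mean_reads (a simplified model).
    """
    p = 1.0 / mean_reads
    return (1.0 - p) ** reads


def p_disturb_within(reads: int, mean_reads: float = 100_000) -> float:
    """Probability of at least one disturb within `reads` reads."""
    return 1.0 - p_survive(reads, mean_reads)


def p_survive_more(already_read: int, more_reads: int,
                   mean_reads: float = 100_000) -> float:
    """Chance of surviving `more_reads` further reads, given that
    `already_read` reads have passed without a disturb."""
    return (p_survive(already_read + more_reads, mean_reads)
            / p_survive(already_read, mean_reads))
```

In this model p_survive_more(m, k) equals p_survive(k) for any m,
which is the sense in which one soft error tells you nothing about
the block's future.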

However, each time a read-program-erase cycle is performed, the
average number of reads before a soft ECC error occurs decreases.
So after 100k cycles one may only be able to read on average 100 or
1,000 times before an ECC error occurs. This last number I'm just
pulling out of the air, but the point is that, if there are a lot of
read-program-erase cycles, the current YAFFS2 behavior of retiring
the block is no longer conservative. Instead it will greatly
accelerate the degradation of the flash until all the blocks on the
flash are marked as bad, even though the device has a substantial
amount of life left in it.
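
A rough back-of-the-envelope comparison of the two policies, as a
Python sketch. Everything here is hypothetical: the linear wear curve,
the 100k/1,000 endpoints (taken from the made-up numbers above), the
endurance limit, and the simplification that only soft-error handling
costs erase cycles. It only illustrates the shape of the argument and
does not model any real part.

```python
def mean_reads_until_soft_error(cycles: int,
                                fresh: float = 100_000,
                                worn: float = 1_000,
                                endurance: int = 100_000) -> float:
    """Hypothetical wear curve: mean reads before a soft ECC error,
    falling linearly from `fresh` at 0 cycles to `worn` at `endurance`."""
    frac = min(cycles, endurance) / endurance
    return fresh + (worn - fresh) * frac


def expected_reads_retire(blocks: int, cycles: int) -> float:
    """Retire-on-soft-error: each block serves one run of reads and is
    marked bad at its first soft error."""
    return blocks * mean_reads_until_soft_error(cycles)


def expected_reads_refresh(blocks: int, cycles: int,
                           endurance: int = 100_000) -> float:
    """Erase-and-reuse: each soft error costs one more erase cycle, and
    the block keeps serving reads until it reaches `endurance` cycles."""
    per_block = sum(
        mean_reads_until_soft_error(c, endurance=endurance)
        for c in range(cycles, endurance)
    )
    return blocks * per_block
```

For example, comparing expected_reads_retire(1024, 50_000) with
expected_reads_refresh(1024, 50_000) shows how much read capacity the
retire policy leaves on the table under these assumed numbers.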

This behavior is not going to improve with succeeding generations of
flash; in fact it is definitely going to get worse as die sizes
shrink. Therefore I propose that YAFFS2 adopt a policy of copying a
block to a new location whenever a soft ECC error occurs, and then
marking the original block for GC so that it is erased and re-used
rather than retired.
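
The proposed policy could be sketched roughly as follows. This is a
toy in-memory model in Python, not YAFFS code: Block,
find_free_block, on_soft_ecc_error and garbage_collect are all names
I made up for illustration, and a real implementation would have to
deal with chunk-level copying, checkpoints and so on.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    pages: list = field(default_factory=list)  # live data in the block
    erase_count: int = 0
    needs_erase: bool = False                  # queued for GC, not retired


def find_free_block(blocks: list, exclude: int) -> int:
    """Return the index of an empty block that is not awaiting erase."""
    for i, b in enumerate(blocks):
        if i != exclude and not b.needs_erase and not b.pages:
            return i
    raise RuntimeError("no free block available")


def on_soft_ecc_error(blocks: list, index: int) -> int:
    """Proposed handling of a correctable ECC error on block `index`:
    copy the (ECC-corrected) data elsewhere and mark the old block
    for GC instead of retiring it."""
    target = find_free_block(blocks, exclude=index)
    blocks[target].pages = list(blocks[index].pages)
    blocks[index].needs_erase = True
    return target


def garbage_collect(blocks: list) -> None:
    """Erase every block queued for GC so it can be re-used."""
    for b in blocks:
        if b.needs_erase:
            b.pages = []
            b.needs_erase = False
            b.erase_count += 1
```

The key difference from the current behavior is that the block comes
back into the free pool after the erase, at the cost of one more
erase cycle, instead of being added to the bad block list.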

Best Regards,

Michael