[Yaffs] bit error rates

Fri Feb 10 17:45:26 GMT 2006

>> Does this mean that if there is a single bit error corrected by the ECC
>> algo, YAFFS retires the block and marks it as BAD?
>
> Yes.
>> "Although random bit errors may occur during use,
>> this does not necessarily mean that a block is bad.
>> Generally, a block should be marked as bad only if
>> there is a program or erase failure"
>
> That is true. This is what they recommend.

I am surprised that you are not following their recommendations.  I
would imagine that second-guessing the engineers at Toshiba might not
produce desirable results. :-)

> YAFFS is being more cautious than that, which means that in theory YAFFS will
> lose blocks faster than they recommend. However from accelerated lifetime
> testing I've done, I have not seen this to be a practical problem since ECC
> errors are so rare once the problematic blocks have been removed.

I hope that when you say "lifetime testing" you are referring to
continuous read access to the device across a variety of environmental
conditions including temperature, voltage, and background radiation...
and even then I'm not sure how you could claim "accelerated", since
some customer applications access the flash pretty much continuously
anyway.

Anyway, the implication of retiring blocks on corrected read ECC
errors is that a device can accumulate a lot of retired blocks doing
nothing but reading, without ever hitting an unrecoverable error.
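A minimal sketch of the two retirement policies being compared, to make the read-only consequence concrete. All names here (OpResult, RetirePolicy, should_retire, count_retired) are hypothetical and are not YAFFS code. It also assumes that the cautious policy retires a block on any ECC event, corrected or not, while the datasheet policy retires only on a program or erase failure, as in the quoted recommendation. Run against a made-up trace that contains only reads, the cautious policy retires blocks even though nothing was ever programmed or erased.

```python
from enum import Enum, auto


class OpResult(Enum):
    """Hypothetical outcomes of one NAND operation on a block."""
    OK = auto()               # no ECC activity
    ECC_CORRECTED = auto()    # read succeeded after ECC repaired a bit error
    ECC_UNCORRECTED = auto()  # read failed, ECC could not repair the data
    PROGRAM_FAILED = auto()   # chip reported a program failure
    ERASE_FAILED = auto()     # chip reported an erase failure


class RetirePolicy(Enum):
    CAUTIOUS = auto()   # assumed: retire on any ECC event as well as program/erase failure
    DATASHEET = auto()  # retire only on program/erase failure


def should_retire(policy: RetirePolicy, result: OpResult) -> bool:
    """Return True if the block should be marked bad under the given policy."""
    if result in (OpResult.PROGRAM_FAILED, OpResult.ERASE_FAILED):
        return True  # both policies agree on these
    if result in (OpResult.ECC_CORRECTED, OpResult.ECC_UNCORRECTED):
        return policy is RetirePolicy.CAUTIOUS
    return False


def count_retired(policy: RetirePolicy, read_trace: dict) -> int:
    """read_trace maps block number -> result of reading that block."""
    return sum(1 for r in read_trace.values() if should_retire(policy, r))


# Made-up read-only workload: no program or erase happens at all.
reads = {0: OpResult.OK, 1: OpResult.ECC_CORRECTED, 2: OpResult.OK,
         3: OpResult.ECC_CORRECTED, 4: OpResult.OK, 5: OpResult.ECC_CORRECTED}
print(count_retired(RetirePolicy.CAUTIOUS, reads))   # 3
print(count_retired(RetirePolicy.DATASHEET, reads))  # 0
```

Under these assumptions, switching policies is a one-line change in should_retire; the open question in the thread is whether doing so is safe, not whether it is easy.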

>
> It also means that in theory YAFFS is likely to be more secure than something

By "secure" I hope you really mean "reliable", that is, less likely to
lose data?  I doubt retiring blocks early would protect the user's
data against unauthorized access, which is the image that first came
into my head when I read this.

> designed the way they recommend. I would be concerned that by the time you
> start getting programming errors you might be exposing yourself to data loss.

It sounds like you are trying to predict future unrecoverable failures
based on past recoverable errors.  I don't see anything in the
datasheet that suggests that there is any correlation.  Lacking that
correlation, by the time a read disturb occurs, any potential data
loss has already happened.

>
> Most of the ECC handling etc. for NAND was designed in the old days (256-byte
> pages etc) when NAND was quite flaky and 1-bit errors were relatively common.
> These days NAND is far better and far less likely to give ECC errors except
> on a few "soft" blocks.

Let's just hope that we don't run into any batches (at the bargain
bin?) which are still within published tolerances but happen to have a
lot of harmless read disturbs...
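For readers less familiar with what a "corrected 1-bit error" means, here is a minimal sketch of a single-error-correcting, double-error-detecting check over a data block. It is a simplified illustration of the general idea only: it is not the ECC layout used by real NAND drivers or by YAFFS, and it assumes the stored ECC value itself is never corrupted. The hypothetical ecc_compute stores the XOR of the positions of all 1 bits plus an overall parity bit; a single flipped bit changes the parity and makes the XOR differ by exactly that bit's position.

```python
def ecc_compute(data: bytes) -> tuple:
    """Return (syndrome, parity) for a block of data.

    syndrome: XOR of the bit positions of every 1 bit.
    parity:   overall parity of all data bits.
    """
    syndrome = 0
    parity = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                syndrome ^= byte_index * 8 + bit
                parity ^= 1
    return syndrome, parity


def ecc_correct(data: bytearray, stored: tuple) -> str:
    """Compare stored ECC with a fresh one and fix a single flipped bit in place.

    Returns "ok", "corrected", or "uncorrectable".
    """
    stored_syndrome, stored_parity = stored
    syndrome, parity = ecc_compute(data)
    diff = syndrome ^ stored_syndrome
    if parity == stored_parity:
        # Even number of flips: zero is treated as clean, two are detected but not fixable.
        return "ok" if diff == 0 else "uncorrectable"
    # Odd number of flips: assume exactly one, located at bit position `diff`.
    byte_index, bit = divmod(diff, 8)
    if byte_index >= len(data):
        return "uncorrectable"
    data[byte_index] ^= 1 << bit
    return "corrected"


page = bytearray(range(256))           # a 256-byte block, as in the older parts mentioned above
ecc = ecc_compute(page)
page[10] ^= 0x04                       # simulate a single bit flip (e.g. a read disturb)
print(ecc_correct(page, ecc))          # corrected
print(page == bytearray(range(256)))   # True
```

The data comes back intact in the "corrected" case, which is exactly the event the thread is arguing about: whether such a read should also cost the block.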

>
> It would be quite simple to change the retirement policy, but I'd like to see
> evidence that it is safe to do so first.

I would say the recommendation of Toshiba in their datasheet would be
sufficient evidence.

>
> I believe there are some Toshiba guys on the list. I'd like to hear their
> opinions on or off list.
>

Seconded...