Scott,
On Tuesday 24 March 2009, Wagner Scott (ST-IN/ENG1.1) wrote:
> >> [yaffs] retires the block at erase time
> >
> >I don't know about the real marking of bad blocks. We have
> >actually disabled this in some versions of products where we
> >were bitten by transient write errors causing large number of
> >blocks to be persistently marked bad (OOB) and taken out of
> >service.
>
> ... Meaning that in your experience it's OK to just defer to
> the "write fail" mechanism
In my experience, deferring to the "write fail" mechanism is
better than losing a large portion of the NAND to blocks
incorrectly marked bad. A software reload can fix the former; it
takes a NAND 'repair' utility and crossed fingers to recover
from the latter.
> - if it fails a write to page <n>
> this time, then after erasure either it will fail the write to
> page <n> again, or if the write to page <n> happens to succeed
> then the data in page <n> is reliably OK. Right? (with some
> trepidation)
I don't know. I think it gets down to types of failure and
statistics. The NAND chip will make a judgment as to whether or
not a block is good (successful erase). It does this by cycling
the cells and measuring the charge/voltage on the cells.
Temperature, voltage, timing and noise will all play into the
equation when there is a borderline case. One hopes that the
mechanism ensures that either the cells are functioning
sufficiently well to reliably hold data, or the erase (write?)
fails.
[Side note: on a write, the NAND chip only ensures that zeros
written are zeros on the CHIP -- it is not an error to write a
one to a cell that is zero. So a write 'success' indication
just assures that a subset of the data has been written
successfully. There are config options for both Yaffs and MTD
to perform a read pass to verify a write, but it costs cycles
of course.]
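
To illustrate the side note, here is a toy model in Python. It is
not Yaffs or MTD code; nand_program and verify_readback are names I
made up. It assumes programming can only clear bits, so a cell ends
up as (old AND new), and that the chip's status only checks the
zeros:

```python
def nand_program(page: bytearray, data: bytes) -> bool:
    """Toy model of a page program: bits can only go from 1 to 0."""
    for i, byte in enumerate(data):
        page[i] &= byte  # a 1 in the data leaves the cell untouched
    # Modelled chip status: every 0 we asked for is now 0 in the cell.
    return all((page[i] & ~byte & 0xFF) == 0 for i, byte in enumerate(data))


def verify_readback(page: bytearray, data: bytes) -> bool:
    """Read pass: compare what is in the cells with what was requested."""
    return bytes(page[:len(data)]) == data


page = bytearray([0xFF] * 4)                  # freshly erased: all ones
first_ok = nand_program(page, b"\xF0" * 4)    # program an erased page
second_ok = nand_program(page, b"\x0F" * 4)   # no erase in between
second_verified = verify_readback(page, b"\x0F" * 4)
```

In this model the second program still reports success, although
the page now holds 0x00 rather than 0x0F. Only the read pass
catches it.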
> >> Is this right? If so, it seems OK as long as bad pages
> >> within an eraseblock do not imply unreliability of other
> >> pages within the same eraseblock.
> >
> >The logic around declaring a block truly bad and broken is
> >lacking (both Yaffs and MTD). IIRC, NAND vendors recommend
> >that blocks should be erased when there are write/read
> >errors, and only marked bad if the erase fails, and then
> >perhaps only after several attempts. Neither Yaffs nor MTD
> >does this.
>
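
For what it's worth, the policy quoted above could be sketched
roughly as follows. This is only an illustration in Python, not
what Yaffs or MTD does. The erase callable, the name
after_io_error and the limit of 3 attempts are all assumptions of
mine; "several attempts" is not a number any vendor gave me.

```python
from typing import Callable


def after_io_error(erase: Callable[[], bool], max_attempts: int = 3) -> str:
    """Decide what to do with a block after a read/write error.

    `erase` is a hypothetical callable that erases the block and
    returns True on success. `max_attempts` is an arbitrary choice.
    """
    for _ in range(max_attempts):
        if erase():
            return "reuse"    # erase succeeded: keep the block in service
    return "mark_bad"         # every erase attempt failed: retire it


def flaky_erase(failures: int) -> Callable[[], bool]:
    """Demo helper: an erase that fails `failures` times, then succeeds."""
    remaining = [failures]

    def erase() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return erase
```

With this, a transient failure (say after_io_error(flaky_erase(2)))
keeps the block in service, and only a block whose erase fails on
every attempt gets marked bad.
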
> [and from a later message]
>
> >We'd need the NAND vendors to reveal that, but I think it
> >reasonable to suspect that if a block is improperly erased,
> >any data subsequently written to that block is liable to
> >failure. But if an individual page is bad because of, say,
> >power loss at the time of the write, the other pages within
> >that block would be solid. But this is JUST A GUESS.
>
> OK, the whole concept is a bit scary. But I guess an erase
> fail is more probable in a questionable eraseblock than a
> write fail of a member page before erasure and subsequent
> unreliable write success after erasure.
>
> If this is the case, then we're left with a discussion of how
> aggressive we should be about permanently retiring stuff,
> which is really just a discussion about how quickly the flash
> "wears out". That's not a big issue - but the possibility of
> writing data which later proves to be unreliable is.
This is why the ECC is so important, and why it has been
strengthened in later NAND technology. There are gray areas that
are not easily evaluated.
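
As a rough picture of what ECC buys you, here is a toy
single-error-correcting, double-error-detecting code in Python. It
is a Hamming-style sketch of my own, not the actual Yaffs or MTD
ECC layout; ecc and check_and_fix are made-up names, and it assumes
the stored ECC value itself is never corrupted.

```python
from typing import Tuple


def ecc(data: bytes) -> Tuple[int, int]:
    """Return (XOR of the 1-based positions of all set bits, overall parity)."""
    pos_xor = 0
    parity = 0
    for i in range(len(data) * 8):
        if (data[i // 8] >> (i % 8)) & 1:
            pos_xor ^= i + 1
            parity ^= 1
    return pos_xor, parity


def check_and_fix(data: bytearray, stored: Tuple[int, int]) -> str:
    """Compare fresh ECC with the stored one; repair a single flipped bit."""
    pos_xor, parity = ecc(bytes(data))
    diff = pos_xor ^ stored[0]
    parity_changed = parity != stored[1]
    if diff == 0 and not parity_changed:
        return "ok"
    if parity_changed and 0 < diff <= len(data) * 8:
        bit = diff - 1                    # diff is the flipped bit's position
        data[bit // 8] ^= 1 << (bit % 8)  # flip it back
        return "corrected"
    return "uncorrectable"                # e.g. two bits flipped
```

A single flipped bit changes the position XOR by exactly its own
position and flips the parity, so it can be located and repaired.
Two flipped bits leave the parity unchanged and are reported as
uncorrectable rather than silently "fixed".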
-imcd