[Yaffs] bit error rates

Tue Feb 7 21:04:29 GMT 2006

On Wednesday 08 February 2006 05:58, Richard A. Smith wrote:
> Charles Manning wrote:
>  > The best guide I have read is the Toshiba NAND flash applications design
>  > guide, available at various locations including
>  > http://www.edn.com/contents/images/ToshibaNANDFlash1.pdf
>
> Thank you! that was _exactly_ what I needed.  It answered all my
> questions and even some I had not thought of yet.
>
> > I don't believe that there is any "read disturb". Once written, AFAIK
> > only other writes are likely to mess things up.
>
> Nope.  See page 22 of the doc you pointed me to.
>
> Read Disturb — In this failure mode, a read operation can disturb the
> memory contents causing a “1” to change to
> a “0.” The bit error occurs on another page in the block, not the page
> being read.
>
> It's non-permanent though (an erase will fix it) and _really_ unlikely.
>
> The ROM section of the document discussed that in their testing it was
> 3ppm over 10 years.  So 3 blocks out of every million blocks will have a
> 1 bit error in 10 years.
>
> As you said the program-disturb is more common.  Although still pretty
> rare: 1E-10, or 1 bit per 10 billion.

Thanx for the correction.

All of these failures get handled by ECC, but ECC can only correct 1 bit per
256 bytes.

NAND is getting more and more reliable, IMHO. Most devices will work fine 
(except for the factory marked bad blocks). Some will lose some blocks in the 
first pass or two (marginal blocks missed by the factory check, perhaps), 
then settle down to a long and useful service. Some might take longer to 
settle down or might lose blocks slowly for a long time.

In the 100Gbyte test I mentioned, I saw no blocks go bad. This unit had been
used for a couple of weeks doing thrash tests before this and I expect all
bad blocks had already been marked.

YAFFS takes a fairly cautious approach to handling bad blocks. If a block 
fails an ECC test then it's retired at the next garbage collection of that 
block (i.e. we copy the live data out first).

According to folks I have discussed this with at Toshiba, a block is likely to 
display 1-bit (recoverable) ECC errors long before going truly bad. Therefore 
I think the above strategy should work pretty well.

>
> > YAFFS2 does no rewrites (ie only one write per page and no deletion
> > markers).
>
> Is YAFFS2 ready for production?  I've been looking through the code and
> I see a lot of FIXMEs and TODOs.

YAFFS2 has been used in non-Linux systems for at least a year and has been 
evolving. Some of those evolutions broke a few things, including the deleted 
hardlink handling (fixed in December 2005). 

There was a problem where corrupted tags (due to a bad mtd hook-up) could
cause a crash. That was addressed a few days ago and would only have been 
seen by people with mtd problems.

At the moment the main trauma is hooking up to the mtd. Once this is done, 
YAFFS performs well, IMHO. The yaffs_mtdif2.c code does not work at present 
with a stock kernel and various efforts are underway to fix this. Sergey 
posted a patch which almost works (did not fix the problem on the board I'm 
playing with).

I have a unit running YAFFS2 on Linux on my desk right now. It is running fine 
(read/write/delete/garbage collection/...). It currently has the busybox 
interaction problem (see 
http://stoneboat.aleph1.co.uk/pipermail/yaffs/2005q4/001645.html). I have not 
tried this, and will try to implement a different fix today. I don't consider 
this a "serious flaw" like data loss or crashing, but it is one that should 
be fixed.

Basically, my take on this is that since the beginning of the year there are
no serious YAFFS2 problems apart from hooking up to the mtd. YAFFS Direct 
users don't have this problem.

>
> The no deletion markers is a bit confusing.  I've not yet grokked how
> YAFFS2 does this.  Care to enlighten me?
Most of the method is sketched (briefly) in
http://www.aleph1.co.uk/yaffs/yaffs2.html

YAFFS2 uses the sequence number to determine the flow of time and what has
happened to the fs.  In brief:
* When we write a new chunk the previous one is discarded. We can look at the 
sequence numbers to determine which one is valid and which is not.
* Operations such as resizing a file are handled by writing a new file header 
stating the limits of the file.
* We handle file deletion by moving the file to a fake "deleted directory".
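
As a rough illustration of the first point, picking the valid copy of a
chunk by sequence number could look like the sketch below. It is a
simplification under stated assumptions, not the real YAFFS2 scan code:
the chunk_tag struct and find_current_chunk() are hypothetical, every tag
is assumed to carry its own sequence value, and ties are resolved in
favour of the later array entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down tag: just enough to show the idea. */
struct chunk_tag {
    unsigned object_id;   /* which file/object the chunk belongs to */
    unsigned chunk_id;    /* position of the chunk within the object */
    unsigned sequence;    /* higher value means written later */
};

/* Return the index in tags[] of the current copy of
 * (object_id, chunk_id), or -1 if no copy exists.
 * Older copies are simply ignored; nothing is rewritten to mark them. */
int find_current_chunk(const struct chunk_tag *tags, size_t n,
                       unsigned object_id, unsigned chunk_id)
{
    int best = -1;

    assert(tags != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tags[i].object_id != object_id || tags[i].chunk_id != chunk_id)
            continue;
        /* Assumption: on equal sequence, the later entry is newer. */
        if (best < 0 || tags[i].sequence >= tags[best].sequence)
            best = (int)i;
    }
    return best;
}
```

Because the stale copy is identified purely by comparison, the old page
never has to be written a second time to mark it as deleted.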

>
> > NAND flash seems to be getting more reliable all the time. I did some
> > accelerated lifetime testing where I wrote and verified over 100Gbytes
> > of data without a single bit being damaged.
>
> Good news.  I'll bet the 1e-10 error rate is at the max rated operating
> temp of the part.  So in the normal temp range the error rate is
> probably far lower.

The other factor is that the error rates are also determined using a maximum 
number of writes per page (too many trigger write disturbs). Since YAFFS2 
only ever does one write per page, it is less likely to cause this.
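
To put the quoted 1E-10 figure in perspective, here is a small
back-of-envelope sketch. It assumes, purely for illustration, that the
figure is an independent per-bit error probability applying to every bit
written once; the function names are made up for this example. The
second function uses the leading-term approximation n(n-1)/2 * p^2, which
only holds while n*p is much smaller than 1.

```c
#include <assert.h>

/* Expected number of flipped bits when `bytes` bytes are written once,
 * each bit failing independently with probability `ber`. */
double expected_bit_errors(double bytes, double ber)
{
    assert(bytes >= 0.0 && ber >= 0.0);
    return bytes * 8.0 * ber;
}

/* Approximate probability that a single ECC region of `region_bytes`
 * bytes collects two or more bit errors, i.e. more than a 1-bit ECC can
 * correct. Leading term of the binomial only; valid for n * ber << 1. */
double prob_two_or_more_errors(unsigned region_bytes, double ber)
{
    double n = (double)region_bytes * 8.0;

    assert(ber >= 0.0);
    return n * (n - 1.0) / 2.0 * ber * ber;
}
```

Under those assumptions, expected_bit_errors(100e9, 1e-10) comes to about
80 flipped bits for 100 Gbytes, whereas the test saw none, which fits the
suggestion that the quoted rate is a worst-case number. For one 256-byte
ECC region, prob_two_or_more_errors(256, 1e-10) is on the order of 2E-14,
so a double error inside a single region would be very rare even at the
quoted rate.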

>
> > YAFFS direct is vanilla C and should compile fine for just about
> > anything.
>
> Excellent.  Has anyone actually done it though?
> Hopefully, with no weird niosII compiler bugs or linker problems.

Most definitely. YAFFS Direct is being used by a few people in a wide range of 
applications.

I also use YAFFS Direct as the primary test bed for any yaffs_guts work.

-- Charles