Re: [Yaffs] Bad block management

Author: Jacob Dall
Date:  
To: manningc2, yaffs
Subject: Re: [Yaffs] Bad block management
Hello Charles,

Thank you very much for replying - I really appreciate it.

> On Thursday 20 January 2005 23:02, Jacob Dall wrote:
> > Hello yaffers,
> >
> > I've a few questions regarding why yaffs' bad block management is designed
> > the way it is.
> >
> > According to Toshiba, NAND failures can be distinguished as "permanent
> > failures" or "soft errors"
> >
> > 1) Permanent failures: this error occurs when programming or erasing, and
> > can be detected by reading the status register after operation.
> >
> > 2) Soft errors: this error occurs during a program, but can only be
> > detected by reads. The error is cleared by a block erase.
> >
> > Now, upon read, if yaffs detects an unfixable ECC error in a page, the
> > block holding that page is marked as bad. According to 2) it would be ok to
> > just mark the page as discarded and let the garbage collector do its job -
> > or have I missed something?
>
> This mechanism was designed before Toshiba shared their wonderful document
> with the world. I have considered changing this, but it has never been a very
> high priority and it does put data at risk.
>
> The "soft errors" are typically write disturb failures that can (hopefully)
> be fixed by ECC. My concern is that if a block displays write disturb
> problems then perhaps it is "going bad". ECC can only fix single bit errors.
> I don't want to wait until it has "gone bad" and lost data before I retire
> it. I'd prefer to retire dodgy looking blocks earlier.


Actually, having looked at the yaffs1 internals, I think this has already been changed - yaffs_RetireBlock() is only called from yaffs_BlockBecameDirty().
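
To make the two policies under discussion concrete, here is a rough sketch in C. It is only a simplified model, not the yaffs source: the names nand_result, block_action and decide_block_action are made up for illustration, and the error classes simply follow the Toshiba distinction quoted above.

```c
#include <stdbool.h>

/* Hypothetical error classes, following the Toshiba distinction:
 * soft errors show up on read and are cleared by a block erase,
 * permanent failures are reported by the status register. */
typedef enum { NAND_OK, NAND_SOFT_ERROR, NAND_PERMANENT_FAILURE } nand_result;

typedef enum { BLOCK_ERASE_AND_REUSE, BLOCK_RETIRE } block_action;

/* Decide what to do with a block once all of its pages are discarded.
 * worst_seen   - worst error observed on the block so far
 * conservative - true models "retire dodgy looking blocks early" */
block_action decide_block_action(nand_result worst_seen, bool conservative)
{
    if (worst_seen == NAND_PERMANENT_FAILURE)
        return BLOCK_RETIRE;            /* never reuse a block that failed program/erase */
    if (worst_seen == NAND_SOFT_ERROR && conservative)
        return BLOCK_RETIRE;            /* cautious policy: the block may be "going bad" */
    return BLOCK_ERASE_AND_REUSE;       /* erase clears soft errors; keep the block */
}
```

With conservative set, a soft error is enough to retire the block once it becomes dirty; without it, the block is erased and put back into use.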

>
> >
> > In yaffs, a block is marked bad by writing 0 to byte 517 in page 0 / 1 in
> > the block. Why wasn't another value chosen (for instance, SmartMedia's
> > 0xF0)? Then it would have been possible to distinguish initial
> > bad blocks from operational bad blocks.
>
> This was considered. However, I decided to use 0x00 because it has the best
> chance of programming successfully in a block where the bits don't "stick"
> well. A sparse bit pattern is less likely to program than all 0s.
>
> This could be changed quite easily.
>
> Generally the factory-marked bad blocks are not just marked with this byte.
> Mostly the whole OOB area or even the whole block is marked zero. This
> generally makes it easy enough to distinguish factory-marked from YAFFS-marked
> bad blocks.
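
A rough sketch of how that distinction could be checked in C. This is a heuristic under stated assumptions, not yaffs code: it assumes a 512-byte page with a 16-byte spare area, assumes that "byte 517" is counted from 0 (so the status byte is spare byte 5), and the names classify_bad_mark and bad_mark_kind are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPARE_SIZE          16  /* assumed: 512-byte page + 16-byte spare (OOB) */
#define BLOCK_STATUS_OFFSET 5   /* assumed: byte 517 counted from 0 => spare byte 5 */

typedef enum { MARK_NONE, MARK_LIKELY_FACTORY, MARK_LIKELY_RUNTIME } bad_mark_kind;

/* Classify the bad block marker found in one page's spare area.
 * Heuristic only: a fully zeroed spare area suggests a factory mark,
 * a cleared status byte with other bytes intact suggests a runtime mark. */
bad_mark_kind classify_bad_mark(const uint8_t spare[SPARE_SIZE])
{
    size_t i;

    assert(spare != NULL);

    if (spare[BLOCK_STATUS_OFFSET] == 0xFF)
        return MARK_NONE;               /* status byte untouched: not marked */

    for (i = 0; i < SPARE_SIZE; i++) {
        if (spare[i] != 0x00)
            return MARK_LIKELY_RUNTIME; /* only part of the spare is cleared */
    }
    return MARK_LIKELY_FACTORY;         /* whole spare area is zero */
}
```

Since factory marks are only "mostly" written this way, the result should be treated as a hint rather than proof.
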
>
> >
> > I've an issue with some of my devices - the number of bad blocks increases
> > very rapidly. Beyond the fact that it's due to ECC read errors, I have yet
> > to discover the root of the problem.
>
>
> I've done extensive lifetime testing on some devices. One test I did wrote
> approx. 130GB of stuff, then read and verified it with not one ECC failure or
> bit getting munged.
>
> Some other people doing lifetime testing have expressed concern because they
> lose 1-2% of flash during the lifetime of a device.
>
> What do you mean by rapidly? I assume it is far worse than either of these!


Yes, it's far worse. Imagine having a system that, when looked at, has 2 bad blocks. One hour later it has over 500!!
And this in a system that writes approx. 10KB of data every 15 seconds.

>
> If you're using Linux, then the most likely cause of the problem is a
> mismatch between the ECC strategy you're using in YAFFS and what you have
> configured in mtd.


I'm using yaffs1/direct

>
> >
> > I'm not blaming yaffs - I'm sure the problem is to be found elsewhere, but
> > I'm thinking really hard about making those changes to yaffs, so that I can
> > get back to the state the NAND was in when it was first taken into use.
> >
> > Please let me know your reasons / thoughts...
>
> Being able to change the bad block marker would help you with bench testing
> until you have fixed the real problem.
>
> There are two things you could try:
> 1) In yaffs_RetireBlock, change the blockstatus to some easy to detect value
> that has at least two zero bits (eg. 0xFC).
> 2) Or even turn off the writing of bad block markers completely. This would
> cause problems in the file system state, but that probably does not matter
> for you at the moment.
>
> Of course I'm assuming you just want to do these changes while you find and
> fix the real problem. I would not suggest shipping product with either of
> these changes.
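
For bench testing, suggestion 1) might be sketched as below. The requirement for at least two zero bits presumably means that a single flipped bit in an untouched 0xFF status byte cannot be mistaken for a marker; that reading is my assumption. The names BENCH_BAD_MARKER, count_zero_bits, is_marked_bad and is_bench_marker are hypothetical and not part of yaffs.

```c
#include <stdint.h>

#define BENCH_BAD_MARKER 0xFC   /* example value from suggestion 1) above */

/* Count how many of the 8 bits in a status byte are zero. */
int count_zero_bits(uint8_t b)
{
    int n = 0;
    int i;

    for (i = 0; i < 8; i++) {
        if (!(b & (1u << i)))
            n++;
    }
    return n;
}

/* Treat a block as marked only if at least two bits are zero, so that
 * a single bit error in 0xFF (e.g. 0xFE) is not read as a marker. */
int is_marked_bad(uint8_t status)
{
    return count_zero_bits(status) >= 2;
}

/* Recognise the temporary bench-test marker, as opposed to the 0x00 marker. */
int is_bench_marker(uint8_t status)
{
    return status == BENCH_BAD_MARKER;
}
```

Blocks carrying the bench marker could then be erased and returned to service once the real problem is found, while blocks marked 0x00 are left alone.
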
>
> >
> >
> > Thanks and regards,
> > Jacob Dall
> >
> > FYI: the 'According to Toshiba' stuff was taken from a document named 'NAND
> > Flash Application Design Guide'
>
> Great doc. Should be required reading for anyone working with NAND.
>
> >
> >
> > _______________________________________________
> > yaffs mailing list
> >
> > http://stoneboat.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs