Hello Charles,

Thank you very much for replying - I really appreciate it.

> On Thursday 20 January 2005 23:02, Jacob Dall wrote:
> > Hello yaffers,
> >
> > I've a few questions regarding why yaffs' bad block management is designed
> > the way it is.
> >
> > According to Toshiba, NAND failures can be distinguished as "permanent
> > failures" or "soft errors"
> >
> > 1) Permanent failures: this error occurs when programming or erasing, and
> > can be detected by reading the status register after operation.
> >
> > 2) Soft errors: this error occurs during a program, but can only be
> > detected by reads. The error is cleared by a block erase.
> >
> > Now, upon read, if yaffs detects an unfixable ECC error in a page, the
> > block holding that page is marked as bad. According to 2) it would be ok to
> > just mark the page as discarded and let the garbage collector do its job -
> > or have I missed something?
>
> This mechanism was designed before Toshiba shared their wonderful document
> with the world. I have considered changing this, but it has never been a very
> high priority and it does put data at risk.
>
> The "soft errors" are typically write disturb failures that can (hopefully)
> be fixed by ECC. My concern is that if a block displays write disturb
> problems then perhaps it is "going bad". ECC can only fix single bit errors.
> I don't want to wait until it has "gone bad" and lost data before I retire
> it. I'd prefer to retire dodgy looking blocks earlier.

Actually, having looked at the yaffs1 internals, I think it has already been
changed - the RetireBlock() is only called from yaffs_BlockBecameDirty().

> >
> > In yaffs, a block is marked bad by writing 0 to byte 517 in page 0 / 1 in
> > the block. Why wasn't it decided to use another value (for instance, like
> > SmartMedia's 0xF0)? Then it would have been possible to distinguish initial
> > bad blocks from operational bad blocks.
>
> This was considered.
> However I decided to use 0x00 because this would have
> the most likelihood of programming a block where the bits don't "stick" well.
> A sparse bit pattern is less likely to program than all 0s.
>
> This could be changed quite easily.
>
> Generally the factory-marked bad blocks are not just marked with this byte.
> Mostly the whole OOB area or even the whole block is marked zero. This
> generally makes it easy enough to distinguish factory-marked from YAFFS-marked
> bad blocks.
>
> >
> > I've an issue with some of my devices - the number of bad blocks is
> > increasing very rapidly. Beyond the fact that it's due to ECC read errors,
> > I'm yet to discover the root of the problem.
>
> I've done extensive lifetime testing on some devices. One test I did wrote
> approx. 130GB of stuff, read and verified it with not one ECC failure or bit
> getting munged.
>
> Some other people doing lifetime testing have expressed concern because they
> lose 1-2% of flash during the lifetime of a device.
>
> What do you mean by rapidly? I assume it is far worse than either of these!

Yes, it's far worse. Imagine having a system that, when looked at, has 2 bad
blocks. One hour later it has over 500!! And this in a system that writes
approx. 10KB of data every 15 seconds.

>
> If you're using Linux, then the most likely cause of the problem is a
> mismatch between the ECC strategy you're using in YAFFS and what you have
> configured in mtd.

I'm using yaffs1/direct

>
> >
> > I'm not blaming yaffs - I'm sure the problem is to be found elsewhere, but
> > I'm thinking really hard of making those changes to yaffs, making me able
> > to get back to the state when the NAND was first taken into use.
> >
> > Please let me know your reasons / thoughts...
>
> Being able to change the bad block marker would help you with bench testing
> until you have fixed the real problem.
>
> There are two things you could try:
> 1) In yaffs_RetireBlock, change the blockstatus to some easy to detect value
> that has at least two zero bits (e.g. 0xFC).
> 2) Or even turn off the writing of bad block markers completely. This would
> cause problems in the file system state, but that probably does not matter
> for you at the moment.
>
> Of course I'm assuming you just want to do these changes while you find and
> fix the real problem. I would not suggest shipping product with either of
> these changes.
>
> >
> >
> > Thanks and regards,
> > Jacob Dall
> >
> > FYI: the 'According to Toshiba' stuff was taken from a document named 'NAND
> > Flash Application Design Guide'
>
> Great doc. Should be required reading for anyone working with NAND.
>
> >
> >
> > _______________________________________________
> > yaffs mailing list
> > yaffs@stoneboat.aleph1.co.uk
> > http://stoneboat.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
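
For bench testing only, option 1 above could be paired with a small check
that tells the different kinds of marker apart when scanning the spare area.
What follows is a minimal sketch and not yaffs code: classify_block(), the
enum, and the constants are hypothetical names, the 16-byte spare size is an
assumption, and the offset of 5 simply restates "byte 517" as 512 + 5. The
"whole OOB is zero" test follows the description of factory marking above,
which may not hold for every part.

```c
#include <stddef.h>
#include <stdint.h>

#define OOB_SIZE            16   /* assumed: spare bytes per 512-byte page */
#define BLOCK_STATUS_OFFSET 5    /* byte 517 of the page, i.e. 512 + 5 */
#define BENCH_RETIRED_MARK  0xFC /* example value from the thread: two zero bits */

/* Hypothetical classification, for bench testing only. */
enum block_class {
    BLOCK_GOOD,          /* status byte has fewer than two zero bits */
    BLOCK_FACTORY_BAD,   /* entire spare area reads as zero */
    BLOCK_BENCH_RETIRED, /* carries the distinct bench marker */
    BLOCK_OTHER_BAD      /* any other bad marking, e.g. a plain 0x00 byte */
};

/* Count the zero bits in one byte. */
static int count_zero_bits(uint8_t b)
{
    int zeros = 0;
    int i;

    for (i = 0; i < 8; i++) {
        if (!(b & (1u << i)))
            zeros++;
    }
    return zeros;
}

/* Classify a block from the spare (OOB) bytes of its first page. */
static enum block_class classify_block(const uint8_t oob[OOB_SIZE])
{
    uint8_t status = oob[BLOCK_STATUS_OFFSET];
    int all_zero = 1;
    size_t i;

    /* Requiring two zero bits tolerates a single flipped bit in 0xFF. */
    if (count_zero_bits(status) < 2)
        return BLOCK_GOOD;

    for (i = 0; i < OOB_SIZE; i++) {
        if (oob[i] != 0x00)
            all_zero = 0;
    }
    if (all_zero)
        return BLOCK_FACTORY_BAD;

    if (status == BENCH_RETIRED_MARK)
        return BLOCK_BENCH_RETIRED;

    return BLOCK_OTHER_BAD;
}
```

With something like this, blocks that come back as BLOCK_BENCH_RETIRED could
be erased and reused between test runs while factory-marked blocks are left
alone. As said above, this is not something to ship in a product.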