>> I have compiled in the latest version of YAFFS as of what was in the public
>> repository Aug 6th and loaded it on my board.  I've been testing the
>> filesystem by untarring an archive and repeatedly copying the directory to
>> new directories, removing some of them, and then continuing copying. 
>> Things seemed to be working fine.  When I went to remove everything from
>> NAND and give the board to one of the other developers I got some more bad
>> messages from YAFFS.
>>
> Did you do any reboots during the above? Rebooting forces rescanning which 
> will test the ECC.

No, I didn't.  But I have rebooted and run nanddump, both in the past and today, 
and there are normally no ECC errors.  However, when I got the NAND and YAFFS2 
into a bad state today and then rebooted, there were a lot of ECC errors just 
from the scan during boot.

> 
>> /mnt/nand # \rm -rf *
>> **>> Erasure failed 3515
>> **>> Block 3515 retired
>> Block 3515 is in state 9 after gc, should be erased
>> **>> Erasure failed 4201
>> **>> Block 4201 retired
>> Block 4201 is in state 9 after gc, should be erased
>> **>> Erasure failed 8045
>> **>> Block 8045 retired
>> Block 8045 is in state 9 after gc, should be erased
> 
> 
> It looks like the erasure command failed. Try instrumenting the erase function 
> in the mtd.

I have instrumented this function and debugged it using the logic analyzer. 
The NAND device is returning 0xE1 as its status after erasing this block, so it 
was indeed an error according to the part.  I'm not sure why the bad block 
wasn't marked successfully though.  When I got the part into a state with a few 
of these bad blocks, even mtd-utils flash_erase and flash_eraseall would 
complain about an I/O error on that block.  mtd-utils doesn't mark blocks as bad 
when an erase fails, and eventually the block was successfully erased.  That 
isn't a good thing though: I don't know which block it was, and although it's 
writing and erasing okay at the moment, who knows when it will fail again.
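
For reference, here is a minimal sketch of how that status byte decodes.  It 
assumes the usual NAND status register bit assignments (bit 0 = fail, bit 6 = 
ready, bit 7 = not write protected); the macro and function names are made up 
for this sketch, and the part's datasheet is the authority.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed status register bits; names are local to this sketch. */
#define ST_FAIL    0x01  /* last program/erase operation failed */
#define ST_READY   0x40  /* device is ready for a new command */
#define ST_NOT_WP  0x80  /* set when the device is NOT write protected */

/* True if the last program/erase operation reported a failure. */
static bool nand_status_failed(uint8_t status)
{
    return (status & ST_FAIL) != 0;
}

/* True if the device reports ready. */
static bool nand_status_ready(uint8_t status)
{
    return (status & ST_READY) != 0;
}

/* True if the device reports that it is write protected. */
static bool nand_status_write_protected(uint8_t status)
{
    return (status & ST_NOT_WP) == 0;
}
```

Under those assumptions 0xE1 reads as ready, not write protected, and failed, 
whereas a clean erase would be expected to return 0xE0.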

> Was it just those few blocks kicking up a problem or was there a whole slew of 
> them? If it was just a few then those might be real bad blocks and then the 
> above is OK.

From my understanding though, NAND parts should not be throwing up bad blocks 
on a regular basis.  My experience is that most have a few blocks marked bad 
from the factory and then a device might throw up a few more on a first pass 
erase.  From there on, a block going bad is a rare event, not something I'd be 
able to cause on a daily basis with a filesystem like YAFFS that has wear 
leveling.

> From the /proc/yaffs it looks like many erasures worked and only a few 
> failed. That indicates that the mtd did not tell yaffs these were bad blocks.

Agreed.  The blocks that got marked bad were not detected by MTD as bad blocks, 
and were not passed to YAFFS when it was requesting a list of bad blocks.

> I would inspect the bad block marking and ID strategy and make sure it is 
> working OK.

I've gone over it.  I'm using Linux's standard bad block marking method.  The 
first two OOB bytes of the first two pages in a block have to be 0xFF for it to 
be considered a good block.  When a block is to be marked bad, the mtd driver 
attempts to write these 4 bytes to 0x00.  As long as at least 1 bit out of those 
4 bytes turns to a zero, the block should be detected as bad on the next boot 
during the bad block scan.  I do need to figure out why a block isn't being 
marked bad when YAFFS makes that request.  I have added trace code into 
nand_default_block_markbad in MTD to verify that it is executing properly.
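
The check described above can be sketched roughly as follows.  This is only an 
illustration of the rule as I stated it, not the actual MTD scan code; 
block_is_good and the MARKER_BYTES constant are names invented for the sketch, 
and the caller is assumed to have already read the OOB of pages 0 and 1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MARKER_BYTES 2  /* marker occupies OOB bytes 0-1 of each page */

/*
 * page0_oob and page1_oob point at the OOB bytes read from the first two
 * pages of a block.  The block counts as good only if every marker byte
 * is still 0xFF, so a single cleared bit is enough to flag it as bad.
 */
static bool block_is_good(const uint8_t *page0_oob, const uint8_t *page1_oob)
{
    for (size_t i = 0; i < MARKER_BYTES; i++) {
        if (page0_oob[i] != 0xFF || page1_oob[i] != 0xFF)
            return false;
    }
    return true;
}
```

With this rule, a mark-bad write that only partly succeeds (say one marker 
byte reads back as 0xFE) would still cause the block to be skipped on the next 
scan.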

> Many ECC errors suggest that your mtd is trying to use the same oob bytes for 
> both data and ECC and/or bad block markers
> 
> When yaffs reads/writes spare bytes it just passes a contiguous buffer (say 
> yyyyyyyy)
> 
> Now let's say the mtd is using 6 bytes for ECC (shown as e) and 2 bytes for the 
> bad block table shown by b
> 
> The actual oob placement might end up being
>  bbyyyyyeeeyyyyyee
> or maybe 
> bbeeeeeeyyyyyyyyy
> or whatever
> and it is the job of the mtd to sort this out.

I don't think I have any issues with conflicting bytes in the OOB area.  This 
is a breakdown of the 128 bytes:

0-1: Bad block marker
2-79: Unused (YAFFS has its 28 byte tag information inserted here)
80-127: ECC

My ecclayout is defined as:

static struct nand_ecclayout nand_oob_128 = {
        /* 48 ECC bytes at the end of the OOB (bytes 80-127) */
        .eccbytes = 48,
        .eccpos = {
                 80,  81,  82,  83,  84,  85,  86,  87,
                 88,  89,  90,  91,  92,  93,  94,  95,
                 96,  97,  98,  99, 100, 101, 102, 103,
                104, 105, 106, 107, 108, 109, 110, 111,
                112, 113, 114, 115, 116, 117, 118, 119,
                120, 121, 122, 123, 124, 125, 126, 127
        },
        /* bytes 2-79 are free for the filesystem; 0-1 hold the marker */
        .oobfree = {
                { .offset = 2, .length = 78 }
        }
};
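
As a sanity check on that layout, here is a small standalone sketch that 
verifies the three regions (marker, free area, ECC) fit in 128 bytes and never 
share a byte.  struct oob_layout is a simplified stand-in I made up for the 
sketch, not the kernel's nand_ecclayout, and it assumes each region is one 
contiguous range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OOB_SIZE 128

/* Simplified, hypothetical description of an OOB layout. */
struct oob_layout {
    size_t bbm_len;                 /* bad block marker, starts at byte 0 */
    size_t free_start, free_len;    /* area handed to the filesystem */
    size_t ecc_start, ecc_len;      /* area used for ECC */
};

/* Mark [start, start+len) as used; fail on overflow or double use. */
static bool claim(uint8_t *used, size_t start, size_t len)
{
    if (start > OOB_SIZE || len > OOB_SIZE - start)
        return false;
    for (size_t i = start; i < start + len; i++) {
        if (used[i])
            return false;
        used[i] = 1;
    }
    return true;
}

/* True if marker, free area and ECC fit in the OOB without overlapping. */
static bool layout_is_disjoint(const struct oob_layout *l)
{
    uint8_t used[OOB_SIZE] = {0};

    return claim(used, 0, l->bbm_len) &&
           claim(used, l->free_start, l->free_len) &&
           claim(used, l->ecc_start, l->ecc_len);
}
```

Fed with the values from my layout (marker 2 bytes, free area at 2 for 78 
bytes, ECC at 80 for 48 bytes) this kind of check passes; stretching the free 
area to 80 bytes would make it collide with the first ECC bytes.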

This means any OOB data passed to MTD from YAFFS should end up in bytes 2-79, as 
I mentioned above.  I put in some trace code to confirm this:

nandmtd2_WriteChunkWithTagsToNAND
calling write_oob
         4096 ib bytes
         28 oob bytes
yaffs_oob:
         d8 6a 00 00 02 01 00 10 01 00 00 80 00 00 00 00
         26 00 00 00 05 00 00 00 fa ff ff ff
calling nand_do_write_ops
calling nand_fill_oob
inserting offset by 2 bytes
OOB:
         ff ff d8 6a 00 00 02 01 00 10 01 00 00 80 00 00
         00 00 26 00 00 00 05 00 00 00 fa ff ff ff ff ff
         ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
         ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
         ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
         aa aa 67 aa 59 6b ff ff ff ff ff ff ff ff ff ff
         ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
         ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
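
The placement seen in that trace can be modelled with the short sketch below. 
It is not the kernel's nand_fill_oob, just an illustration assuming a single 
free region at offset 2 with length 78; place_tags and the constants are names 
invented for the sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OOB_SIZE     128
#define FREE_OFFSET  2    /* first free byte after the bad block marker */
#define FREE_LENGTH  78   /* free bytes before the ECC area starts at 80 */

/*
 * Build an OOB image: everything starts as 0xFF (erased), then the
 * filesystem tags are copied into the free area.  Returns 0 on success
 * or -1 if the tags do not fit in the free area.
 */
static int place_tags(uint8_t oob[OOB_SIZE], const uint8_t *tags, size_t len)
{
    if (len > FREE_LENGTH)
        return -1;
    memset(oob, 0xFF, OOB_SIZE);
    memcpy(oob + FREE_OFFSET, tags, len);
    return 0;
}
```

With 28 tag bytes this fills OOB bytes 2-29 and leaves the marker bytes 0-1 at 
0xFF, which matches the first two rows of the dump above.  The ECC bytes are 
left out of the sketch since they are written separately.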

I haven't seen YAFFS2 pass more than 28 bytes of tag information; is that what I 
should suspect?  As long as it doesn't pass more than 78 bytes there should be 
plenty of room for the tag information.  Since it fits into a 64 byte OOB, I 
can't see it being an issue.  I assume that YAFFS2's tag information doesn't 
grow with block size, correct?

My gut feeling at the moment is that this isn't a YAFFS2 issue, but an issue at 
a lower level.  Thanks for your help so far.  I've certainly learned a lot more 
about MTD, and how YAFFS2 interfaces with it.

I'm going to go back to testing the NAND directly with the MTD layer and see if I 
can get the NAND to do strange things from there.  I'm also going to look into 
backporting newer MTD code into our 2.6.20.4 kernel to see if that fixes the 
problem.  I've mentioned some of my issues on the MTD mailing list but haven't 
really gotten a response on that end.

Andrew McKay
Iders Inc.