Well, it looks like you're onto the right cause. A few comments below.

On Saturday 08 August 2009 11:01:41 Andrew McKay wrote:
> >> I have compiled in the latest version of YAFFS as of what was in the
> >> public repository Aug 6th and loaded it on my board. I've been testing
> >> the filesystem by untarring an archive and repeatedly copying the
> >> directory to new directories, removing some of them, and then continuing
> >> copying. Things seem to be working fine. When I went to go remove
> >> everything from NAND and give the board to one of the other developers I
> >> got some more bad messages from YAFFS.
> >
> > Did you do any reboots during the above? Rebooting forces rescanning
> > which will test the ECC.
>
> No I didn't. But I have rebooted and ran nanddump in the past and today
> and noticed that there are normally no ECC errors. However when I got NAND
> and YAFFS2 in a bad state today when I rebooted there were a lot of ECC
> errors just from the scan during boot.
>
> >> /mnt/nand # \rm -rf *
> >> **>> Erasure failed 3515
> >> **>> Block 3515 retired
> >> Block 3515 is in state 9 after gc, should be erased
> >> **>> Erasure failed 4201
> >> **>> Block 4201 retired
> >> Block 4201 is in state 9 after gc, should be erased
> >> **>> Erasure failed 8045
> >> **>> Block 8045 retired
> >> Block 8045 is in state 9 after gc, should be erased
> >
> > It looks like the erasure command failed. Try instrumenting the erase
> > function in the mtd.
>
> I have instrumented this function, and debugged it using the logic
> analyzer. The NAND device is returning 0xE1 as a status after erasing this
> block. So it was indeed an error according to the part. I'm not sure why
> the bad block wasn't marked successfully though. When I got the part in a
> state with a few of these bad blocks, even mtd-utils flash_erase and
> flash_eraseall would complain about an I/O error on that block.
> mtd-utils doesn't mark blocks as bad when it fails erasure, and eventually
> it was successfully erased. That isn't a good thing though, because I don't
> know which block it was, and it's writing and erasing okay at the moment,
> who knows when it will fail again.
>
> > Was it just those few blocks kicking up a problem or was there a whole
> > slew of them? If it was just a few then those might be real bad blocks
> > and then the above is OK.
>
> From my understanding though, NAND parts should not be throwing up bad
> blocks on a regular basis. My experience is that most have a few blocks
> marked bad from the factory and then a device might throw up a few more on
> a first pass erase. From there on, a block going bad is a rare event. Not
> something I'd be able to cause on a daily basis with a filesystem like
> YAFFS that has wearleveling.

That's my normal experience with NAND too. Generally the first few passes
throw up some bad blocks then things settle down.

> > From the /proc/yaffs it looks like many erasures worked and only a few
> > failed. That indicates that the mtd did not tell yaffs these were bad
> > blocks.
>
> Agreed. The blocks that got marked bad were not detected by MTD as bad
> blocks, and were not passed to YAFFS when it was requesting a list of bad
> blocks.
>
> > I would inspect the bad block marking and ID strategy and make sure it is
> > working OK.
>
> I've gone over it. I'm using Linux's standard bad blocking method. The
> first two bytes of the first two pages in a block have to be 0xFF for it to
> be considered a good block. When a block is told to be marked bad the mtd
> driver attempts to write these 4 bytes to 0x00. As long as 1 bit out of
> those 4 bytes turns to a zero, the block should be detected as bad on next
> boot during a bad block scan. I do need to figure out why a block isn't
> being marked bad when YAFFS makes that request.
> I have added trace code into nand_default_block_markbad in MTD to verify
> that it is executing properly.
>
> > Many ECC errors suggest that your mtd is trying to use the same oob bytes
> > for both data and ECC and/or bad block markers
> >
> > When yaffs reads/writes spare bytes it just passes a contiguous buffer
> > (say yyyyyyyy)
> >
> > Now let's say the mtd is using 6 bytes for ECC (shown as e) and 2 bytes
> > for the bad block table shown by b
> >
> > The actual oob placement might end up being
> > bbyyyyyeeeyyyyyee
> > or maybe
> > bbeeeeeeyyyyyyyyy
> > or whatever
> > and it is the job of the mtd to sort this out.
>
> I don't think I have any issues with conflicts of bytes in the OOB area.
> This is a break down of the 128 bytes.
>
> 0-1: Bad block marker
> 2-79: Unused (YAFFS has its 28 byte tag information inserted here)
> 80-127: ECC
>
> My ecclayout is defined as:
>
> static struct nand_ecclayout nand_oob_128 = {
>     .eccbytes = 48,
>     .eccpos = {
>         80, 81, 82, 83, 84, 85, 86, 87,
>         88, 89, 90, 91, 92, 93, 94, 95,
>         96, 97, 98, 99, 100, 101, 102, 103,
>         104, 105, 106, 107, 108, 109, 110, 111,
>         112, 113, 114, 115, 116, 117, 118, 119,
>         120, 121, 122, 123, 124, 125, 126, 127},
>     .oobfree = {
>         {.offset = 2,
>          .length = 78}}
> };
>
> Which means any oob data passed to MTD from yaffs should end up in bytes
> 2-79 as I mentioned above.
> I put in some trace code to confirm this:
>
> nandmtd2_WriteChunkWithTagsToNAND
> calling write_oob
> 4096 ib bytes
> 28 oob bytes
> yaffs_oob:
> d8 6a 00 00 02 01 00 10 01 00 00 80 00 00 00 00
> 26 00 00 00 05 00 00 00 fa ff ff ff
> calling nand_do_write_ops
> calling nand_fill_oob
> inserting offset by 2 bytes
> OOB:
> ff ff d8 6a 00 00 02 01 00 10 01 00 00 80 00 00
> 00 00 26 00 00 00 05 00 00 00 fa ff ff ff ff ff
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> aa aa 67 aa 59 6b ff ff ff ff ff ff ff ff ff ff
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> I haven't seen YAFFS2 pass more than 28 bytes of tag information, is that
> what I should suspect? As long as it doesn't pass more than 78 bytes there
> should be lots of room for the tag information. Since it fits into a 64
> byte OOB I can't see it as being an issue. I assume that YAFFS2's tag
> information doesn't grow with block size correct?

It writes a fixed-size structure.

> My gut feeling at the moment is that this isn't a YAFFS2 issue, but an
> issue at a lower level. Thanks for your help so far. I've certainly
> learned a lot more about MTD, and how YAFFS2 interfaces with it.

Unfortunately NAND often results in having to learn far more than you'd like.

> I'm going to go back to testing the NAND directly with the MTD layer and see
> if I can get the NAND to do strange things from there. I'm also going to
> look into back porting newer MTD code into our 2.6.20.4 kernel to see if
> that fixes the problem. I've mentioned some of my issues on the MTD
> mailing list but haven't really gotten a response on that end.

Let us know how you get on. This is interesting for everyone.

Charles
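
P.S. For anyone following the thread, the two rules discussed above can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions, not the kernel's code: place_tags(), block_is_bad() and the three macros are hypothetical names, the free region (offset 2, length 78) is taken from the nand_oob_128 layout quoted above, and the bad block check follows the "first two bytes of the first two pages" description rather than any particular MTD implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OOB_SIZE    128 /* spare bytes per page, as in the quoted layout */
#define FREE_OFFSET 2   /* oobfree offset from nand_oob_128 */
#define FREE_LENGTH 78  /* oobfree length from nand_oob_128 */

/*
 * Hypothetical helper: copy the file system's contiguous tag buffer into
 * the free region of an erased (all 0xFF) OOB image.
 * Returns 0 on success, -1 if the tags do not fit in the free region.
 */
int place_tags(uint8_t oob[OOB_SIZE], const uint8_t *tags, size_t len)
{
    assert(oob != NULL && tags != NULL);
    if (len > FREE_LENGTH)
        return -1;
    memset(oob, 0xFF, OOB_SIZE);          /* erased NAND reads as 0xFF */
    memcpy(oob + FREE_OFFSET, tags, len); /* bytes 0-1 stay free for the marker */
    return 0;
}

/*
 * Hypothetical helper: a block counts as bad if any bit in the first two
 * OOB bytes of either of its first two pages has been cleared.
 */
int block_is_bad(const uint8_t page0_oob[OOB_SIZE],
                 const uint8_t page1_oob[OOB_SIZE])
{
    return page0_oob[0] != 0xFF || page0_oob[1] != 0xFF ||
           page1_oob[0] != 0xFF || page1_oob[1] != 0xFF;
}
```

Feeding the 28 tag bytes from the trace into place_tags() gives an image that starts ff ff d8 6a, which matches the dump printed after nand_fill_oob, and anything longer than 78 bytes is rejected.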