Re: [Yaffs] Yaffs2 erasure issue on MT29 NAND part

Author: Charles Manning
Date:  
To: yaffs
Subject: Re: [Yaffs] Yaffs2 erasure issue on MT29 NAND part
Well it looks like you're onto the right cause. A few comments below.

On Saturday 08 August 2009 11:01:41 Andrew McKay wrote:
> >> I have compiled in the latest version of YAFFS as of what was in the
> >> public repository Aug 6th and loaded it on my board. I've been testing
> >> the filesystem by untarring an archive and repeatedly copying the
> >> directory to new directories, removing some of them, and then continuing
> >> copying. Things seemed to be working fine. When I went to remove
> >> everything from NAND and give the board to one of the other developers I
> >> got some more bad messages from YAFFS.
> >
> > Did you do any reboots during the above? Rebooting forces rescanning
> > which will test the ECC.
>
> No I didn't. But I have rebooted and run nanddump in the past and today,
> and normally there are no ECC errors. However, once NAND and YAFFS2 got
> into a bad state today, rebooting produced a lot of ECC errors just from
> the scan during boot.
>
> >> /mnt/nand # \rm -rf *
> >> **>> Erasure failed 3515
> >> **>> Block 3515 retired
> >> Block 3515 is in state 9 after gc, should be erased
> >> **>> Erasure failed 4201
> >> **>> Block 4201 retired
> >> Block 4201 is in state 9 after gc, should be erased
> >> **>> Erasure failed 8045
> >> **>> Block 8045 retired
> >> Block 8045 is in state 9 after gc, should be erased
> >
> > It looks like the erasure command failed. Try instrumenting the erase
> > function in the mtd.
>
> I have instrumented this function, and debugged it using the logic
> analyzer. The NAND device is returning 0xE1 as a status after erasing this
> block. So it was indeed an error according to the part. I'm not sure why
> the bad block wasn't marked successfully though. When I got the part in a
> state with a few of these bad blocks, even mtd-utils flash_erase and
> flash_eraseall would complain about an I/O error on that block. mtd-utils
> doesn't mark blocks as bad when erasure fails, and eventually it was
> successfully erased. That isn't a good thing though: I don't know which
> block it was, and although it's writing and erasing okay at the moment,
> who knows when it will fail again.
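
For reference, here is a small sketch of how a status byte like 0xE1 breaks
down, assuming the part uses the usual ONFI-style status layout (bit 0 = fail,
bit 6 = ready, bit 7 = not write protected). The macro and function names are
made up for illustration and are not from the MTD source; check the bit
meanings against the MT29 datasheet.

```c
#include <stdio.h>

/* Status bits as commonly defined for ONFI-style NAND. The macro names
 * are illustrative; confirm the bits against the MT29 datasheet. */
#define ST_FAIL   0x01  /* last program/erase operation failed */
#define ST_ARDY   0x20  /* array ready */
#define ST_RDY    0x40  /* device ready for a new command */
#define ST_NOT_WP 0x80  /* set when the device is NOT write protected */

/* Returns 1 if the device is ready and reports a failed operation. */
static int nand_op_failed(unsigned char status)
{
    return (status & ST_RDY) != 0 && (status & ST_FAIL) != 0;
}

/* Prints a one-line decode of a status byte. */
static void nand_print_status(unsigned char status)
{
    printf("status 0x%02X: %s, %s, %s\n", status,
           (status & ST_RDY) ? "ready" : "busy",
           (status & ST_NOT_WP) ? "not write protected" : "write protected",
           (status & ST_FAIL) ? "FAIL" : "pass");
}
```

Under that assumption 0xE1 means the part was ready, not write protected, and
had the fail bit set, which agrees with the erase really being rejected by the
device.
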
>
> > Was it just those few blocks kicking up a problem or was there a whole
> > slew of them? If it was just a few then those might be real bad blocks
> > and then the above is OK.
>
> From my understanding though, NAND parts should not be throwing up bad
> blocks on a regular basis. My experience is that most have a few blocks
> marked bad from the factory and then a device might throw up a few more on
> a first pass erase. From there on, a block going bad is a rare event. Not
> something I'd be able to cause on a daily basis with a filesystem like
> YAFFS that has wear leveling.


That's my normal experience with NAND too. Generally the first few passes
throw up some bad blocks then things settle down.

>
> > From the /proc/yaffs it looks like many erasures worked and only a few
> > failed. That indicates that the mtd did not tell yaffs these were bad
> > blocks.
>
> Agreed. The blocks that got marked bad were not detected by MTD as bad
> blocks, and were not passed to YAFFS when it was requesting a list of bad
> blocks.
>
> > I would inspect the bad block marking and ID strategy and make sure it is
> > working OK.
>
> I've gone over it. I'm using Linux's standard bad block marking method.
> The first two bytes of the first two pages in a block have to be 0xFF for
> it to be considered a good block. When asked to mark a block bad, the mtd
> driver attempts to write these 4 bytes to 0x00. As long as 1 bit out of
> those 4 bytes turns to a zero, the block should be detected as bad on next
> boot during a bad block scan. I do need to figure out why a block isn't
> being marked bad when YAFFS makes that request. I have added trace code
> into nand_default_block_markbad in MTD to verify that it is executing
> properly.
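
The rule described above can be written down as a small check. This is only a
sketch of the rule as stated (two marker bytes in each of the first two pages,
any cleared bit means bad), not the actual MTD scan code. The names
block_is_bad, BBM_BYTES and BBM_PAGES are made up for illustration.

```c
#include <stddef.h>

#define BBM_BYTES 2   /* marker bytes checked at the start of each page's OOB */
#define BBM_PAGES 2   /* pages checked per block */

/* oob[p] points at the OOB bytes of page p of the block.
 * Returns 1 if any bit in the marker bytes has been cleared, else 0. */
static int block_is_bad(const unsigned char *oob[BBM_PAGES])
{
    for (size_t p = 0; p < BBM_PAGES; p++)
        for (size_t i = 0; i < BBM_BYTES; i++)
            if (oob[p][i] != 0xFF)
                return 1;
    return 0;
}
```

Running something like this over a nanddump of a block that yaffs asked to
retire would show whether the marker write actually reached the flash.
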
>
> > Many ECC errors suggest that your mtd is trying to use the same oob bytes
> > for both data and ECC and/or bad block markers
> >
> > When yaffs reads/writes spare bytes it just passes a contiguous buffer
> > (say yyyyyyyy)
> >
> > Now let's say the mtd is using 6 bytes for ECC (shown as e) and 2 bytes
> > for the bad block marker (shown as b)
> >
> > The actual oob placement might end up being
> > bbyyyyeeeyyyyeee
> > or maybe
> > bbeeeeeeyyyyyyyy
> > or whatever
> > and it is the job of the mtd to sort this out.
>
> I don't think I have any issues with conflicts of bytes in the OOB area.
> This is a break down of the 128 bytes.
>
> 0-1: Bad block marker
> 2-79: Unused (YAFFS has its 28 byte tag information inserted here)
> 80-127: ECC
>
> My ecclayout is defined as:
>
> static struct nand_ecclayout nand_oob_128 = {
>    .eccbytes = 48,
>    .eccpos = {
>         80, 81, 82, 83, 84, 85, 86, 87,
>         88, 89, 90, 91, 92, 93, 94, 95,
>         96, 97, 98, 99, 100, 101, 102, 103,
>         104, 105, 106, 107, 108, 109, 110, 111,
>         112, 113, 114, 115, 116, 117, 118, 119,
>         120, 121, 122, 123, 124, 125, 126, 127},
>    .oobfree = {
>      {.offset = 2,
>       .length = 78}}
> };
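
A quick way to rule out overlapping regions is to mark every byte each region
claims and look for double use. The sketch below uses a simplified stand-in
struct (oob_layout is hypothetical, not the kernel's nand_ecclayout) and
assumes one contiguous ECC region and one free region, as in the layout above.

```c
#include <stddef.h>

#define OOB_SIZE 128

/* Simplified stand-in for the kernel's ecclayout; not the real struct. */
struct oob_layout {
    size_t bbm_len;             /* bad block marker bytes starting at 0 */
    size_t ecc_first, ecc_len;  /* contiguous ECC region */
    size_t free_off, free_len;  /* single oobfree region */
};

/* Returns 1 if all regions fit in the OOB and no byte is claimed twice. */
static int layout_is_consistent(const struct oob_layout *l)
{
    unsigned char used[OOB_SIZE] = {0};
    size_t i;

    if (l->bbm_len > OOB_SIZE ||
        l->ecc_first + l->ecc_len > OOB_SIZE ||
        l->free_off + l->free_len > OOB_SIZE)
        return 0;

    for (i = 0; i < l->bbm_len; i++)
        used[i]++;
    for (i = 0; i < l->ecc_len; i++)
        used[l->ecc_first + i]++;
    for (i = 0; i < l->free_len; i++)
        used[l->free_off + i]++;

    for (i = 0; i < OOB_SIZE; i++)
        if (used[i] > 1)
            return 0;
    return 1;
}
```

With bbm_len = 2, ECC at 80 for 48 bytes and free space at 2 for 78 bytes, the
regions tile the 128 bytes without overlap, so on paper the layout looks fine.
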

>
> Which means any oob data passed to MTD from yaffs should end up in bytes
> 2-79 as I mentioned above. I put in some trace code to confirm this:
>
> nandmtd2_WriteChunkWithTagsToNAND
> calling write_oob
>          4096 ib bytes
>          28 oob bytes
> yaffs_oob:
>          d8 6a 00 00 02 01 00 10 01 00 00 80 00 00 00 00
>          26 00 00 00 05 00 00 00 fa ff ff ff
> calling nand_do_write_ops
> calling nand_fill_oob
> inserting offset by 2 bytes
> OOB:
>          ff ff d8 6a 00 00 02 01 00 10 01 00 00 80 00 00
>          00 00 26 00 00 00 05 00 00 00 fa ff ff ff ff ff
>          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>          aa aa 67 aa 59 6b ff ff ff ff ff ff ff ff ff ff
>          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
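
The placement the trace shows can be summarised in a few lines. This is a
rough sketch of the idea only, assuming a single free region at offset 2 of
length 78 as in the layout above. fill_oob is a made-up name, not the kernel's
nand_fill_oob, and the ECC bytes are left blank because the driver adds them
separately.

```c
#include <stddef.h>
#include <string.h>

#define OOB_SIZE 128
#define FREE_OFF 2    /* first free byte after the bad block marker */
#define FREE_LEN 78   /* free bytes available before the ECC region */

/* Copies len tag bytes into the free region of a blank (all 0xFF) OOB
 * buffer. Returns 0 on success, -1 if the tags do not fit. */
static int fill_oob(unsigned char oob[OOB_SIZE],
                    const unsigned char *tags, size_t len)
{
    if (len > FREE_LEN)
        return -1;
    memset(oob, 0xFF, OOB_SIZE);
    memcpy(oob + FREE_OFF, tags, len);
    return 0;
}
```

Feeding it the 28 tag bytes from the trace puts d8 at byte 2 and fa at byte
26, which matches the OOB dump.
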

>
> I haven't seen YAFFS2 pass more than 28 bytes of tag information, is that
> what I should suspect? As long as it doesn't pass more than 78 bytes there
> should be lots of room for the tag information. Since it fits into a 64
> byte OOB I can't see it as being an issue. I assume that YAFFS2's tag
> information doesn't grow with block size correct?


Correct, it doesn't grow. YAFFS2 writes a fixed-size tag structure.

>
> My gut feeling at the moment is that this isn't a YAFFS2 issue, but an
> issue at a lower level. Thanks for your help so far. I've certainly
> learned a lot more about MTD, and how YAFFS2 interfaces with it.


Unfortunately NAND often results in having to learn far more than you'd like.

>
> I'm going to go back to testing the NAND directly with the MTD layer to see
> if I can get the NAND to do strange things from there. I'm also going to
> look into backporting newer MTD code into our 2.6.20.4 kernel to see if
> that fixes the problem. I've mentioned some of my issues on the MTD
> mailing list but haven't really gotten a response on that end.


Let us know how you get on. This is interesting for everyone.

Charles