I'll see what I can do.  These boards are supposed to ship off to one of our
customers, but I'll see if I can snag one and keep it stashed away for testing
purposes.

Andrew

On Tue, 11 Aug 2009 10:54:46 +1200, Charles Manning
<manningc2@actrix.gen.nz> wrote:
> Hi Andrew
> 
> Thanks for the detailed investigation.
> 
> Please don't trash this board! This sounds like a good candidate for testing
> something I'm working on.
> 
> Bad block management is both a headache and a performance issue, so I'm
> looking at the mods needed to make yaffs (maybe only yaffs2 mode) work well
> without bad block marking. Instead of having MTD point out bad blocks, yaffs
> would figure them out as it goes by watching what fails.
> 
> -- Charles
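Charles's idea of letting yaffs infer bad blocks from observed failures, rather than from markers written by MTD, could look roughly like the C sketch below. It is a minimal illustration under stated assumptions, not yaffs code: `struct block_health`, `health_init()`, `record_failure()`, `block_usable()`, `NUM_BLOCKS` and `RETIRE_THRESHOLD` are all hypothetical names, and the threshold value of 3 is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values chosen for illustration only; not yaffs parameters. */
#define NUM_BLOCKS 16
#define RETIRE_THRESHOLD 3

/* Per-block health kept by the file system itself instead of relying on
 * bad block markers written into the OOB area by MTD. */
struct block_health {
    unsigned failures[NUM_BLOCKS];
    bool retired[NUM_BLOCKS];
};

static void health_init(struct block_health *h)
{
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        h->failures[i] = 0;
        h->retired[i] = false;
    }
}

/* Call whenever an erase or page program on `block` reports a failure.
 * The block is retired once it has failed RETIRE_THRESHOLD times. */
static void record_failure(struct block_health *h, size_t block)
{
    assert(block < NUM_BLOCKS);
    if (h->retired[block])
        return;
    h->failures[block]++;
    if (h->failures[block] >= RETIRE_THRESHOLD)
        h->retired[block] = true;
}

/* The allocator would skip retired blocks when choosing where to write. */
static bool block_usable(const struct block_health *h, size_t block)
{
    assert(block < NUM_BLOCKS);
    return !h->retired[block];
}
```

Because the counts live only in RAM in this sketch, a real design would also need to persist them or rediscover failing blocks after a reboot.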
> 
> 
> 
> On Tuesday 11 August 2009 06:39:23 Andrew McKay wrote:
>> Hey Charles,
>>
>> I did end up finding a strange issue with the NAND part itself during a
>> bunch of testing I did this weekend.
>>
>> >> I'm going to go back to testing the NAND directly with the MTD layer and
>> >> see if I can get the NAND to do strange things from there.  I'm also
>> >> going to look into backporting newer MTD code into our 2.6.20.4 kernel
>> >> to see if that fixes the problem.  I've mentioned some of my issues on
>> >> the MTD mailing list but haven't really gotten a response on that end.
>> >
>> > Let us know how you get on. This is interesting for everyone.
>>
>> With my logic analyzer I verified that the erase process is working.
>> When YAFFS (or MTD) claims that an erase failed, the part really is
>> reporting that the erase failed.  However, after the failure this block is
>> no longer readable or writable; I tested both reads and writes with the
>> mtd-utils applications.  This of course means that when MTD tries to mark
>> the block bad, it can't.  We might be able to fix the issue if we move to
>> using a bad block table written to flash.  Currently we just scan the part
>> for bad blocks on boot and use a RAM-based bad block table.
>>
>> I have verified the timing of all the NAND control lines and used a scope
>> to verify that the signals are clean and don't have excessive overshoot or
>> undershoot.  I can't see anything wrong on our end.  We're trying to source
>> different 16Gbit parts, and I'm trying to get in contact with a Micron FAE
>> to see if they have seen this issue.
>>
>> Here's the email I sent off to the MTD mailing list detailing the test and
>> the problems I saw.
>>
>>
>> -------------------------------------------------------------------------------
>> Hey guys,
>>
>> I'm having issues with MT29F16G08DAA parts with MTD on Linux.  I have found
>> a strange issue with block erase failures.  The part seems to get into a
>> state where, if a block fails to erase, all pages within that block
>> (including the OOB area) read as all 0x00.  I realize that internally the
>> part may write all bits to 0 to prevent over-erasure of already erased
>> bits; however, if the device is power-cycled, the data that was in that
>> block before the erase is still there.
>>
>> Here is a dump of what I'm seeing.
>>
>> /mnt/zen/mtd-utils # ./nanddump -s 0x29a00000 -l 0x1000 -p /dev/mtd12
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 204
>> Number of bbt blocks: 0
>> Block size 262144, page size 4096, OOB size 128
>> Dumping data starting at 0x29a00000 and ending at 0x29a01000...
>> 0x29a00000: 03 00 00 00 09 75 00 00 ff ff 2e 73 76 6e 00 00
>> 0x29a00010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00100: 00 00 00 00 00 00 00 00 00 00 ff ff ed 41 00 00
>> 0x29a00110: 00 00 00 00 00 00 00 00 0a d6 7c 4a 0b d6 7c 4a
>> 0x29a00120: 0b d6 7c 4a ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00130: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00140: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00150: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> [SNIP]
>> 0x29a00f40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00f50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00f60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00f70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00f90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00fa0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00fb0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00fc0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00fd0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00fe0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> 0x29a00ff0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>    OOB Data: ff ff cb 19 00 00 1d 75 00 30 09 75 00 80 00 00
>>    OOB Data: 00 00 25 7c 05 c4 06 00 00 00 f9 ff ff ff ff ff
>>    OOB Data: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>    OOB Data: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>    OOB Data: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>    OOB Data: ff 00 c3 a9 6a 67 ff ff ff ff ff ff ff ff ff ff
>>    OOB Data: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>    OOB Data: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>
>>
>> /mnt/zen/mtd-utils # ./flash_erase /dev/mtd12 0x29a00000
>> Erase Total 1 Units
>> nand_erase: start = 0x29a00000, len = 262144
>> Performing Flash Erase of length 262144 at offset 0x29a00000
>> single_erase_cmd nand_erase: Failed erase, page 0x00029a00
>>
>> MTD Erase failure: Input/output error
>>
>> /mnt/zen/mtd-utils # ./nanddump -s 0x29a00000 -l 0x1000 -p /dev/mtd12
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 204
>> Number of bbt blocks: 0
>> Block size 262144, page size 4096, OOB size 128
>> Dumping data starting at 0x29a00000 and ending at 0x29a01000...
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: BAD
>> ECC: 16 uncorrectable bitflip(s) at offset 0x29a00000
>> 0x29a00000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [SNIP]
>> 0x29a00f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00f50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00f60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00f70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x29a00ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>    OOB Data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> /mnt/zen/mtd-utils #
>>
>> The big issue here is that I've also tested writing to this block after a
>> failed erase, and I can't seem to write to it either.  I tried forcing the
>> block to all zeros after the failed erase, and on reboot the previous data
>> is still in the block.  This means that when MTD is told to mark this block
>> as bad, the write of the first two bytes in the first two pages of the
>> block also fails, so the block never gets marked bad.  Eventually an erase
>> will work and the block goes back to a "working" state.  However, it'll end
>> up in this strange state again at some point.
>>
>> Has anyone seen this happen before with NAND parts?  Is there a way to
>> avoid this?  I have tried issuing a RESET command to the part after a
>> failed erase, but all the pages in the block stay in this strange state.
>> The only thing that seems to recover the device is to power cycle it.
>>
>> I suppose I could switch to using a BBT that is written to the NAND device;
>> hopefully then the bad blocks would be kept track of.  Though I've never
>> had an issue with marking blocks bad by writing the first two bytes of the
>> first two pages to 0x00 and having Linux build a BBT on the fly during
>> boot-up.
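The boot-time scan that builds the RAM-based BBT can be sketched in C as follows. This is an illustration, not the kernel's actual scanning code: `build_ram_bbt()`, the `oob_reader` callback type and `array_reader()` are hypothetical, and the sketch assumes the marker convention described in the email (first two OOB bytes of the first two pages), treating any byte other than 0xFF as a bad block mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor returning OOB byte `off` of page `page` in `block`.
 * A real driver would perform an OOB read through the MTD layer here. */
typedef uint8_t (*oob_reader)(size_t block, int page, int off, void *ctx);

/* Build a RAM-only bad block table by scanning each block's markers,
 * assumed to be OOB bytes 0-1 of pages 0-1. bbt[i] becomes true for a
 * bad block. Returns the number of bad blocks found. */
static size_t build_ram_bbt(bool *bbt, size_t nblocks, oob_reader read_oob,
                            void *ctx)
{
    size_t bad = 0;

    assert(bbt != NULL && read_oob != NULL);
    for (size_t blk = 0; blk < nblocks; blk++) {
        bbt[blk] = false;
        for (int page = 0; page < 2 && !bbt[blk]; page++) {
            for (int off = 0; off < 2; off++) {
                if (read_oob(blk, page, off, ctx) != 0xFF) {
                    bbt[blk] = true; /* any non-0xFF byte marks the block bad */
                    break;
                }
            }
        }
        if (bbt[blk])
            bad++;
    }
    return bad;
}

/* Example reader backed by an in-memory copy of the marker bytes,
 * laid out as markers[block][page][byte]. */
static uint8_t array_reader(size_t block, int page, int off, void *ctx)
{
    const uint8_t (*markers)[2][2] = ctx;
    return markers[block][page][off];
}
```

Since a table like this is rebuilt from the on-flash markers at every boot, a block whose marker write never took effect comes back as good, which is why a BBT stored on the NAND itself is being considered.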
>>
>> -------------------------------------------------------------------------------
>>
>>
>>
>> Andrew McKay
>> Iders Inc.