On 02/14/2012 09:14 PM, CHEN XUEQIN wrote:
> Hi Peter:
>    Thank you for your tip.
>
> On 02/15/2012 00:56, Peter Barada wrote:
>
>> On 02/14/2012 11:47 AM, CHEN XUEQIN wrote:
>>> Hi Peter:
>>>
>>> On 02/13/2012 23:09, Peter Barada wrote:
>>>
>>>>>>          Here is my question:
>>>>>>              1. Is my patch wrong?
>>>>>>              2. Why does the official yaffs2 code require 3 chunkErrorStrike to
>>>>>>                 retire a block? Would reducing it to 1 chunkErrorStrike wrongly
>>>>>>                 mark good blocks bad?
>>>>>>              3. Should I remove the patch?
>>>>>>
>>>>>>          Thanks a lot for your advice.
>>>> Yes, your patch is wrong as any read error will retire the block.
>>>>
>>>> If you see bit-flips from data read out of MTD, then your NAND driver
>>>> isn't properly using ECC to correct the data.  If MTD used ECC to
>>>> correct the data you would see a -EUCLEAN return from MTD on read which
>>>> will percolate through yaffs_HandleChunkError() - and increment the
>>>> strike count.
>>>      Thanks for your reply. Now I know the patch is wrong. I've read the Samsung
>>> NAND chip datasheet and analysed the kernel log. I think most of the blocks that
>>> struck out were caused by errors in write operations. But it's very strange that
>>> those blocks went into the program error state.  According to the chip datasheet,
>>> if a program operation results in an error, the block containing the failed page
>>> should be mapped out and the target data copied to another block. So it's
>>> reasonable for yaffs to retire the block in yaffs_HandleWriteChunkError even if the
>>> chunk error strike count is only one. But why so many program errors? Any ideas?
>>>
>>>      In addition, I use hardware ECC in the MTD driver; the error-correcting code
>>> is a Hamming code. The NAND chip is MLC, so hardware ECC can't correct multi-bit
>>> errors and MTD returns a read error to yaffs, which may increase the number of
>>> blocks struck out. I wonder how yaffs handles uncorrectable bit errors in
>>> order to keep filesystem data reliable and intact. If key yaffs2 data
>>> read from NAND has some bits in error, how can yaffs2 work without crashing?
>>>
>> From all appearances your MTD driver is not properly handling ECC,
>> either in the write or the read.  I assume that on reads if you see a
>> single bit-flip and there's no error from MTD, then MTD is *not*
>> applying ECC on the read to correct any flipped bits.  It's the job of
>> the MTD driver to properly compute and write the ECC, and then apply the
>> ECC on the read to correct the possible flipped bits - this is why ECC
>> is used in NAND, to improve the reliability of the data to make sure
>> that the UBER (uncorrectable bit error rate) is low (somewhere around
>> 10^-15). Without proper ECC NAND can easily show a UBER of 10^-8 or
>> higher which is what I think you are seeing.
>>
>  From the kernel log, my MTD driver reported multi-bit flip errors and could
> not correct the bits. The NAND controller only supports single-bit
> flip correction. But the UBER is too high in my devices. My
> devices only worked for about half a year and then many errors were generated.
> Could I try a software ECC such as a BCH code to replace the hardware ECC? I
> wonder what the CPU usage of software ECC would be?

To find out the CPU usage of software ECC you'll have to configure/code
it into your kernel, boot it and then measure it...
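
To see why a single-bit-correcting code cannot repair multi-bit flips,
here is a toy sketch in C using a Hamming(7,4) code. It is an illustration
only: the function names are made up, and a real NAND controller applies its
Hamming code over much larger chunks. In this toy code a double flip gets
"corrected" to the wrong data; real NAND Hamming layouts can typically detect
a two-bit error and report it as uncorrectable (as your log shows), but they
still cannot repair it.

```c
#include <stdint.h>

/* Toy Hamming(7,4): 4 data bits + 3 parity bits, corrects ONE flipped bit.
 * Hypothetical helper names; not taken from any kernel or yaffs source.
 * Codeword positions 1..7 (stored in bits 0..6): p1 p2 d1 p3 d2 d3 d4. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* parity over positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* parity over positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* parity over positions 4,5,6,7 */

    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decode a 7-bit codeword, flipping back the one bit the syndrome points at.
 * With two flipped bits the syndrome points at a third, innocent bit. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t syndrome = 0;

    /* XOR together the positions (1..7) of all set bits. */
    for (int pos = 1; pos <= 7; pos++)
        if ((cw >> (pos - 1)) & 1)
            syndrome ^= (uint8_t)pos;

    if (syndrome)
        cw ^= (uint8_t)(1u << (syndrome - 1));

    return (uint8_t)(((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
                     (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3));
}
```

For example, hamming74_decode(hamming74_encode(0xB) ^ 0x08) gives back 0xB,
while flipping two bits (^ 0x18) decodes to something other than 0xB.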

>> If YAFFS sees errors on reads it increments the strike count and if it
>> hits the limit then it will mark the block bad.  This may be what you're
>> seeing.  You need to test your MTD driver implementation *independently*
>> of YAFFS to make sure it is operating as expected.  Once you *know* your
>> MTD driver works correctly then YAFFS should work fine...
>>
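
The strike counting can be pictured with a simplified model. This is not
the yaffs2 source: the struct, the function name and the limit parameter are
all hypothetical, and the exact threshold comparison in yaffs2 may differ.
It only shows why a limit of 1 retires a block on its very first error,
while a limit of 3 tolerates the occasional transient one.

```c
#include <stdbool.h>

/* Hypothetical per-block bookkeeping; NOT the real yaffs2 block info. */
struct block_state {
    int  strikes;   /* chunk errors seen on this block so far */
    bool retired;   /* true once the block should be marked bad */
};

/* Record one chunk error against the block.
 * Returns true if the block is (now) retired.
 * 'limit' is the assumed number of strikes that triggers retirement. */
static bool note_chunk_error(struct block_state *b, int limit)
{
    if (!b->retired) {
        b->strikes++;
        if (b->strikes >= limit)
            b->retired = true;
    }
    return b->retired;
}
```

With limit set to 3 the first two calls return false and the third returns
true; with limit set to 1 the first call already retires the block.
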
> Yes, I should test the MTD driver implementation. I wrote some code to
> fill a NAND block, read the block, and erase the block. Maybe the code was
> too simple to find the problem. Is there any open source MTD test program available?
The MTD drivers in the kernel include test modules; look at
http://www.linux-mtd.infradead.org/doc/general.html#L_mtd_tests for more
information.
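
If you also want to keep your own fill/read/erase test, a rough sketch of
the comparison step in C might look like the following. The helper names
are hypothetical and this is not how the kernel test modules are written;
the idea is just to write a repeatable pseudo-random pattern rather than a
constant fill, and to count flipped bits on read-back instead of only
checking pass/fail.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill buf with a repeatable pseudo-random pattern derived from seed
 * (xorshift32), so the same seed regenerates the expected data later. */
static void fill_pattern(uint8_t *buf, size_t len, uint32_t seed)
{
    uint32_t x = seed ? seed : 1u;   /* xorshift state must be non-zero */

    for (size_t i = 0; i < len; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buf[i] = (uint8_t)(x >> 24);
    }
}

/* Count the bits that differ between what was written and what was read. */
static size_t count_bit_flips(const uint8_t *written,
                              const uint8_t *readback, size_t len)
{
    size_t flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t diff = written[i] ^ readback[i];

        while (diff) {               /* clear the lowest set bit each pass */
            diff &= (uint8_t)(diff - 1);
            flips++;
        }
    }
    return flips;
}
```

Run it per page after reading back through MTD: a non-zero count with no
error reported from the read means ECC was not applied.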
>
> Regards,
> Xueqin Chen


-- 
Peter Barada
peter.barada@logicpd.com