On 02/14/2012 11:47 AM, CHEN XUEQIN wrote:
> Hi Peter:
>
> On 2012-02-13 23:09, Peter Barada wrote:
>
>>>> Here is my question:
>>>> 1. Is my patch wrong?
>>>> 2. Why does the official yaffs2 code require 3 chunkErrorStrike
>>>> counts to retire a block? Will reducing it to 1 chunkErrorStrike
>>>> wrongly mark good blocks bad?
>>>> 3. Should I remove the patch?
>>>>
>>>> Thanks a lot for your advice.
>> Yes, your patch is wrong as any read error will retire the block.
>>
>> If you see bit-flips from data read out of MTD, then your NAND driver
>> isn't properly using ECC to correct the data. If MTD used ECC to
>> correct the data you would see a -EUCLEAN return from MTD on read which
>> will percolate through yaffs_HandleChunkError() - and increment the
>> strike count.
>
> Thanks for your reply. Now I know the patch is wrong. I've read the Samsung
> NAND chip datasheet and analysed the kernel log. I think so many blocks were
> struck out because of errors in write operations. But it's very strange that
> those blocks went into the program error state. According to the chip
> datasheet, if a program operation results in an error, you should map out the
> block containing the failed page and copy the target data to another block.
> So it's reasonable for yaffs to retire the block in
> yaffs_HandleWriteChunkError even if the chunk error strike count is only one.
> But why so many program errors? Any ideas?
>
> In addition, I used hardware ECC in the MTD driver; the error correcting code
> is a Hamming code. The NAND chip is MLC, so the hardware ECC can't correct
> multi-bit errors and MTD returns a read error to yaffs, which may increase the
> number of blocks struck out. I wonder how yaffs handles uncorrectable bit
> errors in order to keep filesystem data reliable and intact. If key yaffs2
> data read from NAND has errors in some bits, how can yaffs2 work without
> crashing?
>
From all appearances your MTD driver is not properly handling ECC,
either in the write or the read. I assume that on reads if you see a
single bit-flip and there's no error from MTD, then MTD is *not*
applying ECC on the read to correct any flipped bits. It's the job of
the MTD driver to properly compute and write the ECC, and then apply the
ECC on the read to correct any flipped bits. This is why ECC
is used in NAND: to improve the reliability of the data and make sure
that the UBER (uncorrectable bit error rate) is low (somewhere around
10E-15). Without proper ECC, NAND can easily show a UBER of 10E-8 or
higher, which is what I think you are seeing.
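
As a rough illustration of what "compute the ECC on write, apply it on
read" means, here is a minimal sketch in C. It is not the kernel's NAND ECC
code or any real driver: the names (sketch_ecc_calc, sketch_ecc_correct),
the 256-byte chunk size and the layout of the ECC are all assumptions made
for the example, and it ignores bit-flips in the stored ECC itself. It only
shows the Hamming-style idea that one flipped data bit can be located and
fixed (reported like MTD's -EUCLEAN), while two flipped bits can only be
detected (reported like MTD's -EBADMSG).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the Linux errno values that MTD returns on read. */
#define SKETCH_EUCLEAN 117  /* a bit-flip was found and corrected */
#define SKETCH_EBADMSG 74   /* the error could not be corrected */

#define CHUNK_BYTES 256u                     /* assumed ECC chunk size */
#define INDEX_MASK  (CHUNK_BYTES * 8u - 1u)  /* 11 bits index 2048 data bits */

/* Hypothetical ECC layout, for illustration only. */
struct sketch_ecc {
    uint16_t p1;  /* XOR of the bit indices of all set bits */
    uint16_t p0;  /* XOR of the inverted bit indices of all set bits */
};

/* Compute the ECC for one chunk; done before the data is programmed. */
static struct sketch_ecc sketch_ecc_calc(const uint8_t *buf)
{
    struct sketch_ecc e = {0, 0};

    assert(buf != NULL);
    for (unsigned i = 0; i < CHUNK_BYTES * 8u; i++) {
        if (buf[i / 8] & (1u << (i % 8))) {
            e.p1 ^= (uint16_t)i;
            e.p0 ^= (uint16_t)(~i & INDEX_MASK);
        }
    }
    return e;
}

/*
 * Apply the stored ECC to data just read back.
 * Returns 0 if the data is clean, -SKETCH_EUCLEAN if one data bit was
 * flipped and has been corrected in place, -SKETCH_EBADMSG otherwise.
 */
static int sketch_ecc_correct(uint8_t *buf, struct sketch_ecc stored)
{
    struct sketch_ecc now = sketch_ecc_calc(buf);
    unsigned s1 = (unsigned)(stored.p1 ^ now.p1);
    unsigned s0 = (unsigned)(stored.p0 ^ now.p0);

    if (s1 == 0 && s0 == 0)
        return 0;

    /* One flipped bit at index j gives s1 == j and s0 == ~j. */
    if ((s1 ^ s0) == INDEX_MASK) {
        buf[s1 / 8] ^= (uint8_t)(1u << (s1 % 8));
        return -SKETCH_EUCLEAN;
    }

    /* Two flipped data bits give s1 == s0, so they land here. */
    return -SKETCH_EBADMSG;
}
```

If a driver skips the correct step on read, the flipped bit is handed up
to the filesystem as if it were good data, with no error code at all.
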
If YAFFS sees errors on reads it increments the strike count, and if the
count hits the limit it marks the block bad. This may be what you're
seeing. You need to test your MTD driver implementation *independent*
of YAFFS to make sure it is operating as expected. Once you *know* your
MTD driver works correctly then YAFFS should work fine...
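
To make the strike counting concrete, here is a small model in C. It is
only a sketch and not the YAFFS source: the struct and function names are
made up, the limit of 3 is taken from the number quoted earlier in this
thread, and the exact comparison YAFFS uses may differ. Retiring a block
immediately on a program failure follows the datasheet advice quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block bookkeeping, for illustration only. */
struct sketch_block {
    int  strikes;  /* flagged reads seen on this block so far */
    int  limit;    /* strikes needed to retire (assumed 3 by default) */
    bool retired;  /* true once the block should be marked bad */
};

/* A read that MTD flagged with an error costs the block one strike. */
static void sketch_on_read_error(struct sketch_block *b)
{
    assert(b != NULL);
    if (b->retired)
        return;
    b->strikes++;
    if (b->strikes >= b->limit)
        b->retired = true;
}

/* A failed program operation retires the block at once. */
static void sketch_on_write_error(struct sketch_block *b)
{
    assert(b != NULL);
    b->strikes++;
    b->retired = true;
}
```

With limit set to 1, as in the patch discussed above, the first flagged
read retires the block, so a driver that reports read errors freely will
burn through good blocks quickly.
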
--
Peter Barada
peter.barada@logicpd.com