On 02/14/2012 09:14 PM, CHEN XUEQIN wrote:
> Hi Peter:
> Thank you for your tip.
>
> On 2012-02-15 00:56, Peter Barada wrote:
>
>> On 02/14/2012 11:47 AM, CHEN XUEQIN wrote:
>>> Hi Peter:
>>>
>>> On 2012-02-13 23:09, Peter Barada wrote:
>>>
>>>>>> Here is my question:
>>>>>> 1. Is my patch wrong?
>>>>>> 2. Why does the official yaffs2 code assume 3 chunkErrorStrike to
>>>>>> retire a block? Will reducing it to 1 chunkErrorStrike wrongly
>>>>>> mark a good block bad?
>>>>>> 3. Should I remove the patch?
>>>>>>
>>>>>> Thanks a lot for your advice.
>>>> Yes, your patch is wrong as any read error will retire the block.
>>>>
>>>> If you see bit-flips from data read out of MTD, then your NAND driver
>>>> isn't properly using ECC to correct the data. If MTD used ECC to
>>>> correct the data you would see a -EUCLEAN return from MTD on read which
>>>> will percolate through yaffs_HandleChunkError() - and increment the
>>>> strike count.
>>> Thanks for your reply. Now I know the patch is wrong. I've read the Samsung
>>> NAND chip datasheet and analysed the kernel log. I think the many blocks
>>> struck out were caused by errors in write operations. But it's very strange
>>> that those blocks went into the program error state. According to the chip
>>> datasheet, if a program operation results in an error, you map out the block
>>> including the page in error and copy the target data to another block. So
>>> it's reasonable for yaffs to retire the block in yaffs_HandleWriteChunkError
>>> even if the chunk error strike count is only one. But why so many program
>>> errors? Any ideas?
>>>
>>> In addition, I used hardware ECC in the MTD driver; the error correcting code
>>> is a Hamming code. The NAND chip is MLC, so hardware ECC can't correct
>>> multi-bit errors and MTD returns a read error to yaffs, which may increase
>>> the number of blocks struck out. I wonder how yaffs handles uncorrectable
>>> bit errors in order to keep filesystem data reliability and integrity.
>>> If key yaffs2 data read from NAND has errors in some bits, how can yaffs2
>>> work without crashing?
>>>
>> From all appearances your MTD driver is not properly handling ECC,
>> either in the write or the read. I assume that on reads if you see a
>> single bit-flip and there's no error from MTD, then MTD is *not*
>> applying ECC on the read to correct any flipped bits. It's the job of
>> the MTD driver to properly compute and write the ECC, and then apply the
>> ECC on the read to correct the possible flipped bits - this is why ECC
>> is used in NAND, to improve the reliability of the data to make sure
>> that the UBER (un-correctable bit error) rate is low (somewhere around
>> 10E-15). Without proper ECC NAND can easily show a UBER of 10E-8 or
>> higher which is what I think you are seeing.
>>
> From the kernel log, my MTD driver reported multi-bit flip errors and could
> not correct the bits. The NAND controller only supports single-bit
> flip correction. But the UBER is too high in my devices. My
> devices only worked for about half a year and then many errors were generated.
> May I try a software ECC such as a BCH code to replace the hardware ECC? I
> wonder what the CPU usage of software ECC would be?

To find out the CPU usage of software ECC you'll have to configure/code
it into your kernel, boot it and then measure it...

>> If YAFFS sees errors on reads it increments the strike count and if it
>> hits the limit then it will mark the block bad. This may be what you're
>> seeing. You need to test your MTD driver implementation *independent*
>> of YAFFS to make sure it is operating as expected. Once you *know* your
>> MTD driver works correctly then YAFFS should work fine...
>>
> Yes, I should test the MTD driver implementation. I wrote some code to
> fill a NAND block, read the block, and erase the block. Maybe the code was
> too simple to find the problem. Is there any open source MTD test program
> available?

The MTD drivers in the kernel include test modules; look at
http://www.linux-mtd.infradead.org/doc/general.html#L_mtd_tests for more
information.

>
> Regards,
> Xueqin Chen

-- 
Peter Barada
peter.barada@logicpd.com
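
The strike-count behaviour discussed in this thread can be illustrated with a small, self-contained C sketch. This is not the yaffs2 source: `block_state`, `note_read_result` and `STRIKE_LIMIT` are hypothetical names, and the sketch assumes that every non-zero read result (the -EUCLEAN mentioned above for an ECC-corrected read, or -EBADMSG, which MTD uses for an uncorrectable one) counts as one strike. The limit of 3 is the value quoted in the original question.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EUCLEAN
#define EUCLEAN 117          /* Linux value; fallback for non-Linux hosts */
#endif

/* Hypothetical names throughout; this is not the yaffs2 source. */
#define STRIKE_LIMIT 3       /* strike limit quoted in the thread */

struct block_state {
    int  strikes;            /* chunk error strikes recorded so far */
    bool retired;            /* true once the block has been struck out */
};

/*
 * Record the result of one chunk read on a block.
 *   read_rc == 0         clean read, no strike
 *   read_rc == -EUCLEAN  bit-flips were corrected by ECC
 *   read_rc == -EBADMSG  ECC could not correct the data
 * Assumption: any non-zero result counts as one strike.
 * Returns true only when this call retires the block.
 */
static bool note_read_result(struct block_state *b, int read_rc)
{
    if (read_rc == 0 || b->retired)
        return false;

    b->strikes++;
    if (b->strikes >= STRIKE_LIMIT) {
        b->retired = true;
        return true;
    }
    return false;
}
```

With `STRIKE_LIMIT` at 3, two corrected reads leave the block in service and the third error retires it. Setting it to 1, as the patch in question effectively did, would retire a block on its first -EUCLEAN, even though the data in that case was read back correctly.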