Arvind,

On Friday 29 June 2007 17:30, Arvind Agrawal wrote:
> O.K. I dug into the YAFFS2 yaffs_mtdif2.c and mtd/nand
> code and found a potential BUG which may cause large numbers
> of BLOCKs to be marked bad. I have not yet figured out what
> conditions may cause this BUG to show up...
>
> yaffs2 calls mtd->write_oob(mtd, addr, &ops) with ops.datbuf
> and ops.oobbuf both set.
> This translates (in linux-2.6.20) into nand_do_write_ops().
>
> This function memsets "chip->oob_poi" to 0xFF ONLY IF oob is
> NULL. Otherwise, as in the case of yaffs2 writes, nand_fill_oob()
> is called, which fills in the buffer "chip->oob_poi" starting
> at offset "chip->ecc.layout->oobfree->offset". For large page
> NANDs that offset is set to 2, because bytes 0 and 1 are used
> for BAD BLOCK marking.
>
> This assumes that "chip->oob_poi" is always (at least bytes 0
> and 1) initialised to 0xFF.
> Nowhere in the code did I notice it being initialised to 0xFF,
> and probably the only reason it works is that the code is also
> doing nand_read_oob(), which initialises the buffer, so the first
> 2 bytes of chip->oob_poi will be 0xFF as they are being read
> from good blocks.
>
> But once chip->oob_poi has or gets non-0xFF bytes in its first 2
> bytes, any data written from then on by YAFFS2 will turn all the
> blocks written into BAD blocks, and that's what I have seen in
> TWO instances of excessive and consecutive blocks marked bad.
>
> Now looking at the code, I have not figured out if there is any
> other condition where the first 2 bytes of chip->oob_poi can be
> initialised to non-0xFF values. The only condition I could think
> of is a very long shot, and can be caused by bit flipping on byte
> 0 when doing a nand_read_oob(). A 1-bit flip in the data buffer
> may be corrected by ECC, but on the OOB bad block bytes no action
> is taken.
> But then again, bit flipping may be caused on BLOCKs which are
> in a kind of wearing-out state and should not happen on new NAND
> chips.
>
> I need input on this from MTD and YAFFS gurus or anybody else
> who may have seen similar issues.
> First, do you agree with my analysis, and if yes, can you think
> of any other situation which may cause this BUG(??) to pop
> up?

Arvind,

I have just looked over the code and concur with you that this is a
problem. I don't see any simple/reliable fix that could be included in
Yaffs code as a workaround. Perhaps we should prepare a patch to
include with Yaffs.

> But in any case, in function nand_do_write_ops() in nand_base.c
> (linux-2.6.20 onwards) we should probably replace
>
>     /* If we're not given explicit OOB data, let it be 0xFF */
>     if (likely(!oob))
>         memset(chip->oob_poi, 0xff, mtd->oobsize);
>
> with
>
>     /* If we're not given explicit OOB data, let it be 0xFF */
>     if (likely(!oob))
>         memset(chip->oob_poi, 0xff, mtd->oobsize);
>     else
>         memset(chip->oob_poi, 0xff,
>                chip->ecc.layout->oobfree->offset);

Perhaps simply do the memset unconditionally -- it's less work than
running through the ecc.layout->oobfree array to figure out what to
0xff, and the data is needed (in cache) for update and writing out to
NAND shortly thereafter.

-imcd
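
For illustration only, here is a minimal user-space sketch of the failure
mode discussed above. It is not kernel code: oob_poi, fill_oob(),
prepare_oob_current(), prepare_oob_fixed() and marker_is_clean() are
hypothetical stand-ins, the 64-byte OOB size is an assumption for a
large page device, and fill_oob() is a simplified model of what
nand_fill_oob() is described as doing (copying client OOB data starting
at the free offset).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OOB_SIZE       64 /* assumed OOB bytes per page (large page NAND) */
#define OOBFREE_OFFSET 2  /* bytes 0 and 1 hold the bad block marker */

/* Hypothetical stand-in for chip->oob_poi; it persists between writes. */
static unsigned char oob_poi[OOB_SIZE];

/* Simplified model of nand_fill_oob(): copy the client's OOB data in,
 * starting at the free offset. Bytes 0 and 1 are never touched. */
void fill_oob(const unsigned char *src, size_t len)
{
    assert(len <= OOB_SIZE - OOBFREE_OFFSET);
    memcpy(oob_poi + OOBFREE_OFFSET, src, len);
}

/* Behaviour as described for 2.6.20: preset to 0xFF only when the
 * caller supplies no OOB data. */
void prepare_oob_current(const unsigned char *oob, size_t len)
{
    if (!oob)
        memset(oob_poi, 0xff, OOB_SIZE);
    else
        fill_oob(oob, len);
}

/* Suggested behaviour: preset unconditionally, then overlay client data. */
void prepare_oob_fixed(const unsigned char *oob, size_t len)
{
    memset(oob_poi, 0xff, OOB_SIZE);
    if (oob)
        fill_oob(oob, len);
}

/* Returns 1 if the marker bytes about to be written read as "good". */
int marker_is_clean(void)
{
    return oob_poi[0] == 0xff && oob_poi[1] == 0xff;
}
```

To try it, set oob_poi[0] to some non-0xFF value (simulating a stale byte
left over from an earlier read) and call prepare_oob_current() with a
non-NULL buffer: marker_is_clean() then returns 0, i.e. the page would be
written with a bad block marker. With prepare_oob_fixed() the marker
bytes are back to 0xFF regardless of what was in the buffer before.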