Re: [Yaffs] bad block management

Author: bpqw
Date:  
To: Charles Manning, yaffs@lists.aleph1.co.uk
CC: bpqw
Subject: Re: [Yaffs] bad block management
Hi Charles,
We recommend that when the bitflips exceed the threshold, the block is simply refreshed, not retired.
So is it reasonable to judge a block as bad just because its bitflips have exceeded
mtd->bitflip_threshold more than three times?

Br
White Ding
____________________________
EBU APAC Application Engineering
Tel:86-21-38997078
Mobile: 86-13761729112
Address: No 601 Fasai Rd, Waigaoqiao Free Trade Zone Pudong, Shanghai, China

-----Original Message-----
From: Charles Manning [mailto:cdhmanning@gmail.com]
Sent: Wednesday, August 06, 2014 8:21 AM
To:
Cc: bpqw
Subject: Re: [Yaffs] bad block management

On Friday 25 July 2014 16:50:25 bpqw wrote:
> Hi
>
> I have reviewed the yaffs2 source code and have a question. See the following.
>
>
>
> In Yaffs2 the read interface is yaffs_rd_chunk_tags_nand:
>
> int yaffs_rd_chunk_tags_nand(struct yaffs_dev *dev, int nand_chunk,
>                              u8 *buffer, struct yaffs_ext_tags *tags)
> {
>       .........
>
>       result = dev->tagger.read_chunk_tags_fn(dev, flash_chunk,
>                                               buffer, tags);
>
>       if (tags && tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR) {
>             struct yaffs_block_info *bi;
>
>             bi = yaffs_get_block_info(dev,
>                                 nand_chunk /
>                                 dev->param.chunks_per_block);
>             yaffs_handle_chunk_error(dev, bi);
>       }
>
>       return result;
> }
>
>
>
> yaffs_rd_chunk_tags_nand ends up calling the mtd interface mtd_read_oob:
>
>
>
> int mtd_read_oob(struct mtd_info *mtd, loff_t from, struct mtd_oob_ops *ops)
> {
>       int ret_code;
>
>       ops->retlen = ops->oobretlen = 0;
>       if (!mtd->_read_oob)
>             return -EOPNOTSUPP;
>       /*
>        * In cases where ops->datbuf != NULL, mtd->_read_oob() has semantics
>        * similar to mtd->_read(), returning a non-negative integer
>        * representing max bitflips. In other cases, mtd->_read_oob() may
>        * return -EUCLEAN. In all cases, perform similar logic to mtd_read().
>        */
>       ret_code = mtd->_read_oob(mtd, from, ops);
>       if (unlikely(ret_code < 0))
>             return ret_code;
>       if (mtd->ecc_strength == 0)
>             return 0;   /* device lacks ecc */
>       return ret_code >= mtd->bitflip_threshold ? -EUCLEAN : 0;
> }
>
>
>
> So if the number of bitflips reaches or exceeds mtd->bitflip_threshold,
> mtd_read_oob returns -EUCLEAN and tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR.
>
> yaffs_handle_chunk_error is then called:
>
> void yaffs_handle_chunk_error(struct yaffs_dev *dev,
>                               struct yaffs_block_info *bi)
> {
>       if (!bi->gc_prioritise) {
>             bi->gc_prioritise = 1;
>             dev->has_pending_prioritised_gc = 1;
>             bi->chunk_error_strikes++;
>
>             if (bi->chunk_error_strikes > 3) {
>                   bi->needs_retiring = 1; /* Too many strikes, so retire */
>                   yaffs_trace(YAFFS_TRACE_ALWAYS,
>                         "yaffs: Block struck out");
>             }
>       }
> }
>
>
>
> From the code we can see that if the number of bitflips reaches
> mtd->bitflip_threshold, the block is prioritised for gc, and if this
> happens more than three times the block is marked as a bad block.
>
> We define a bad block as one on which an erase or program operation
> has failed.
>
> So is it reasonable to judge a block as bad just because its bitflips
> have reached mtd->bitflip_threshold more than three times?
>
> What is your opinion on this?


Hello White Ding

I apologise for taking a while to get back to looking at this.

First let me explain the history behind what is there.

In the beginning, there was SLC and Yaffs only supported three levels:
* Good: No ECC errors.
* Single bit ECC error: data is recoverable, but we are worried about a future failure.
* Multi-bit ECC error: bad.

In the beginning, the concern was that blocks with a single bit error were on their way to going bad, so we had better retire them soon.

Then bits got a bit worse, so we modified the policy slightly. A block with a single bit error got rewritten, but if too many errors were observed then we retired the block.

Then with MLC and multi-bit ECC we moved up to a new step. Single bit errors became common. Yaffs kept the same basic policy, but the drivers (at mtd level) started telling "lies".

For example, in a multi-bit ECC system that fixes 4 bits, we might see:
0-2 bit errors are reported as zero errors.
3-4 bit errors are reported as -EUCLEAN.
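
As a rough illustration of the reporting described above, here is a minimal C sketch of how a bitflip count might be collapsed into a status code. The helper name collapse_read_status and its parameters are hypothetical and are not part of mtd or Yaffs. The threshold comparison mirrors the quoted mtd_read_oob code, and the -EBADMSG case for more flips than the ECC can fix is an assumption based on the usual mtd convention.

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for non-Linux builds (assumption) */
#endif

/*
 * Hypothetical helper, not an mtd or Yaffs function.
 * max_bitflips:      worst bitflip count seen in the read
 * ecc_strength:      number of bits the ECC can correct (4 in the example)
 * bitflip_threshold: count at which the read is flagged (3 in the example)
 */
static int collapse_read_status(int max_bitflips, int ecc_strength,
                                int bitflip_threshold)
{
	if (max_bitflips > ecc_strength)
		return -EBADMSG;   /* more flips than the ECC can correct */
	if (max_bitflips >= bitflip_threshold)
		return -EUCLEAN;   /* corrected, but flagged */
	return 0;                  /* corrected flips are hidden from the caller */
}
```

With ecc_strength = 4 and bitflip_threshold = 3, counts of 0-2 come back as 0 and counts of 3-4 come back as -EUCLEAN, so the caller cannot tell a 3-bit read from a 4-bit read.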

This is essentially the logic you are talking about here, but I need to dig into the mtd terminology a bit better to understand this fully.

Some flash parts (e.g. Micron MT29F8Gxxx parts) with built-in ECC do not report the number of bit errors, but just a "please refresh" indicator.

I think we are now getting to a point where increasing numbers of bit errors are expected and should not be treated as a failure.

Thus we probably need a new level that does a refresh, but does not apply the three strikes failure policy.

For example, for a part that supports 6-bit correction we might want something like this:
0-2: These are expected, do nothing.
3-4: Refresh. Do not retire.
5-6: It looks like the block is failing. Suck the data off and retire if this happens too often.
7+: Data is corrupted.
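
The banding above could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not existing Yaffs code: every name here (flip_policy, block_state, classify_bitflips, note_read) is hypothetical, the band limits are those from the 6-bit example, and the strike limit is left as a parameter because "too often" is not defined above.

```c
#include <stdbool.h>

/* Hypothetical names throughout; none of this exists in Yaffs or mtd. */
enum flip_action {
	FLIP_IGNORE,             /* expected noise, do nothing */
	FLIP_REFRESH,            /* rewrite the data, do not count a strike */
	FLIP_REFRESH_AND_STRIKE, /* rewrite the data and count a strike */
	FLIP_DATA_LOST           /* beyond what the ECC can correct */
};

struct flip_policy {
	int expected_max; /* 2 in the example */
	int refresh_max;  /* 4 in the example */
	int ecc_strength; /* 6 in the example */
	int strike_limit; /* assumption: retire once strikes exceed this */
};

struct block_state {
	bool needs_refresh;
	bool needs_retiring;
	int strikes;
};

static enum flip_action classify_bitflips(const struct flip_policy *p,
					  int bitflips)
{
	if (bitflips <= p->expected_max)
		return FLIP_IGNORE;
	if (bitflips <= p->refresh_max)
		return FLIP_REFRESH;
	if (bitflips <= p->ecc_strength)
		return FLIP_REFRESH_AND_STRIKE;
	return FLIP_DATA_LOST;
}

/* Update per-block state for one read; only the top band counts strikes. */
static enum flip_action note_read(const struct flip_policy *p,
				  struct block_state *b, int bitflips)
{
	enum flip_action action = classify_bitflips(p, bitflips);

	if (action == FLIP_REFRESH || action == FLIP_REFRESH_AND_STRIKE)
		b->needs_refresh = true;
	if (action == FLIP_REFRESH_AND_STRIKE && ++b->strikes > p->strike_limit)
		b->needs_retiring = true;
	return action;
}
```

The point of the design is that a read in the 3-4 band only sets needs_refresh, so a block that is refreshed repeatedly is never retired for that reason alone. A scheme like this needs the real bitflip count, which the parts that only raise a "please refresh" indicator do not provide.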

If there are enough bits to make bands like this then it makes sense. However, parts that hide the bad bits behind an ONFI-like interface do not really give us the data we need to make fine-grained decisions.

I hope that helps.

-- Charles