Re: [Yaffs] What management for the paired pages in MLC NAND…

Author: Charles Manning
Date:  
To: yaffs
Subject: Re: [Yaffs] What management for the paired pages in MLC NAND Flash in YAFFS2 ?
On Saturday 01 January 2011 02:32:28 Romain Izard wrote:
> Hello,
>
> I'm currently working on a device that has reliability problems on
> NAND MLC flash, and that must run in an embedded environment. A former
> decision chose UBI+UBIFS for the flash file system, but this is not
> working as well as I would like. I fear that the problem is due to the
> fact that UBI does not handle the problem of paired pages properly, as
> it is stated in the documentation.
>
> The main problem with paired pages is as follows:
> - The multiple bits stored in a NAND Flash cell correspond to bits
> located in multiple pages in a NAND erase block
> - These pages are not contiguous in the writing order of the block,
> but are interleaved with other groups of related pages
> - An interruption (power cut, reset, etc.) during a write operation
> for a page will not only corrupt the contents of this page, but also
> the other pages that share the same NAND Flash cells. If the power cut
> occurs when we are writing to a page whose group already has
> programmed pages before, we will also corrupt those pages.
>
> This problem may not be present in all MLC NAND Flash chips, but it
> exists on the Samsung chip I am using, where each cell contains 2 bits,
> and writing to the second paired page puts the first one at risk.
>
> I tried to see what was done in YAFFS2 for paired pages management,
> but after browsing the documentation, and especially the "NAND Failure
> Mitigation" document, I did not see any information on the subject. A
> rapid overview of the code did not give any additional information.
> However, I noted that even though the documentation states that it is
> a work in progress, MLC NAND is considered as working with YAFFS2.
>
> Is this problem managed specifically ? Or is it that the structure of
> the file system makes the problem irrelevant ? I'm interested in any
> information on this subject.



YAFFS is not RAID and cannot correct for pages that are corrupted. That means
the information contained in those pages will be lost.

YAFFS does not store any essential file system structure on NAND so the file
system itself should not lose integrity.

A corrupt page will manifest in one of three ways:
A) If the block holds checkpoint data then the checkpoint will be corrupted
and will be discarded. A normal scan will recover the file system structure.
No data will be lost.

B) If the corrupted page held data then that data will be lost. If an older
version is found then that will be used. If not, the data will be presented
as a hole in the file.
C) If the corrupted page held an object header then that object header is now
lost.
i. If the object header held a deletion record then we will no longer remember
that we deleted this file and it might come back from the dead.
ii. If the file was renamed and there is an older version of the object header
then the file would appear with its old name.
iii. If there is no older version then the file will appear in lost+found.
This is easily handled by getting yaffs to automatically delete files in
lost+found.
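
The "use an older version if one is found" rule in B and C can be sketched
roughly as below. This is only an illustration, not yaffs code: the struct,
the seq field (a stand-in for whatever ordering information the scan has)
and the function name are all made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one copy of a chunk found while scanning. */
struct chunk_copy {
    unsigned seq;    /* assumed write-order number; larger means newer */
    bool readable;   /* false if the page holding this copy is corrupt */
};

/*
 * Return the index of the newest readable copy, or -1 if no copy
 * survived (the caller would then present a hole, or for an object
 * header place the file in lost+found).
 */
static int pick_chunk_copy(const struct chunk_copy *copies, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!copies[i].readable)
            continue;            /* corrupt copy: ignore it */
        if (best < 0 || copies[i].seq > copies[best].seq)
            best = (int)i;       /* newest good copy so far */
    }
    return best;
}
```

If the newest copy is corrupt and an older one is readable, the older one
wins, which is why a renamed file can reappear under its old name.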

There have been some modifications made to yaffs to reduce MLC problems. In the
past, yaffs would start writing a block from where it left off. If the last
page written was n, then the next page written after a reboot would be n+1 in
the same block. Concerns were raised that interrupted writes could leave the
flash in a state where using the next page could cause problems. This caused
me to change things so that after a reboot yaffs always starts writing on a
fresh block.
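
As a rough sketch of that change (not the actual yaffs allocator; the
struct, sizes and function names here are invented for illustration), the
idea is simply to forget the current allocation block at mount time so the
next write has to pick an empty block:

```c
#define N_BLOCKS        4   /* illustrative sizes only */
#define PAGES_PER_BLOCK 8

/* Hypothetical allocator state. */
struct allocator {
    int pages_used[N_BLOCKS];  /* pages already written in each block */
    int cur_block;             /* block being filled, or -1 for none */
};

/* Called after a reboot: never resume a partially written block. */
static void allocator_on_mount(struct allocator *a)
{
    a->cur_block = -1;
}

static int find_empty_block(const struct allocator *a)
{
    for (int b = 0; b < N_BLOCKS; b++)
        if (a->pages_used[b] == 0)
            return b;
    return -1;
}

/* Return the next page number (block * PAGES_PER_BLOCK + page), or -1. */
static int alloc_page(struct allocator *a)
{
    if (a->cur_block < 0 || a->pages_used[a->cur_block] >= PAGES_PER_BLOCK) {
        a->cur_block = find_empty_block(a);
        if (a->cur_block < 0)
            return -1;         /* no empty block available */
    }
    return a->cur_block * PAGES_PER_BLOCK + a->pages_used[a->cur_block]++;
}
```

The cost is that the unwritten pages of the block in use at power-off are
left unused until that block is next erased.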

There are also some steps that can be taken to minimise the problem. Many, if
not most, platforms should have sufficient residual power in the power supply
to complete a NAND write once it has started. Page corruption is thus caused
by one of two things:
1) The WE* line is connected to the power fail/reset circuitry. This causes
the write to be interrupted as soon as the reset condition is met. Bad idea.
Rather hook the WE line to ground.
2) The flash driver issues the write command without first testing for power
good. Have the driver test for power good before issuing the write command.
A flash programming sequence is of the form:
i. Issue latch set up command
ii. Write address and data into latch.
iii. Issue programming command
iv. Wait for write complete

A power failure any time before iii does not corrupt NAND. Problems occur
because the power fails during the actual programming.

Thus, if a power good flag is visible to the CPU then the following sequence
can be used to make the above power safe:

i. Issue latch set up command
ii. Write address and data into latch.
iii. Wait until power good.
iv. Issue programming command
v. Wait for write complete
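
A minimal sketch of that sequence, run against a simulated chip so it can be
tried on a PC. Everything here (struct sim_nand, power_good(), the tiny page
size, the poll limit) is an assumption made for the example; a real driver
would talk to its own NAND controller and power-good input instead.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE 16   /* tiny page, for illustration only */

/* Hypothetical simulated NAND page plus a power-good input. */
struct sim_nand {
    uint8_t latch[SIM_PAGE_SIZE];  /* data latch inside the chip */
    uint8_t cell[SIM_PAGE_SIZE];   /* the page in the flash array */
    int power_good_after;          /* polls before power reads good */
    int polls;
};

static void sim_nand_init(struct sim_nand *n, int power_good_after)
{
    memset(n->latch, 0xFF, sizeof(n->latch));
    memset(n->cell, 0xFF, sizeof(n->cell));   /* erased state */
    n->power_good_after = power_good_after;
    n->polls = 0;
}

static bool power_good(struct sim_nand *n)
{
    return n->polls++ >= n->power_good_after;
}

/* Returns 0 on success, -1 if power did not become good in time. */
static int program_page_power_safe(struct sim_nand *n, const uint8_t *data,
                                   int max_polls)
{
    int tries = 0;

    /* i, ii: set up the latch and load address and data. The array is
     * not touched yet, so a power cut here does no harm. */
    memcpy(n->latch, data, SIM_PAGE_SIZE);

    /* iii: wait until power good, giving up after max_polls. */
    while (!power_good(n)) {
        if (++tries >= max_polls)
            return -1;             /* array still untouched */
    }

    /* iv: issue the programming command. */
    memcpy(n->cell, n->latch, SIM_PAGE_SIZE);

    /* v: wait for write complete (instant in this simulation). */
    return 0;
}
```

The check only helps if the supply can hold up for one full page program
time after power good was last seen true.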


-- Charles