Re: [Yaffs] Complete partition data loss on powercut during write

Author: Charles Manning
Date:  
To: Hunter Somerville
CC: yaffs
Subject: Re: [Yaffs] Complete partition data loss on powercut during write
On Tue, Feb 21, 2017 at 9:08 AM, Hunter Somerville <>
wrote:

> Charles,
>
> I don't think the mapping thing is the case. Before this problem was
> found, my testing involved filling up the entire 190GB partition and
> draining it out one file at a time while validating each file. This would
> happen several times successfully.
>
> Our latest theory was that it was continuing to write in a low-power state
> and maybe ending up writing random stuff in potentially random locations.
> Our last revision of the driver should fix a bug we found that might allow
> this to happen even if it was writing garbage and nothing has changed. That
> said, I have done a nanddump of the blocks earlier in the address range
> than the file being written and validated their contents. None of the
> previously existing data is being corrupted during the write. Hence I
> can mount the partition read-only and read back all of the files.
>
> There is a smaller partition (~400 blocks instead of ~47500) which I tried
> using for testing. I cannot cause the failure in this smaller partition no
> matter what I try. An even smaller one seems even less likely to hit a
> failure.
> I'd also have to deliberately slow down our driver to cut power in time for
> a write under 80MB in size (doable, but might obscure the problem).
>


47500 blocks of 4MB (=128 pages of 32k) is a total of around 6 million
pages. That's pretty large. I'll do some calcs to see if this could be a
number space issue.
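
The arithmetic above can be sanity-checked quickly. This sketch uses only the figures quoted in this thread (47500 blocks, 4MB erase blocks, 32KB pages) and, as one illustration of where a "number space" limit could bite, counts the bits needed to address every page:

```python
# Geometry quoted in this thread.
blocks = 47500
block_size = 4 * 1024 * 1024   # 4 MB erase block
page_size = 32 * 1024          # 32 KB page

pages_per_block = block_size // page_size   # 128
total_pages = blocks * pages_per_block      # 6,080,000 -- about 6 million

# Bits needed to number every page; an internal field narrower
# than this could wrap or collide on a partition this big.
bits_needed = total_pages.bit_length()      # 23

print(pages_per_block, total_pages, bits_needed)
```

So any field narrower than 23 bits would be suspect at this partition size, whereas the ~400-block test partition (~51,200 pages, 16 bits) would never exercise it, which would fit the observed behaviour.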

Sometimes it also helps to turn off some features to see if that makes a
difference. I'm not recommending running with those features off in
production; it's just a way to isolate the issue.

The two major features are checkpoint and block summaries.
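
If the partition is mounted through the Linux yaffs driver, both features can usually be disabled at mount time. This is a sketch only: the option names below assume a reasonably recent yaffs2 driver build, and the device node and mount point are placeholders; check the option table in your version's yaffs_vfs.c for what it actually parses.

```shell
# Placeholder device/mount point; option names assume a recent
# yaffs2 Linux driver (see yaffs_vfs.c in your tree).
mount -t yaffs2 -o no-checkpoint,disable-summary \
      /dev/mtdblock3 /mnt/flash
```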


> I've used nandwrite/nanddump to write/read this partition extensively,
> yes. I never lose data this way, aside from the partial write happening
> during the powerloss. The data is only erased after yaffs marks all the
> blocks as unused and performs garbage collection.
>
> My colleague is in the process of modifying UBIFS to work with DMA such
> that we can test if the problem still exists with a different filesystem...
>


If you can't make it fail with mtdtools then a filesystem should not change
things.
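
For reference, a driver-level check with the mtdtools might look like the following sketch. The device node and file names are placeholders, and note that nandtest rewrites the device (the -k flag restores each block's previous contents afterwards):

```shell
# Stress the raw driver; -k restores existing contents, -p sets passes.
# /dev/mtd3 is a placeholder for the big partition's mtd device.
nandtest -k -p 4 /dev/mtd3

# Raw write/read/compare of one erase block (4 MB), again with
# placeholder file names:
flash_erase /dev/mtd3 0 1
nandwrite -p /dev/mtd3 testpattern.bin
nanddump -f readback.bin -l 4194304 /dev/mtd3
cmp testpattern.bin readback.bin
```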

Charles


> Thanks,
> Hunter
>
> On Mon, Feb 20, 2017 at 2:27 PM, Charles Manning <>
> wrote:
>
>>
>>
>> On Tue, Feb 21, 2017 at 8:17 AM, Hunter Somerville <
>> > wrote:
>>
>>>
>>> On Thu, Feb 9, 2017 at 4:23 PM, Charles Manning <>
>>> wrote:
>>>
>>>> Hi Hunter
>>>>
>>>> On Fri, Feb 10, 2017 at 8:57 AM, Hunter Somerville <
>>>> > wrote:
>>>>
>>>>> On Tue, Feb 7, 2017 at 3:44 PM, Charles Manning <>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Feb 7, 2017 at 5:19 AM, Hunter Somerville <
>>>>>> > wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We are encountering an issue where we will usually lose an entire
>>>>>>> partition of data if the flash device loses power during a write operation.
>>>>>>> When we bring the system back up and remount, all files/directories appear
> as long strings of question marks with incorrect filenames and such, and we
>>>>>>> end up having to flash erase the partition to recover. This only happens on
>>>>>>> the device with fairly large pages (4MB Erase blocks, 32KB pages, 1KB OOB),
>>>>>>> and does not occur on the more typical device in the same system which uses
>>>>>>> 4KB pages.
>>>>>>>
>>>>>>
>>>>>> What kind of flash are you using? What part number?
>>>>>>
>>>>>
>>>>> The hardware is proprietary, and not designed by us. What I can tell
>>>>> you is that we interface with an FPGA - not the flash chips directly. The
>>>>> FPGA performs the writes.
>>>>>
>>>>
>>>> Surely the flash parts are off the shelf.
>>>>
>>>
>>> I'm getting permission on this. They're Samsung parts.
>>>
>>> We've discovered that mounting the partition as read-only after
>>> powerloss demonstrates that the data is all present and correct, aside from
>>> the file which was actively being written. I can read back any of the files
>>> and verify their contents. If at any point I mount this partition as
>>> read-write after the powerloss, yaffs appears to mark all blocks as unused
>>> and then proceeds to garbage collect every block. My files all slowly
>>> disappear.
>>>
>>> yaffs: Collecting block 3, in use 1, shrink 0, whole_block 0
>>> yaffs: Collecting block 3 that has no chunks in use
>>> yaffs: yaffs_block_became_dirty block 3 state 8
>>> yaffs: yaffs_tags_marshall_read chunk 256 data ef1f0000 tags ef6f5cd8
>>> yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
>>> yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser
>>> 0 seq 0
>>> yaffs: yaffs_tags_marshall_read chunk 257 data ef1f0000 tags ef6f5cd8
>>> yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
>>> yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser
>>> 0 seq 0
>>> .......
>>> yaffs: yaffs_tags_marshall_read chunk 382 data ef1f0000 tags ef6f5cd8
>>> yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
>>> yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser
>>> 0 seq 0
>>> yaffs: yaffs_tags_marshall_read chunk 383 data ef1f0000 tags ef6f5cd8
>>> yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
>>> yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser
>>> 0 seq 0
>>> yaffs: Erased block 3
>>>
>>> I can't yet figure out why it's marking these blocks as unused when
>>> there are clearly files present. Any help on this matter would be greatly
>>> appreciated.
>>>
>>
>> Hello Hunter
>>
>> That sounds pretty weird.
>>
>> The only time I've ever seen something like that happen was when there
>> was a bug in the driver so that the flash got mapped twice. (ie the driver
>> said the part was, say, 32 MB but was actually just accessing the first
>> 16MB twice).
>>
>> When you get issues like this it is often also a good thing to first just
>> try a small partition (say 20 blocks). That way there's a lot less detail
>> and you can maybe spot the patterns quicker.
>>
>> If you're using Linux, have you tried testing the drivers by just using
>> the mtdtools to run tests?
>>
>> -- Charles
>>
>>
>