Charles,

I don't think the double-mapping problem applies here. Before this problem was found, my testing involved filling up the entire 190 GB partition and draining it one file at a time while validating each file. That test passed several times in a row.
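For anyone wanting to reproduce that kind of fill-and-drain test, here is a minimal sketch of the idea. It is not our actual test harness; the function names, file naming scheme and payload generator are all made up for illustration, and the directory would be the mount point of the partition under test.

```python
import hashlib
import os


def make_payload(index: int, size: int) -> bytes:
    """Build a deterministic payload so every file can be regenerated."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def fill(directory: str, count: int, size: int) -> dict:
    """Write `count` files and return {path: expected SHA-256 hex digest}."""
    digests = {}
    for i in range(count):
        path = os.path.join(directory, f"file_{i:05d}.bin")
        data = make_payload(i, size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data down to the filesystem
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def drain(digests: dict) -> list:
    """Validate then delete each file; return the paths that failed."""
    failures = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            failures.append(path)
        os.remove(path)
    return failures
```

A run is clean when drain() returns an empty list and the directory is empty afterwards.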

Our latest theory was that it was continuing to write in a low-power state and might end up writing random data to potentially random locations. Our last revision of the driver should fix a bug we found that might have allowed this, but nothing has changed. Even if it was writing garbage, I have done a nanddump of the blocks earlier in the address range than the file being written and validated their contents. None of the previously existing data is being corrupted during the write, which is why I can mount the partition read-only and read back all of the files.
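The comparison of the dumps amounts to something like the sketch below. It assumes two dumps of the same address range (one taken before the interrupted write, one after) that have already been read into memory, and a block size that matches the stride of the dump; the function name is hypothetical.

```python
def differing_blocks(before: bytes, after: bytes, block_size: int) -> list:
    """Return the indexes of blocks whose contents differ between two dumps."""
    if len(before) != len(after):
        raise ValueError("dumps must cover the same address range")
    changed = []
    for offset in range(0, len(before), block_size):
        end = offset + block_size
        if before[offset:end] != after[offset:end]:
            changed.append(offset // block_size)
    return changed
```

An empty result for the range below the file being written is what I mean by "none of the previously existing data is being corrupted".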

There is a smaller partition (~400 blocks instead of ~47500) which I tried using for testing. I cannot cause the failure in this smaller partition no matter what I try, so an even smaller one seems still less likely to show a failure. I'd also have to deliberately slow down our driver to cut power in time for a write under 80 MB in size (doable, but might obscure the problem).
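For a sense of scale, the arithmetic behind those block counts can be sketched as follows. The 4 MB block and 32 KB page sizes are the ones from my original report; treating "MB" and "KB" as binary units (MiB, KiB) is an assumption on my part, and the constant and function names are only for illustration.

```python
MIB = 1024 * 1024
BLOCK_BYTES = 4 * MIB    # erase block size from the original report (assumed MiB)
PAGE_BYTES = 32 * 1024   # page size from the original report (assumed KiB)


def partition_bytes(blocks: int, block_bytes: int = BLOCK_BYTES) -> int:
    """Total capacity of a partition made of `blocks` erase blocks."""
    return blocks * block_bytes


def pages_per_block(block_bytes: int = BLOCK_BYTES,
                    page_bytes: int = PAGE_BYTES) -> int:
    """Number of pages (yaffs chunks) in one erase block."""
    return block_bytes // page_bytes


large = partition_bytes(47500) // MIB   # 190000 MiB, the "190 GB" partition
small = partition_bytes(400) // MIB     # 1600 MiB, the small test partition
```

So the small partition is already more than a hundred times smaller than the one that fails.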

I've used nandwrite/nanddump to write/read this partition extensively, yes. I never lose data this way, aside from the partial write in progress at the moment of power loss. The data is only erased after yaffs marks all the blocks as unused and performs garbage collection.

My colleague is in the process of modifying UBIFS to work with DMA so that we can test whether the problem still exists with a different filesystem...

Thanks,
Hunter

On Mon, Feb 20, 2017 at 2:27 PM, Charles Manning <cdhmanning@gmail.com> wrote:


On Tue, Feb 21, 2017 at 8:17 AM, Hunter Somerville <hsomervi5790@gmail.com> wrote:

On Thu, Feb 9, 2017 at 4:23 PM, Charles Manning <cdhmanning@gmail.com> wrote:
Hi Hunter

On Fri, Feb 10, 2017 at 8:57 AM, Hunter Somerville <hsomervi5790@gmail.com> wrote:
On Tue, Feb 7, 2017 at 3:44 PM, Charles Manning <cdhmanning@gmail.com> wrote:
On Tue, Feb 7, 2017 at 5:19 AM, Hunter Somerville <hsomervi5790@gmail.com> wrote:
Hello,

We are encountering an issue where we usually lose an entire partition of data if the flash device loses power during a write operation. When we bring the system back up and remount, all files and directories appear as long strings of question marks with incorrect filenames, and we end up having to flash erase the partition to recover. This only happens on the device with fairly large pages (4 MB erase blocks, 32 KB pages, 1 KB OOB), and does not occur on the more typical device in the same system, which uses 4 KB pages.

What kind of flash are you using? What part number?

The hardware is proprietary, and not designed by us. What I can tell you is that we interface with an FPGA - not the flash chips directly. The FPGA performs the writes.

Surely the flash parts are off the shelf.

I'm getting permission on this. They're Samsung parts.

We've discovered that mounting the partition as read-only after power loss shows that the data is all present and correct, aside from the file which was actively being written. I can read back any of the files and verify their contents. If at any point I mount this partition as read-write after the power loss, yaffs appears to mark all blocks as unused and then proceeds to garbage collect every block. My files all slowly disappear.

yaffs: Collecting block 3, in use 1, shrink 0, whole_block 0
yaffs: Collecting block 3 that has no chunks in use
yaffs: yaffs_block_became_dirty block 3 state 8
yaffs: yaffs_tags_marshall_read chunk 256 data ef1f0000 tags ef6f5cd8
yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser 0 seq 0
yaffs: yaffs_tags_marshall_read chunk 257 data ef1f0000 tags ef6f5cd8
yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser 0 seq 0
.......
yaffs: yaffs_tags_marshall_read chunk 382 data ef1f0000 tags ef6f5cd8
yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser 0 seq 0
yaffs: yaffs_tags_marshall_read chunk 383 data ef1f0000 tags ef6f5cd8
yaffs: packed tags obj -1 chunk -1 byte -1 seq -1
yaffs: ext.tags eccres 1 blkbad 0 chused 0 obj 0 chunk0 byte 0 del 0 ser 0 seq 0
yaffs: Erased block 3
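
To make a trace like the one above easier to scan across many blocks, something like the following sketch could pair each tag read with its ext.tags line. The regular expressions are written against the line formats shown in this trace only; the function name is hypothetical and other yaffs versions may print these lines differently.

```python
import re

READ_RE = re.compile(r"yaffs_tags_marshall_read chunk (\d+)")
EXT_RE = re.compile(r"ext\.tags eccres (\d+) blkbad (\d+) chused (\d+)")


def summarize_tag_reads(lines):
    """Pair each tag read with the ext.tags line that follows it."""
    results = []
    current_chunk = None
    for line in lines:
        match = READ_RE.search(line)
        if match:
            current_chunk = int(match.group(1))
            continue
        match = EXT_RE.search(line)
        if match and current_chunk is not None:
            results.append({
                "chunk": current_chunk,
                "eccres": int(match.group(1)),
                "blkbad": int(match.group(2)),
                "chused": int(match.group(3)),
            })
            current_chunk = None
    return results
```

Filtering the result for entries with chused equal to 0 lists the chunks yaffs considered unused. In the trace above the reads run from chunk 256 to chunk 383, which is 128 chunks, the same as 4 MB divided by 32 KB.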

I can't yet figure out why it's marking these blocks as unused when there are clearly files present. Any help on this matter would be greatly appreciated.

Hello Hunter

That sounds pretty weird.

The only time I've ever seen something like that happen was when there was a bug in the driver so that the flash got mapped twice (i.e. the driver said the part was, say, 32 MB but was actually just accessing the first 16 MB twice).
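
One way to check for that kind of double mapping is to write a unique signature into every block and then read them all back; if a block returns another block's signature, two addresses share the same storage. The sketch below shows the idea against an in-memory stand-in. FakeFlash and find_aliased_blocks are hypothetical names, not part of any driver or of yaffs, and a real test would go through the actual driver and erase each block before writing.

```python
class FakeFlash:
    """In-memory stand-in for a flash driver (hypothetical, illustration only)."""

    def __init__(self, reported_blocks: int, real_blocks: int):
        self.reported_blocks = reported_blocks
        self._real_blocks = real_blocks
        self._store = {}

    def write_block(self, index: int, data: bytes) -> None:
        # A buggy driver wraps addresses beyond the real capacity.
        self._store[index % self._real_blocks] = data

    def read_block(self, index: int) -> bytes:
        return self._store.get(index % self._real_blocks, b"")


def find_aliased_blocks(dev) -> list:
    """Write a unique signature per block, then list blocks that lost theirs."""
    for i in range(dev.reported_blocks):
        dev.write_block(i, b"block-%08d" % i)
    aliased = []
    for i in range(dev.reported_blocks):
        if dev.read_block(i) != b"block-%08d" % i:
            aliased.append(i)
    return aliased
```

With a correctly mapped device the list is empty. With a device that reports 8 blocks but only has 4, the first 4 blocks come back holding the signatures written to the last 4.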

When you get issues like this it is often also a good idea to first just try a small partition (say 20 blocks). That way there's a lot less detail and you can maybe spot the patterns more quickly.

If you're using Linux, have you tried testing the drivers by just using the MTD tools (mtd-utils) to run tests?

-- Charles