On Thursday 08 December 2005 00:58, Jon Masters wrote:
> > The Linux cache is write-through so these calls were observed to be slow
> > under Linux too and enabling the short op cache fixed the problem. From
> > then on, the shortopcache has been enabled by default.
>
> I agree with that. In the testing I did it was reading and writing
> very large files sequentially (in line with the requirements) but I can
> see the possible problem.
>
> > 2) Perhaps only using the short op cache for write operations would be
> > the best way to do things under Linux?
>
> That would be the best thing to do. Otherwise we just waste time on
> reads when the page cache will get populated by YAFFS2 anyway after a
> readpage.
>
> > > * YAFFS2 memory allocation using kmalloc does not work on very large
> > > devices and needs to use vmalloc instead in those cases (>2GB
> > > devices). The lack of checking for success proves to be a problem.
> >
> > I think this only impacts on the creation of the huge chunk bitmap
> > structure. If so, this was dealt with in
> > http://www.aleph1.co.uk/cgi-bin/viewcvs.cgi/yaffs2/yaffs_guts.c?r1=1.20&r2=1.21
> > Andre tested this, IIRC, and this fixed the problem.
>
> His hack would seem to fix that problem.
>
> > Is more required?
>
> I think YAFFS2 wants to decide how it is allocating memory. We have a
> limit on vmalloc space too (though it's pretty big) so getting away
> from unbounded allocations and having smaller buffers may become
> necessary on very large devices.

The Bluewater (Andre's) hack, IIRC, substituted vmalloc for kmalloc on a
global basis and seemed to work for them (I live in the same town and am
personal friends with some of these guys so I tend to hear about probs
quickly :-)).

I did have a concern that vmallocing might be more limited or more
ponderous than kmalloc, so I changed the strategy. As it is now, I try
kmalloc first, but in the one or two places where it can fail because of
the 128kB size limit, I fall back to vmalloc if kmalloc fails (rough
sketch below). This, I think, is the best approach since it still uses
kmalloc for the bulk of allocations (tnodes etc).

For Linux, I guess looking at slab allocation rather than self-managed
buffers might be a good idea in the future.

> > Yes, definitely the handling of alloc failures is a bit sloppy.
>
> That is the main problem - you don't know things are failing until you
> guess that's what is happening (reading comments along the lines of
> "we should probably check if this fails" was helpful, I'll grant).
>
> > > * YAFFS2 has various internal usage of types which makes it difficult
> > > to scale to >2GB devices. We have to divide up into multiple
> > > partitions.
> >
> > Can you give some details? I would like to fix this. There are some
> > places where ints are being used where off_t would be correct.
>
> That sort of thing. I started doing wholesale replacements but YAFFS2
> is corrupting kernel memory and causing untold troubles when devices
> are over 2GB. There seem to be a few places that I missed and I didn't
> have a continued brief to look at it - certainly I'd go through and
> fix this use of ints (and typecasts).

The code is pretty much layered: at the file level it works in bytes,
below that it uses a chunk model, and below that again (when talking to
mtd) it uses bytes once more. It probably makes sense to use off_t for
byte addresses (ie. the vfs and mtd interfacing), and ints are probably
OK for chunks for a while yet (2^31 chunks == 4TB or so). When YAFFS was
first written 32MB was big and 128MB was huge :-).
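To make the allocation strategy concrete, the fallback looks something
like this (a sketch only - yaffs_alloc_big and yaffs_free_big are
made-up names, not what is in CVS):

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

/*
 * Sketch: try kmalloc first, fall back to vmalloc for the one or two
 * allocations that can exceed the ~128kB kmalloc limit. GFP_NOFS so
 * we don't recurse into the filesystem under memory pressure.
 */
static void *yaffs_alloc_big(size_t size)
{
	void *mem = kmalloc(size, GFP_NOFS);

	if (!mem)
		mem = vmalloc(size);
	return mem;
}

/*
 * Free with the matching routine. vmalloc memory lives in its own
 * address range (VMALLOC_START/VMALLOC_END, from the arch headers),
 * so the pointer itself tells us which allocator was used.
 */
static void yaffs_free_big(void *mem)
{
	unsigned long addr = (unsigned long) mem;

	if (!mem)
		return;
	if (addr >= VMALLOC_START && addr < VMALLOC_END)
		vfree(mem);
	else
		kfree(mem);
}

The address test means callers never need to remember which allocator
succeeded, and the bulk of the allocations (tnodes etc) never hit the
vmalloc path at all.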
> > The chunkGroupBits issue also has impact on this.
>
> > > * Andre Renaud latched onto a problem which I then rediscovered in
> > > performance testing. Having chunk groups of 16 reduces performance by
> > > at least 50% but in practice can be much higher. By applying a
> > > version of his patch, I was able to reduce read time for a 50MB file
> > > from 27 seconds to around 15 seconds and have achieved sustained reads
> > > at 22.2Mbit/s on multi-GB devices reading many hundreds of MBs.
> >
> > I have written some code (minor testing so far, more testing and checkin
> > within 24 hours I hope) which should fix this.
> >
> > This code uses variable size bitmaps to fit the required bit width, thus
> > eliminating chunkgroups, but does not use as much RAM as the Bluewater
> > hack.
>
> I saw your postings. I think that is a *much* better idea since it
> will increase performance by 50-100% for some people. I combined that
> hack with a couple of other fixes and a DMA-enabled MTD to push
> performance to over 200% of what it was when I started working on it.
>
> > > * YAFFS2 makes use of some additional reads and memcpy's which don't
> > > seem entirely necessary - by combining and changing some of the logic
> > > it looks like we could get another 10% performance gain.
> >
> > Very much look forward to more info on this.
>
> OK. I'll look into that. There are several times where we call the MTD
> read where once would do (with some extra logic) and a few memcpy's
> where I think Linux could deal with a direct pointer instead (the MTD
> layer should handle the caching issues and memory coherence problems
> by doing any additional copies).
>
> > The WinCE stuff has some extra copying (that is actually no longer
> > required and will be eliminated). I hoped the Linux stuff was not doing
> > too much extra work.
>
> Not too much, but I took out one extra read (I'll track it down) and
> got a speed bump of around 5-10% in one go. A few more of those (it's
> worth someone sitting down and poring over this code if there is
> justification) and we've got free extra speed. Certainly YAFFS2 is
> now approaching the raw NAND performance when reading and writing
> through /dev/mtd/blah and that is the goal.

If you can provide a patch or annotated C file I will be most grateful.
While I will go through this all at some stage (soon I hope), it is
always good to have someone "look over your shoulder".

-- Charles
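P.S. To give a flavour of the variable width bitmap change ahead of the
checkin: each tnode slot stores a chunk id in just enough bits to cover
the device, instead of a fixed 16 bits plus a chunk group. Roughly this
(a sketch with made-up names, not the actual code, and assuming fewer
than 2^31 chunks):

#include <linux/types.h>

/*
 * Bits per tnode slot: just enough to hold any chunk id on this
 * device. Computed once at mount time.
 */
static unsigned tnode_width(unsigned n_chunks)
{
	unsigned w = 0;

	while ((1u << w) < n_chunks)
		w++;
	return w;
}

/*
 * Read the width-bit chunk id stored at slot 'pos' in the packed
 * bitmap, least significant bit first.
 */
static u32 tnode_get(const u8 *map, unsigned pos, unsigned width)
{
	unsigned bit = pos * width;
	u32 val = 0;
	unsigned i;

	for (i = 0; i < width; i++, bit++)
		if (map[bit >> 3] & (1u << (bit & 7)))
			val |= 1u << i;
	return val;
}

/* Write the chunk id for slot 'pos'. */
static void tnode_set(u8 *map, unsigned pos, unsigned width, u32 val)
{
	unsigned bit = pos * width;
	unsigned i;

	for (i = 0; i < width; i++, bit++) {
		if (val & (1u << i))
			map[bit >> 3] |= (u8)(1u << (bit & 7));
		else
			map[bit >> 3] &= (u8)~(1u << (bit & 7));
	}
}

A real version wants to work a word at a time rather than bit by bit,
but this is where the chunkgroups go away: the stored id is exact, so
there is no group of 16 candidate chunks to scan on every lookup.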