	This is revision number 3 of the clustering diffs.
New features this time around:

	1) A real, honest to god free list is now present.
The buffers in it are guaranteed to be clean, unlocked and not
shared with any other process.  There is a refill function
that keeps it supplied with buffers - right now it is written
to supply 64 buffers any time it is called.  Also, there is a separate
free list for each different size of buffer, but the LRU list is
common for all of the different sizes.

	2) A bdflush process is now present, and runs in the
background when we need to write back some dirty buffers.  Currently
it scans at most 1/4 of the buffer cache and writes back at most 500
buffers, whichever comes first.  These numbers are wild-assed guesses
as to what would be appropriate, and tuning would probably help.  An
interactive method of altering parameters might also be good.  Note:
you currently need to start the process from rc.  It may
eventually be possible to get bdflush started automatically without
having to run a process, but there are a lot of tricky and subtle
issues at hand here.  The source code for bdflush is at the end of
this message.


	3) iozone on a naked partition now consistently yields numbers
like 1.1-1.4Mb/sec.  I believe that further tuning would be good in
order to improve performance.  In particular, if there is a big wad of
dirty buffers coming through the LRU list, we do not detect this until
it gets to the top.  At this point we wake up bdflush(), but until
bdflush finishes, we have to crawl past this wad each time the refill
function is called.  Even then, the refill function supplies 64
buffers so the penalty is nowhere near as bad as it once was.  Some
further adjustment of the amount of data that bdflush writes back
would certainly be good, I guess.
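
	To make the refill behaviour described above concrete, here is
a rough sketch in C.  It is an illustration only, not the code from the
diffs: struct sketch_buf, sketch_refill() and the field names are all
made up, and the real buffer header and list handling are certainly
different.  The only number taken from the text is the 64 buffers
supplied per call.

```c
#include <stddef.h>

#define REFILL_COUNT 64	/* buffers supplied per call, as stated in the text */

/* Hypothetical, simplified buffer header; NOT the kernel's buffer_head. */
struct sketch_buf {
	int dirty;			/* needs to be written back */
	int locked;			/* I/O in progress */
	int users;			/* > 0 means in use or shared */
	int on_free_list;		/* already handed to the free list */
	struct sketch_buf *lru_next;	/* next entry on the common LRU list */
};

/*
 * Walk the LRU list from its head and mark up to REFILL_COUNT clean,
 * unlocked, unshared buffers as free.  Dirty buffers are skipped, and
 * *saw_dirty is set so that the caller can wake the flusher.
 * Returns the number of buffers moved to the free list.
 */
static int sketch_refill(struct sketch_buf *lru_head, int *saw_dirty)
{
	struct sketch_buf *b;
	int freed = 0;

	*saw_dirty = 0;
	for (b = lru_head; b != NULL && freed < REFILL_COUNT; b = b->lru_next) {
		if (b->dirty) {
			*saw_dirty = 1;	/* part of the "wad" we crawl past */
			continue;
		}
		if (b->locked || b->users > 0 || b->on_free_list)
			continue;
		b->on_free_list = 1;
		freed++;
	}
	return freed;
}
```

In this sketch every call has to crawl past the same run of dirty
buffers until something writes them back, which is the penalty the text
describes; handing out 64 buffers per call spreads that cost over 64
allocations.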

	There is code in buffer.c to generate clusters, and it is now
used by the block device code.  I am finding that it is not terribly
efficient to search for a page that we can reclaim, so it is best to
limit the search to only a fraction of the buffer cache.  Currently
this is set to 25%; I may back this off a little bit more.  This is
a tuning parameter that can be modified at run time via the bdflush()
syscall interface.

	The only thing left to do is to modify the filesystems to
request clustered buffers.  In the block devices, I basically do
something like:

	if((block % 4) == 0) generate_cluster(dev, block, blocksize);

which, as I look at it now, is incorrect because it assumes a 1024-byte
blocksize.  Nonetheless, once this is fixed, it could be added
directly to getblk so that we always request clustered buffers.  It
would be good if the filesystems were to try and align things on
cluster boundaries, but as I understand it, ext2 tends to keep files
contiguous so it probably should not matter that much.
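
	A blocksize-independent form of the alignment test above might
look like the sketch below.  It assumes that a cluster is exactly one
page and that a page is 4096 bytes; SKETCH_PAGE_SIZE and
starts_cluster() are made-up names for illustration and do not appear
in the diffs.

```c
/* Assumed page size; real code would use the kernel's own PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: returns nonzero if `block` is the first block
 * of a page-sized cluster for the given blocksize (in bytes).
 */
static int starts_cluster(unsigned long block, unsigned long blocksize)
{
	unsigned long per_page;

	if (blocksize == 0)
		return 0;			/* avoid dividing by zero */
	per_page = SKETCH_PAGE_SIZE / blocksize;	/* blocks per cluster */
	if (per_page == 0)
		return 0;			/* block larger than a page */
	return (block % per_page) == 0;
}
```

With a 1024-byte blocksize this reduces to the (block % 4) == 0 test
shown earlier; with 2048-byte blocks every second block starts a
cluster, and with 4096-byte blocks every block does.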

	One concern that I have with this is the overhead of searching
for a page that can be reclaimed to be used for a new cluster.  I am
toying with the idea of discouraging the buffer cache from breaking
apart clusters so that things are always done on a page basis.  In
fact, the buffer cache would be reorganized so that things are
generally done by handling pages.  This would speed up a number of
parts of the buffer cache, but the filesystems are still expecting
buffer headers.  Linus was also thinking along these lines, and as I
look at it now, it is beginning to make more and more sense to me.
There are still some things that need to be thought out before I can
go ahead, but I suspect that on the whole it will lead to better
performance.


-Eric

