





     UUsseerrffss -- FFiilleessyysstteemmss IImmpplleemmeenntteedd aass UUsseerr PPrroocceesssseess
           _J_e_r_e_m_y _F_i_t_z_h_a_r_d_i_n_g_e _<_j_e_r_e_m_y_@_s_w_._o_z_._a_u_>
                     Softway Pty. Ltd.



_1_.  _I_n_t_r_o_d_u_c_t_i_o_n

Userfs is a mechanism by which normal user processes can act
as a Linux filesystem.  There are many uses for this,
including:

Prototype filesystems

     Prototype  new  block  allocation  algorithms in a user
     process and debug with gdb before going into  the  com-
     pile-crash-reboot cycle of kernel development.

Infrequent use filesystems

     You  want to mount "FooBaz 0X" filesystems under Linux,
     but you don't want it that often, and you don't need it
     to  be  maximum  speed.   Rather than trying to get the
     kernel  itself  to  understand,  or  write  specialised
     tools, write a filesystem program.

Add capabilities to existing filesystems

     Want compression, encryption, ACLs?  Have a process
     mirror an existing file tree, but with your own
     extensions and semantics.

Completely virtual filesystems and new interfaces

     Add  a  filesystem-type interface to an existing mecha-
     nism, or a filesystem interface as a new way of  repre-
     senting data.  Sick of FTP?  How about

          $$ mmkkddiirr //ffttpp//ttssxx--1111..mmiitt..eedduu
          $$ ccdd //ffttpp//ttssxx--1111..mmiitt..eedduu//ppuubb//LLiinnuuxx
          $$ ccpp RREEAADDMMEE $$HHOOMMEE

     Or mail?

          $ cd /mail
          $ ls
          001.sbg@socs.uts.edu.au
          002.Leroy
          003.tlukka@vinkku.hut.fi
          004.Davor_Jadrijevic
          $ cat */From
          From: sbg@socs.uts.edu.au
          From: leroy@socs.uts.edu.au (Leroy)
          From: tlukka@vinkku.hut.fi
          From: davor%emard.uucp@ds5000.irb.hr (Davor Jadrijevic)
          $ cat */Subject
          Subject: More things
          Subject: (none)
          Subject: That userfs thing
          Subject: mailfs again
          $

You get the idea.


_2_.  _I_n_s_t_a_l_l_a_t_i_o_n

_2_._1  _K_e_r_n_e_l

First of all, remove traces of previous versions of userfs:
make sure there are no userfs header files in
linux/include/linux and no userfs patches to any of the
kernel source.

Apply the  patch  "userfs.diff"  in  the  normal  way  (it's
against  a  1.1.44 kernel).  Do a "make config; make depend;
make", and install.  It is not necessary to copy  any  files
into the linux source tree; only the patch is required.

Since this is an alpha release, you should know what you're
doing, and know how to fix up  simple  compilation  problems
with  the  kernel.  I'd be surprised if it just worked.  One
thing you may have problems with is config.in:  some  of  my
private changes have leaked in there, and you're almost cer-
tainly going to have local changes,  so  it's  unlikely  the
patch  will  be  clean.  The only important change is adding
the CONFIG_USERFS_FS line to the end of the filesystem  sec-
tion.

_2_._2  _N_o_n_-_k_e_r_n_e_l _C_o_d_e

Building the rest of the code should be a matter of typing
"make" in the top-level userfs directory.  This will generate
dependencies and build the utilities needed (genser), the
library, the clients using the library and the kernel
module.

This version is a loadable kernel module.  When you answer
"yes" to the userfs question, it doesn't put the filesystem
code into the kernel; it only puts the symbols needed at
module load time into the module symbol table.  I hope to
eliminate the need for any special changes to the kernel
soon.

If you've compiled your kernel with gcc 2.6.0, make sure you
compile  the  module  with gcc 2.6.0 as well.  Compiling the
C++ code with gcc 2.6.0 doesn't work.  Use 2.5.8, or perhaps
2.6.1 (I haven't tried it though).  To do this, type

     mmaakkee ''CCXXXX==ggcccc --VV22..55..88''

To install the module you need the modutils package, which
should be available from your local Linux ftp archive.  It
should be clear from its documentation what you need to do
with userfs.o to get it into the kernel.  If you get some
warnings about multiply defined symbols, ignore them.  Only
undefined symbols are a problem.

_2_._3  _M_a_i_l_i_n_g _l_i_s_t

There is a  USERFS  channel  on  the  Linux  Activists  list
server.  To subscribe, send mail with

     XX--MMnn--AAddmmiinn:: jjooiinn UUSSEERRFFSS

as the first line to linux-activists-request@niksula.hut.fi.
This channel is for general discussion of userfs development
and application.

_2_._4  _B_u_g_s_, _c_o_m_m_e_n_t_s_, _e_t_c

When  you  find  a  bug,  tell  me.  Please send me the code
you're using, the kernel version,  whatever  changes  you've
made  to userfs kernel code, and instructions or a script to
reproduce the bug.  Don't just tell me "it broke."

If you've made changes to the kernel code, please send them
to me rather than sending them out to the world.  Please send me
comments, ideas for new kernel features, or things that  you
think  would  make  good  filesystems but you can't do right
now.  Also feel free  to  ask  questions  about  either  the
implementation  of  my  code or how to write your own userfs
clients.

Send   mail    to    either    me    (Jeremy    Fitzhardinge
<jeremy@sw.oz.au>) or to the mailing list (see above).


_3_.  _U_s_i_n_g _c_l_i_e_n_t_s

Clients are generally mounted with the muserfs command.
It's quite simple - it's a program which makes sure the
mount point is legal for the user to mount on, and mounts
the given process with the user's permissions.  Note that
any user can mount a process, so more checking is done on
the mount point than for a normal mount.  Unless the user is
root, the mount point must be owned by the user and
writable.  muserfs has a man page, which is even up to date.

There are a few useful or semi-useful clients: homer, ftpfs,
mailfs and arcfs.

Homer is written in C++, and uses the C++ library in the lib
     directory to do most of its work.  All it does is set
     up a single directory under its mount point which
     contains symbolic links named after each user name in
     the password file, each pointing to the associated home
     directory.  Mounted on /u it makes a passable
     replacement for ~ expansion in a shell, except it works
     for any program.

Ftpfs is an experimental filesystem which allows read-only
     access to FTP sites, maintaining a long-term disk
     cache.  It's intended primarily for anonymous FTP, but
     can also be used for authenticated FTP sessions.

Mailfs is by Davor Jadrijevic.  It is for reading mail.
     Currently it's read-only and does not track mailbox
     changes, but it is being actively developed.

Arcfs was written by David Gymer.  It allows you to mount  a
     compressed  tar  file  as  a  read-only filesystem, and
     inspect it with normal tools.  It's pretty neat.


_4_.  _T_h_e_o_r_y _o_f _o_p_e_r_a_t_i_o_n

The kernel patch and module add a new filesystem type
to the kernel ("userfs").  The filesystem itself is very
simple; all it does is take the normal kernel filesystem
requests, wrap them up into well-defined packets,
squirt them down a file descriptor (presumably connected
to a process) and wait for the reply on another file
descriptor.

If the filesystem process is on the same machine,  then  the
file  descriptors  are  probably  ordinary  pipes.  However,
userfs just reads and writes on  the  file  descriptors,  so
they  could  be  anything;  files, sockets, devices - userfs
doesn't care.

The following is not a  comprehensive  tutorial  on  writing
filesystems,  or  a detailed "how it works" or specification
of the existing code.  It is intended to give some  idea  of
what  I  was  thinking,  and  basic concepts to bear in mind
while poking about in my kernel or user code.

4.1  Priorities

I had a number of goals which I  wanted  satisfied  by  this
thing (from most to least important):

Flexibility

     I want the process to have as much of the power of a
     kernel-resident filesystem as possible.  I wanted to
     keep the interfaces generic and flexible.  This has been
     mostly achieved.

Robustness

     Since I see prototyping and development as a major use
     for userfs, it seems important to make sure that the kernel
     code can't (at worst) crash or lock up if the user code
     fails.   As  it  stands,  it should be impossible for a
     user process to crash the kernel, but  it  is  possible
     for  a  bad user process to lock up processes trying to
     use the filesystem.

     It is also possible for a process to go  strange  while
     it is being mounted, leaving a half-mounted filesystem.
     The mountpoint becomes a nulled out inode, but the ker-
     nel  refuses  to unmount it (because it isn't mounted),
     and refuses to mount on it (because it's  busy).   This
     happens  much  less  often  than  it  used  to, because
     muserfs does a simple check to see  if  the  filesystem
     process is at all viable.

Availability to users and Security

     I'd like any user to be able to write a filesystem pro-
     cess.   Traditionally,  filesystems  are  things   that
     embody  the  security  of  Unix, and are therefore very
     much superuser-only things.  However, there are only  a
     couple  of  really sensitive features that shouldn't be
     able to be controlled by any user: suid executables and
     device  nodes.   Since  a  trusted superuser process is
     still required to call the mount system call, and  that
     process  can  set  the no-suid and no-device flags, the
     filesystem code can't use these as security  holes.   I
     can't  think  of  anything else that needs special care
     from a security point  of  view.   However,  since  the
     filesystem  is completely under the control of the pro-
     cess, one can make no assumptions about  its  contents.
     For  example  "."  and ".." may not do expected things,
     symlinks may point to places other than  what  readlink
     returns.   This makes navigating such filesystems a new
     and interesting experience.

Efficiency

     Efficiency is my lowest priority, but it is still
     important.   Unfortunately  the  other requirements (as
     usual) make things less efficient.  The  most  signifi-
     cant  inefficiency  is the context switches between the
     kernel and the process.  I think the most benefits  can
     be  gained  by reducing the number of these.  There are
     several approaches to this:

         o If the process wants a well-defined behaviour for
          an  operation,  then it should be done in the ker-
          nel.  The  best  example  of  this  is  permission
          checking  -  if the process wants normal unix per-
          mission checking then it doesn't  need  to  do  it
          itself.   Otherwise it can take all the permission
          requests from the kernel, and implement other per-
          mission  policies.  This is currently implemented.
          When the filesystem is first mounted,  the  kernel
          asks  the  process  what  requests it will accept.
          From  that  point  the  kernel  will  do  sensible
          default  actions  for  requests  that  the process
          doesn't want to handle rather  than  sending  them
          down the connection.

         o Group requests commonly issued together into one.
          This is hard, since  the  main  kernel  tells  the
          filesystem  code  very  little  about  what  it is
          doing, so it is hard to  know  what  to  do  next.
          However,  there  are  a  couple  of  single kernel
          requests that are implemented in the  protocol  as
          two  or more transactions.  This could be fixed in
          future.

         o Data can be cached in the kernel.  This is the
          most  tricky,  since  kernel caching or read-ahead
          limits the amount of control the process can  have
          over  the  data  once read.  I think this could be
          optionally implemented, depending on  whether  the
          process  says  it  is  OK to do caching, and if so
          what kinds.

          Currently directory readahead is implemented  with
          the  uupppp__mmuullttiirreeaaddddiirr  operation.  This allows the
          filesystem process to  return  as  many  directory
          entries  as  it  likes.   These  entries are saved
          attached to the directory  inode  in  the  kernel.
          Future  readdir  requests  look  in  the readahead
          buffer before sending an operation to the filesys-
          tem.   If it fails to find the required entry then
          it  dumps  the  readahead  buffer  and  asks   the
          filesystem  process again.  This is a win if there
          are lots of linear  directory  searches  (such  as
          shell globbing, ls or pwd).

         o A larger than 4k maximum packet size can be used,
          now that the kernel memory allocator allows larger
          than 4k memory allocations.  However, since pipes
          are the most common connection between filesystems
          and kernels, and pipes can hold at most 4k of
          data, there would still be a context switch
          between filesystem code and kernel every 4k, so
          there wouldn't be much gain.
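
     The directory readahead behaviour described above can be
     modelled in a few lines.  This is only a sketch of the
     idea, not the kernel code: the DirEntry layout, the
     Readahead class and the fetch callback (standing in for a
     upp_multireaddir round trip) are all made up for
     illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical directory entry: a position in the directory and a name.
struct DirEntry {
    long offset;
    std::string name;
};

class Readahead {
    std::vector<DirEntry> buf_;   // entries saved from the last reply
public:
    int fetches;                  // round trips to the filesystem process

    Readahead() : fetches(0) {}

    // fetch(offset) stands in for one multireaddir request; it returns
    // as many entries as the filesystem process likes, starting at offset.
    template <class Fetch>
    bool readdir(long offset, Fetch fetch, DirEntry &out)
    {
        assert(offset >= 0);
        if (find(offset, out))
            return true;          // served from the readahead buffer
        buf_ = fetch(offset);     // dump the buffer and ask again
        fetches++;
        return find(offset, out); // false here means end of directory
    }

private:
    bool find(long offset, DirEntry &out) const
    {
        for (std::size_t i = 0; i < buf_.size(); i++) {
            if (buf_[i].offset == offset) {
                out = buf_[i];
                return true;
            }
        }
        return false;
    }
};
```

     A linear scan of a directory then costs one round trip for
     the entries plus one more to discover the end, rather than
     one per entry.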

     A number of people have suggested adding shared  memory
     between the kernel and the filesystem process.  This
     would be quite limiting and is the option least likely to
     improve things.  At the moment, the filesystem makes no
     assumptions about the nature of  the  file  descriptors
     for talking to the process.  To implement shared memory
     between the kernel and the process would  require  some
     way of finding the process on the other end of the file
     descriptors (if any), and playing  around  with  memory
     maps.   This  still  wouldn't cut down on the number of
     context switches at all.

_4_._2  _P_r_o_t_o_c_o_l

The protocol used is machine independent, using network byte
order and defined type sizes.  The code to do the packetisa-
tion and depacketisation is  generated  automatically  by  a
program,  given the description of each packet.  This is not
fully portable, but  it  avoids  byte  order  and  structure
alignment problems.

A  packet to or from the kernel has two parts.  The first is
a header that contains a sequence number, an operation type,
a  packet  type,  size of the following data, and a protocol
version number.  The packet type can either be a request,  a
reply or an enquiry.  Requests and enquiries are always from
the kernel to the process, and the process only  ever  sends
replies to the kernel.  A reply's header has one extra field
- an error  field,  containing  an  error  number.   Replies
always  have the same sequence number as their corresponding
request or enquiry.  If there was an  error  performing  the
operation  the  error  field  is set to the error number and
there is no additional data returned.  If there is no  error
the error field is set to 0.
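
As a rough illustration of the header described above, here is a
minimal sketch of encoding and decoding one in network byte
order.  The field widths (32 bits each), their order, the
PacketType values and the function names are assumptions made
for the example; the real layout is whatever the generated
packetisation code defines.

```cpp
#include <arpa/inet.h>   // htonl, ntohl
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical header: five 32-bit fields in the order the text lists them.
struct Header {
    uint32_t seq;      // sequence number
    uint32_t op;       // operation type
    uint32_t type;     // packet type
    uint32_t size;     // size of the data following the header
    uint32_t version;  // protocol version number
};

// Assumed numbering; only the three kinds come from the text.
enum PacketType : uint32_t { REQUEST = 0, REPLY = 1, ENQUIRY = 2 };

const std::size_t HEADER_BYTES = 5 * sizeof(uint32_t);

// Write each field to buf in network byte order.
inline void encode_header(const Header &h, unsigned char *buf)
{
    assert(buf != nullptr);
    const uint32_t fields[5] = { h.seq, h.op, h.type, h.size, h.version };
    for (int i = 0; i < 5; i++) {
        uint32_t be = htonl(fields[i]);
        std::memcpy(buf + i * sizeof(uint32_t), &be, sizeof be);
    }
}

// Read the fields back, converting to host byte order.
inline Header decode_header(const unsigned char *buf)
{
    assert(buf != nullptr);
    uint32_t f[5];
    for (int i = 0; i < 5; i++) {
        uint32_t be;
        std::memcpy(&be, buf + i * sizeof(uint32_t), sizeof be);
        f[i] = ntohl(be);
    }
    Header h = { f[0], f[1], f[2], f[3], f[4] };
    return h;
}
```

Copying field by field, rather than writing the struct out
whole, is what sidesteps the structure alignment problems
mentioned earlier.  A reply header would carry one more field,
the error number, encoded the same way.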

Following a request or reply header is the optional
operation-specific data.  This is passed through the protocol for
interpretation by the operation routines at each end.

The kernel may have multiple outstanding requests.  In other
words, the kernel may send a new request before receiving  a
reply  to  a  previous  one.   This allows the filesystem to
block one process for a slow operation while other processes
can   use  the  filesystem  for  shorter  operations.   This
improves performance on, for example, an ftp filesystem,
where one process may be using a fast local link, and
another may be using a slow international one, and each only
has to wait for its own requests to be satisfied.  Of course
this requires the filesystem process to be written with some
form of multi-threading.  A process without it can simply
read a request, act on it and reply, ignoring any further
kernel requests until it is ready to deal with them.

_4_._3  _H_a_n_d_l_e_s

The base element of a filesystem is an inode.  There is an
exact one to one relationship between inodes and files
(where a file in this case can be any filesystem object,
like a normal file, a directory and so on).  The kernel
needs to be able to uniquely identify inodes.  Inodes are
uniquely numbered within a filesystem, but each mounted
filesystem has its own numbering.  Therefore an inode is
completely identified by an inode number and a filesystem
identifier (or device, even though it doesn't mean much for
a filesystem which is not on a disk).

A device is what distinguishes mounted filesystems from one
another, and an inode is what distinguishes files within a
filesystem from each other.  Inode numbers are generated by
each filesystem, and are used by the kernel to refer to
specific files to the filesystem specific code.  User process
filesystems are no exception: between the kernel and the
filesystem process, files are referred to by using handles,
which are essentially 32 bit unsigned numbers.  When a
process first mentions a file to the kernel, it gives it a
handle, which the kernel uses for all later operations on the
file.  It is the handle which identifies the file, rather
than the name, so it is important to use distinct handles
for distinct files, and never change the handle of a file
once it has been given to the kernel.

_4_._4  _R_a_n_d_o_m _o_p_e_r_a_t_i_o_n _s_p_e_c_i_f_i_c _a_d_v_i_c_e _a_n_d _b_l_u_r_b

This may eventually accurately describe the whole protocol,
but for now it's a list of interesting points and things that
have bitten me.

Normally when writing a filesystem you should use the
library libuserfs (see below), and use the advice in this
section as a guide on what kind of things should be put in
your userfs operation functions, or for idle curiosity.

_4_._4_._1  _M_o_u_n_t_i_n_g

The mount is initiated by a user process calling the mount
system call, with the "userfs" filesystem type.  In the
filesystem specific data, the process passes two file
descriptor numbers for the kernel to read from and write to.
These can be any kind of file descriptor at all.  Most
commonly they would be pipes or sockets, but there is no
restriction.  All the kernel requires is that the one it talks
to the process with is writable, and the one it gets replies
from is readable.

The  most  important  request  is mounting.  Most important,
because it is the only  request  that  the  process  has  to
implement  (of  course, not implementing anything else would
be completely useless).  The request itself is not that com-
plex.   All  it  does is return a handle of the inode at the
root of the filesystem.   Most  commonly,  this  will  be  a
directory.   Userfs  does  not  enforce this, but the kernel
itself may.

After the process returns the root handle, the  kernel  will
probe  the  process  to see what operations it is willing to
support.  This is done by sending a series of enquiry
packets to the process.  The process should reply with normal
reply packets, with the errno field either set to 0 if it is
supported  or  ENOSYS if it isn't.  No real operation should
be done, and no additional information should be sent in the
reply.   If  the  process replies ENOSYS to an operation, it
will never receive it again, and the kernel will use a  sen-
sible  default  for it (typically what the kernel would nor-
mally do for an in-kernel filesystem if it  doesn't  support
the  operation).   Conversely,  if  the  filesystem  process
doesn't get an enquiry about a particular operation from the
kernel,  it  will  never see that operation from the kernel.
The filesystem process should send 0 for the  operations  it
explicitly  supports, and ENOSYS for everything else, so the
protocol can be extended without having to  modify  existing
clients.
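
The enquiry rule above (0 for what you explicitly support,
ENOSYS for everything else) amounts to a lookup in a set.  A
minimal sketch follows; the Op numbering and the EnquiryTable
class are hypothetical and are not part of userfs or libuserfs.

```cpp
#include <cassert>
#include <cerrno>
#include <set>

// Made-up operation numbers; the real ones come from the userfs protocol.
enum Op { OP_MOUNT = 1, OP_IREAD, OP_LOOKUP, OP_READDIR, OP_PERMISSION };

class EnquiryTable {
    std::set<int> supported_;
public:
    // Record an operation this filesystem process implements.
    void support(int op) { supported_.insert(op); }

    // Value for the reply's error field: 0 if the operation is
    // implemented, ENOSYS for anything else, including operations
    // added to the protocol after this client was written.
    int answer(int op) const
    {
        return supported_.count(op) ? 0 : ENOSYS;
    }
};
```

Because the default answer is ENOSYS, a client built this way
does not need changing when the kernel starts enquiring about
operations it has never heard of.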


_4_._4_._2  _R_e_a_d_i_n_g _I_n_o_d_e_s

The most common thing for a filesystem to be asked to do is
to read inodes.  For the process, this involves filling out
a structure much like the kernel's inode structure and the
stat structure.  The important thing is to make sure the
nlinks field is non-zero.  This field is the number of names
the inode has, that is, the number of directory entries in
the filesystem which refer to this inode.  In theory, this
can never be 0 when the kernel asks for the inode, because
that means that the kernel asked for the inode without ever
seeing a name referring to it, implying that the filesystem
never told the kernel about the file.  If it is 0 then the
kernel will never "put" the inode, and it will make the
filesystem impossible to unmount.

When the kernel wants an inode from the filesystem, it uses
the upp_iread protocol request to fetch it.  This happens if
something in the kernel asks for the inode, but it isn't
already in the kernel inode table.  Therefore, once the
kernel has asked the filesystem for an inode, it will not ask
for it again while anything in the kernel is using it.

Once nothing in the kernel is using the inode, the kernel
will issue an upp_iput operation, which may be preceded by
an upp_iwrite if the inode was modified in use.  A
filesystem need not implement these operations if there is
no need to do so.

_4_._4_._3  _O_p_e_n _a_n_d _C_l_o_s_e

Reading and putting inodes are the basic operations:
regardless of what an inode is being used for it will be read
and put.  The upp_open and upp_close operations specifically
correspond to the open(2) and close(2) system calls.
Normally a filesystem doesn't need to perform any special
handling for these operations, and would not implement
them, except if it wants to know the identity of the process
doing the operations.  When a program issues an open system
call for a file on the user filesystem, the kernel will send
a upp_open operation for the file, which includes complete
identification for the process which issued the open.  When
the filesystem replies it returns a credentials token.  From
then on, that credentials token is sent to the filesystem in
all operations which correspond to a system call which takes
a file descriptor as an argument, such as read, write,
readdir, lseek and so on.

This may seem a bit complex: why not just send the uid of
the process with the operations?  Well, the credentials of a
process are quite complex, since they include the real,
saved and effective uids and gids of the process, and all
the auxiliary groups.  Sending this with each request would
be quite an overhead.  The idea is that all the info is sent
on an open, and the filesystem process can associate it with
a token internally, and only use the token in correspondence
with the kernel.

Also note that the credentials are associated with an open
file descriptor, not the process performing the operation.
Mostly a process will deal with file descriptors it has
created itself, but it's quite possible that it can inherit
file descriptors from another process with a different set of
credentials.  In this case the filesystem knows the original
process's credentials, but not those of the process which is
performing the operation.
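
One way a filesystem process might keep track of credentials
tokens is a simple table filled in on open and consulted on
later operations.  This is a sketch only: the Creds layout, the
Token type and the CredTable class are assumptions for the
example, and dropping the entry on close is my guess at sensible
housekeeping rather than something the protocol requires.

```cpp
#include <cassert>
#include <map>
#include <sys/types.h>
#include <vector>

// Hypothetical record of what an open request carries.
struct Creds {
    uid_t ruid, euid, suid;       // real, effective and saved uids
    gid_t rgid, egid, sgid;       // real, effective and saved gids
    std::vector<gid_t> groups;    // auxiliary groups
};

typedef unsigned long Token;      // assumed token type

class CredTable {
    std::map<Token, Creds> table_;
    Token next_;
public:
    CredTable() : next_(1) {}

    // On open: remember the credentials, return the token for the reply.
    Token remember(const Creds &c)
    {
        Token t = next_++;
        table_[t] = c;
        return t;
    }

    // On read, write, readdir and so on: find the credentials the
    // file was opened with.  Returns nullptr for an unknown token.
    const Creds *find(Token t) const
    {
        std::map<Token, Creds>::const_iterator it = table_.find(t);
        return it == table_.end() ? nullptr : &it->second;
    }

    // On close: the token is no longer needed.
    void forget(Token t) { table_.erase(t); }
};
```

Note that what comes back from find is the opener's
credentials, which matches the point above: an inherited file
descriptor still maps to whoever opened it.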

_4_._4_._4  _H_a_n_d_l_e _M_a_n_a_g_e_m_e_n_t

The handle of an inode is the only way the kernel and the
filesystem can talk about a file.  An inode may have more
than one name, or no names at all, so file names are not a
good way of keeping track of a file.  Use inodes in your
filesystem code to keep track of files, even if you have a
simple 1:1 name to file mapping.

Handles  must also be consistent.  Of course you must always
keep the handles of files currently in use  consistent,  but
you  must also keep them consistent between uses.  If a pro-
cess opens a file once, closes it and then reopens it,  then
it  will  expect  it  to have the same inode number if it is
supposed to be the same file (which is how processes using a
user filesystem will see the file handles).

Also, if you ever refer to a handle in communication with
the kernel, you must be prepared for the kernel to ask about
it.  For example, if the kernel reads a directory with the
upp_readdir or upp_multireaddir operations, each entry in
the reply will have a name and a handle.  Each of those
handles must be the handle the file will have if the kernel
later looks at the file more closely.  If you make them all
the same, for example, then a program would be entitled to
believe that all the names in the directory refer to one
actual file.
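
The consistency rules above can be met by allocating a handle
the first time a file is mentioned and never changing it
afterwards.  The sketch below assumes the filesystem has some
stable internal identifier for each file (the string key here,
for instance a message id); that key and the HandleTable class
are hypothetical and not part of libuserfs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

typedef uint32_t Handle;   // handles are essentially 32 bit unsigned numbers

class HandleTable {
    std::map<std::string, Handle> by_key_;     // file identity -> handle
    std::map<Handle, std::string> by_handle_;  // handle -> file identity
    Handle next_;
public:
    HandleTable() : next_(1) {}

    // Return the handle for a file, allocating one the first time the
    // file is mentioned.  The same file always gets the same handle.
    Handle handle_for(const std::string &key)
    {
        std::map<std::string, Handle>::iterator it = by_key_.find(key);
        if (it != by_key_.end())
            return it->second;
        Handle h = next_++;
        by_key_[key] = h;
        by_handle_[h] = key;
        return h;
    }

    // Resolve a handle the kernel sends back.  Returns nullptr if the
    // handle was never given out.
    const std::string *lookup(Handle h) const
    {
        std::map<Handle, std::string>::const_iterator it = by_handle_.find(h);
        return it == by_handle_.end() ? nullptr : &it->second;
    }
};
```

The key identifies the file, not one of its names, so two
directory entries for the same file would map to one key and
therefore one handle.  Handles are never reused in this sketch,
which wastes numbers but keeps them consistent between uses.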

_4_._4_._5  _D_e_a_l_i_n_g _w_i_t_h _m_u_s_e_r_f_s

Writing  a  client  which  can be handled by muserfs is very
easy.  The important thing to remember is that the  filesys-
tem  process can basically ignore muserfs, and ignore issues
like how to quit and so on.

A userfs filesystem process should only terminate under one
condition: it gets an EOF (a read of 0 bytes) from the
kernel on the file descriptor it's reading operation requests
from.  Muserfs will execute it so that most signals are
ignored, so muserfs can handle them itself.  When the muserfs
process is sent a SIGINT or SIGTERM it unmounts the
filesystem mount point with the umount(8) command (used so
that /etc/mtab is updated properly).  This causes the kernel
to send the filesystem process a upp_umount operation.  The
kernel will close its end of the file descriptors, and the
process is expected to do the same, even if only by exiting.
Therefore, when trying to unmount a userfs filesystem, do
not kill the filesystem process directly, and do not kill
muserfs with SIGKILL.  If you do either, you should still be
able to unmount with umount as root.


_5_.  _U_s_i_n_g _l_i_b_u_s_e_r_f_s

_l_i_b_u_s_e_r_f_s is a C++ library designed to make writing filesys-
tem  clients  easier.  It is designed so all the work common
to almost all filesystems is encapsulated into a few generic
classes,  which  can  be  used  as base classes for specific
filesystem functions.










                           - 12 -



_5_._1  _B_a_s_i_c _C_l_a_s_s_e_s

The most basic classes, Comm, Filesystem and Inode, implement
the basic communication with the kernel and stub methods for
each operation.

The Comm class reads from the kernel and decodes the headers
of  the  operation  packets, and passes the remainder to the
Filesystem class.  The Filesystem performs the operation and
returns  an  unencoded return header and the encoded body of
the reply, if any.  All this is  not  exposed  to  the  code
using the library.

Filesystem takes each operation and dispatches it to the
appropriate place.  The Filesystem class directly handles
the operations which are global to the whole filesystem, such
as mounting or unmounting.  For operations which pertain to a
particular Inode (such as reading, or looking up a name in a
directory), Filesystem looks up the Inode in its table and
dispatches the operation to it.

The  Inode  class  has  all its methods implemented as stubs
which fail with the "not implemented" error code.   It  also
has members for the standard inode properties of mode, type,
size, ownership, links, timestamps and so on.

These classes are completely useless on their own, so they
must be used as base classes for other classes which actually
do something.  libuserfs has more specific, but still
generally useful classes.

SSiimmpplleeIInnooddee  implements  a  simple  inode with some normally
expected behaviour.  It has a constructor which  initializes
the  inode  properties to sensible values, and methods which
implement simple defaults for the open,  close  and  permis-
sions check operations.

DDiirrIInnooddee,, derived from SimpleInode, implements all the oper-
ations needed for a directory, including linking and unlink-
ing inodes to/from names, rename, and directory scanning and
lookup.  It takes very little extra code to implemement sim-
ple directory behaviour.

_5_._2  _W_r_i_t_i_n_g _y_o_u_r _o_w_n _f_i_l_e_s_y_s_t_e_m _c_l_a_s_s_e_s

A complete filesystem has two parts: a collection of inodes,
one for each file,  and  the  filesystem  structure  itself,
which  holds all the inodes together.  Each inode represents
a file in the filesystem, regardless of type.  There is only
one  inode  per  file  in  the  filesystem, even if the file
appears multiple times under different names.

5.2.1  Arguments and return values of operation methods

Each method with the name do_something in the Filesystem and
Inode classes corresponds to an operation in the userfs
protocol.  As a result, they all have similar argument
structures.  All such methods take const up_preamble &pre and
upp_repl &repl, which are references to the operation request
and reply packet headers.  Mostly there is no reason for
operation methods to use them, because their contents are
dealt with in lower levels of the library, but they are
there if you want them.

Each userfs protocol operation may have arguments, return
values, both or neither, and the method for that operation
will have corresponding arguments.  For an operation named x
the method argument with the operation arguments will have
the type const upp_x_s, and the return values argument will
have the type upp_x_r.  For example, the up_read operation
will correspond to the Inode method

     iinntt IInnooddee::::ddoo__rreeaadd((ccoonnsstt uupp__pprreeaammbbllee &&pprree,, uupppp__rreeppll &&rreeppll,,
                        ccoonnsstt uupppp__rreeaadd__ss &&aarrggss,, uupppp__rreeaadd__rr &&rreett));;

The contents of the structures, along with encoding and
decoding functions, are machine generated, and therefore
follow a consistent set of rules.  Mostly it's quite simple,
with normal base types directly corresponding to C and C++
types.  However, variable sized types need to have both a
pointer to the data and the size of the data encoded into
them.  Memory for the data is allocated and freed with the
C++ new and delete operators; to allocate it, call the alloc
method of the variable sized object.  The memory is
automatically freed by the method's caller.  For example, if
a return value of a method contains a member called name
representing a filename, it would be set with the following
sequence (assuming ourname is a normal 0 terminated string):

     iinntt nnaammeelleenn == ssttrrlleenn((oouurrnnaammee));;
     rreett..nnaammee..aalllloocc((nnaammeelleenn));;                   //// AAllllooccaattee mmeemmoorryy
     rreett..nnaammee..nneelleemm == nnaammeelleenn;;                  //// SSeett nnaammee lleennggtthh
     mmeemmccppyy((&&rreett..nnaammee..eelleemmss,, oouurrnnaammee,, nnaammeelleenn));; //// SSeett nnaammee ccoonntteennttss
     //// ......

(Alternatively, you could just point ret.name.elems directly
at ourname, because no attempt is made to free the string if
it was never allocated.)

Note that strings are never zero terminated; the  length  of
the  returned  string is exactly the number of characters in
the string.
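
The sequence above can be tried out without the library.  The
following is only a sketch: DemoVarChars and DemoReply are
hypothetical stand-ins for the machine generated types, and
only the member names alloc, nelem and elems are taken from
the text.  In particular, freeing the memory in a destructor
is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a machine generated variable sized field.
// Only the member names alloc, nelem and elems come from the text;
// the layout and the ownership flag are assumptions of this sketch.
struct DemoVarChars {
    char        *elems;
    std::size_t  nelem;
    bool         owned;     // true only when alloc() obtained the memory

    DemoVarChars() : elems(0), nelem(0), owned(false) {}
    ~DemoVarChars() { if (owned) delete[] elems; }

    void alloc(std::size_t n) {
        if (owned) delete[] elems;
        elems = new char[n];
        owned = true;
    }

private:
    DemoVarChars(const DemoVarChars &);             // not copyable
    DemoVarChars &operator=(const DemoVarChars &);
};

// Hypothetical reply structure with a single filename member.
struct DemoReply {
    DemoVarChars name;
};

// Copy a 0 terminated string into the reply, leaving the terminator out.
void set_name(DemoReply &ret, const char *ourname) {
    assert(ourname != 0);
    std::size_t namelen = std::strlen(ourname);
    ret.name.alloc(namelen);                        // Allocate memory
    ret.name.nelem = namelen;                       // Set name length
    std::memcpy(ret.name.elems, ourname, namelen);  // Set name contents
}

// Read the field back using its explicit length, never strlen().
std::string get_name(const DemoReply &r) {
    return std::string(r.name.elems, r.name.nelem);
}
```

Because the field carries no terminator, anything reading it
has to go by nelem, as get_name does here.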

If the operation the method is performing fails,  it  should
return  the  appropriate  error  code,  or 0 if it succeeds.
Don't return -1 unless you mean to - it has special  meaning
(see below, in "Deferring Replies").
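
As an illustration of the return convention, here is a sketch
of a do_read style method.  The structure names follow the
naming rule given earlier, but their members (off, size,
data) and the DemoTextInode class are assumptions made for
this example, not the generated definitions.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the generated protocol structures.  The
// type names follow the text; every member shown here is an assumption.
struct up_preamble {};
struct upp_repl {};
struct upp_read_s { long off; std::size_t size; };
struct upp_read_r { std::string data; };  // simplified variable sized field

// A read-only inode whose contents are a fixed string.
class DemoTextInode {
public:
    explicit DemoTextInode(const std::string &s) : contents(s) {}

    // Returns 0 on success or an errno value on failure; never -1.
    int do_read(const up_preamble &, upp_repl &,
                const upp_read_s &args, upp_read_r &ret) {
        if (args.off < 0)
            return EINVAL;                  // bad offset: fail
        std::size_t off = static_cast<std::size_t>(args.off);
        if (off >= contents.size())
            ret.data.clear();               // 0 bytes returned means EOF
        else
            ret.data = contents.substr(off, args.size);
        assert(ret.data.size() <= args.size);
        return 0;
    }

private:
    std::string contents;
};
```

The header arguments are accepted but unused, which matches
the remark that most methods have no reason to look at them.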

_5_._2_._2  _D_e_r_i_v_i_n_g _f_r_o_m _F_i_l_e_s_y_s_t_e_m

A class derived from Filesystem must implement a number of
methods to make the filesystem viable:

_E_n_q_u_i_r_e is called when the kernel wants to find what  opera-
     tions your filesystem supports.  For all the operations
     that any inode will  implement,  return  0  and  return
     ENOSYS for the rest.

_d_o___m_o_u_n_t takes  no  arguments and returns the handle for the
     inode for the root directory (that is, the  top  direc-
     tory  of your filesystem).  The kernel immediately does
     a ddoo__iirreeaadd operation using this handle.
You can also implement _d_o___s_t_a_t_f_s which allows the kernel  to
get  space  and inode usage statistics, such as when "df" is
executed,  and  _d_o___u_m_o_u_n_t  so  the  filesystem  is  formally
informed  when it is unmounted (normally it just gets an EOF
from the kernel, and Comm::Run returns).
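
The Enquire rule (0 for supported operations, ENOSYS for the
rest) might look like the sketch below.  The DemoOp codes,
the DemoReadOnlyFs class and the Enquire signature shown are
all hypothetical; the real method's arguments are not
described here.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical operation codes; the real protocol defines its own.
enum DemoOp {
    OP_IREAD, OP_LOOKUP, OP_READDIR, OP_READ,
    OP_WRITE, OP_CREATE, OP_SYMLINK,
    OP_COUNT
};

// A read-only filesystem: it can be browsed and read, nothing else.
class DemoReadOnlyFs {
public:
    // Return 0 for every operation some inode implements,
    // and ENOSYS for all the others.
    int Enquire(DemoOp op) const {
        assert(op >= 0 && op < OP_COUNT);
        switch (op) {
        case OP_IREAD:
        case OP_LOOKUP:
        case OP_READDIR:
        case OP_READ:
            return 0;
        default:
            return ENOSYS;
        }
    }
};
```

Listing the supported operations and letting everything else
fall through to ENOSYS keeps the answer safe when new
operation codes are added.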

_5_._2_._3  _D_e_r_i_v_i_n_g _f_r_o_m _I_n_o_d_e

Most of the work of the filesystem is done  in  the  inodes.
All  inode classes must be derived from Inode, and generally
there will be a number of different Inode based classes.

It is probably better to use SimpleInode as  a  base  rather
than  plain Inode, because it implements simple defaults for
some methods,  which  would  otherwise  fail.   If  Filesys-
tem::Enquire  says that the filesystem supports a particular
operation, then any inode should be  prepared  to  get  that
operation from the kernel.

Similarly,  unless you are doing something special, deriving
directories from DirInode saves a lot of work.

Only  _d_o___i_r_e_a_d  need  be  implemented,  but  obviously   the
filesystem  will  do nothing interesting unless other opera-
tions are implemented.  do_iread returns the details of  the
inode.  Note that the Filesystem class calls the do_iread of
the Inode when the operation comes from the kernel,  so  the
inode  must  exist  by the time the kernel asks for it.  The
constructor for Inode automatically registers the  inode  in
the  Filesystem's  inode  table;  conversely, the destructor
removes it.
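
The register-on-construct, remove-on-destroy behaviour can be
pictured with the following sketch.  DemoFilesystem and
DemoInode are hypothetical classes written for this example;
the real Inode constructor arguments and the real inode
table are not shown in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

class DemoInode;

// Hypothetical filesystem holding a handle to inode table.
class DemoFilesystem {
public:
    DemoInode *find(unsigned long handle) const {
        std::map<unsigned long, DemoInode *>::const_iterator it =
            table.find(handle);
        return it == table.end() ? 0 : it->second;
    }
    std::size_t count() const { return table.size(); }

private:
    friend class DemoInode;
    std::map<unsigned long, DemoInode *> table;
};

// Hypothetical inode: registers itself when constructed and removes
// itself when destroyed, so the table never holds a stale pointer.
class DemoInode {
public:
    DemoInode(DemoFilesystem &owner, unsigned long h)
        : fs(owner), handle(h) {
        assert(fs.table.count(handle) == 0);    // handles must be unique
        fs.table[handle] = this;
    }
    virtual ~DemoInode() { fs.table.erase(handle); }

    unsigned long get_handle() const { return handle; }

private:
    DemoFilesystem &fs;
    unsigned long   handle;

    DemoInode(const DemoInode &);               // not copyable
    DemoInode &operator=(const DemoInode &);
};
```

With this arrangement a lookup by handle can only succeed
while the inode object is alive, which is why the inode has
to exist before the kernel asks for it.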

Here are some other useful methods for an Inode.  The
descriptions are brief and general, and don't necessarily
mention every argument and return value; those not mentioned
can generally be ignored.

_d_o___i_w_r_i_t_e is,  obviously, the opposite of do_iread.  It sim-
     ply sets the various Inode values.

_d_o___i_p_u_t is called when the kernel is  no  longer  using  the
     inode.   That is, the inode is no longer open, the cur-
     rent or root directory of  a  process,  being  executed
     from or being mapped from.  If an inode is iput and has
     no names (has no name to inode mapping  in  any  direc-
     tory) it can be destroyed.

_d_o___r_e_a_d allows data to be read from the file.  The arguments
     are the offset in the file to start reading  from,  and
     the number of bytes desiried.  The method may return as
     many bytes up to that number as it likes, including  0,
     which means EOF.

_d_o___w_r_i_t_e does the converse; a block of data and an offset is
     passed in, and the method returns the number  of  bytes
     actually written.

_d_o___l_o_o_k_u_p translates  a  name into an inode reference.  This
     is typically implemented for directories; if  the  name
     exists  in  the  directory the method should return the
     handle of the inode, or fail with ENOENT.

_d_o___d_i_r_r_e_a_d returns the next directory entry  at  the  passed
     offset.  It returns the name and inode of the next file
     in the directory, and the size of the  entry  returned.
     This  is  added  by the kernel to the current offset in
     the directory to form the offset of the next  directory
     entry  for  the next call.  Since the directory entries
     don't correspond to real file storage as in other, more
     conventional  filesystems,  a  directory  entry  can be
     regarded as having an offset of 1.

     If the end of the directory has been reached, it should
     return a new offset of 0.

_d_o___m_u_l_t_i_r_e_a_d_d_i_r is similar to do_readdir, but can return any
     number of directory entries,  which  are  cached  in  a
     readahead  buffer in the kernel.  If a program asks for
     a directory entry for  an  inode  which  has  a  cached
     directory  entry  then  the entry will come from within
     the kernel rather than asking the  filesystem  process.
     This  operation  can  return  only one entry (and so is
     like do_readdir), or as many as will fit  in  a  return
     packet  (up  to  4k  or  so  of entries).  Returning no
     entries  means  the  end  of  the  directory  has  been
     reached.   Returning multiple entries improves the per-
     formance of directory scans, most  frequently  done  by
     ls, pwd and shell globbing.

     Look at the implementation of DirInode::do_multireaddir
     for details of how this should be dealt with.

_d_o___c_r_e_a_t_e does all file creation, whether  it  be  a  normal
     file,  a  directory, a fifo file or a device node.  The
     mode contains the type of the file in same way  as  the
     stat structure member sstt__mmooddee..

_d_o___u_n_l_i_n_k is the opposite, and is used for unlinking (remov-
     ing a name to inode mapping) files and directories.  If
     an  inode is not in use and has no links then it can be
     destroyed and its handle can be reused.

_d_o___s_y_m_l_i_n_k is used to create new symlink inodes.  It returns
     the handle of the new inode.

_d_o___r_e_a_d_l_i_n_k returns  the  pathname  which a symbolic link is
     pointing to.

_d_o___f_o_l_l_o_w_l_i_n_k returns the pathname of the  file  a  symbolic
     link  is  really  referring to.  If Filesystem::Enquire
     says the filesystem does not  support  this  operation,
     the readlink operation is used instead.

_d_o___o_p_e_n is  called  when  a  file is actually opened.  It is
     only necessary to implement this if it is important  to
     know whether a file is being opened as opposed to being
     used in any  other  way.   This  operation  passes  the
     filesystem  the  complete authentication credentials of
     the process doing the open, so that the filesystem  can
     do  extended  security checking or change the behaviour
     of the file depending on the user.

     This method can return a credential token, which is a
     magic number used by the filesystem process to refer to
     the set of credentials passed by the kernel.  The
     kernel attaches this credentials token to each
     operation generated by system calls on the file
     descriptor generated by the open (read(), write(),
     readdir() and close()).  The credentials token is part
     of the file descriptor, so it is inherited unchanged if
     the descriptor is passed to another process, even if
     that process has different credentials.

     When  a  file is opened, a new file table entry for the
     inode is created.  That file table entry has  a  single
     file descriptor referring to it.  More file descriptors
     can be made to refer to the file table entry  with  the
     dduupp(2) system call, and can be removed with cclloossee(2).

_d_o___c_l_o_s_e is  called when the last file descriptor for a file
     table entry is closed.  The only argument for  this  is
     the  credentials  token  for  that file table entry, so
     that the filesystem can free all references to it.

_d_o___p_e_r_m_i_s_s_i_o_n is called when the filesystem says it wants to
     do permissions checking.  This is called a lot, and can
     cause many more operations to pass between  the  kernel
     and  filesystem  process.   If  the filesystem does not
     implement it the normal unix user/group/others checking
     is performed.

_d_o___r_e_n_a_m_e moves  a  file  from  one  directory  to a new one
     (though it may be the same).
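
The credentials token handed out by do_open and released by
do_close lends itself to a small table inside the filesystem
process.  The sketch below is one possible arrangement;
DemoCreds and DemoCredTable are hypothetical, and the real
credentials passed by the kernel carry more than a uid and
gid.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical credentials record; the real set passed by the kernel
// is richer than this.
struct DemoCreds {
    int uid;
    int gid;
};

// Hypothetical table mapping credential tokens to stored credentials.
class DemoCredTable {
public:
    DemoCredTable() : next(1) {}

    // do_open side: remember the credentials and hand back a token.
    unsigned long open(const DemoCreds &c) {
        unsigned long token = next++;
        table[token] = c;
        return token;
    }

    // read/write/readdir side: find the credentials for a token.
    const DemoCreds *find(unsigned long token) const {
        std::map<unsigned long, DemoCreds>::const_iterator it =
            table.find(token);
        return it == table.end() ? 0 : &it->second;
    }

    // do_close side: free everything held for the token.
    void close(unsigned long token) {
        assert(table.count(token) == 1);    // unknown token is a bug
        table.erase(token);
    }

    std::size_t size() const { return table.size(); }

private:
    unsigned long next;
    std::map<unsigned long, DemoCreds> table;
};
```

Tokens are never reused here, which avoids confusing a stale
token with a new open at the cost of an ever growing counter.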

_5_._2_._4  _D_e_r_i_v_i_n_g _f_r_o_m _D_i_r_I_n_o_d_e

DirInode implements a number of userfs operation methods for
directories,  such  as readdir, multireaddir and lookup.  It
also automatically constructs directories with "." and  ".."
entries pointing to the appropriate places.

DirInode deals with strings a lot, and rather than using the
normal cchhaarr ** it uses the libg++ SSttrriinngg class for all string
arguments  to  its  own methods (but not, of course, for the
userfs protocol operation methods).

DirInode expects a pointer to the parent directory, which is
also a class derived from DirInode.  If the directory is at
the top of the filesystem's tree, the pointer passed in
should be NULL.  The protected member parent points to the
parent inode, or to this for the top one.  That member
should never be NULL.

DirInode keeps a list of files in the  directory,  but  does
not allow that list to be directly visible.  The only opera-
tions for manipulating the directory contents for a  derived
class are:

iinntt lliinnkk((ccoonnsstt SSttrriinngg nnaammee,, IInnooddee **)) which  links a new name
     into the directory, updating all the reference and link
     counts;

iinntt uunnlliinnkk((ccoonnsstt SSttrriinngg nnaammee)) which does the opposite;

DDiirrEEnnttrryy **llooookkuupp((ccoonnsstt SSttrriinngg nnaammee)) which  returns  a direc-
     tory entry if it finds the file, or NULL otherwise; and

DDiirrEEnnttrryy **ssccaann((DDiirrEEnnttrryy ** &&ppooss)) which  returns the directory
     entry at _p_o_s_, updating it in the process,  or  NULL  if
     there are no more entries.

DDiirrEEnnttrryy **ssccaann((iinntt &&ppooss)) is  the  same,  except  it  uses an
     integer offset, which is less efficient.
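
To show how these four operations fit together, here is a
self-contained sketch.  DemoDir, DemoEntry and DemoNode are
hypothetical stand-ins, std::string replaces the libg++
String class, and the EEXIST and ENOENT return values are
assumptions, since the return conventions of link and unlink
are not stated above.

```cpp
#include <cassert>
#include <cerrno>
#include <iterator>
#include <map>
#include <string>

// Hypothetical inode with nothing but a link count.
struct DemoNode {
    int nlink;
    DemoNode() : nlink(0) {}
};

// Hypothetical directory entry: a name bound to an inode.
struct DemoEntry {
    std::string name;
    DemoNode   *inode;
};

class DemoDir {
public:
    // Bind a new name to an inode, keeping the link count in step.
    int link(const std::string &name, DemoNode *ino) {
        assert(ino != 0);
        if (entries.count(name))
            return EEXIST;
        DemoEntry e;
        e.name = name;
        e.inode = ino;
        entries[name] = e;
        ++ino->nlink;
        return 0;
    }

    // Remove a name to inode mapping.
    int unlink(const std::string &name) {
        Table::iterator it = entries.find(name);
        if (it == entries.end())
            return ENOENT;
        --it->second.inode->nlink;
        entries.erase(it);
        return 0;
    }

    // Return the entry for name, or NULL if there is none.
    DemoEntry *lookup(const std::string &name) {
        Table::iterator it = entries.find(name);
        return it == entries.end() ? 0 : &it->second;
    }

    // Return the entry at integer offset pos and advance pos, or
    // NULL when there are no more entries.
    DemoEntry *scan(int &pos) {
        if (pos < 0 || pos >= static_cast<int>(entries.size()))
            return 0;
        Table::iterator it = entries.begin();
        std::advance(it, pos);
        ++pos;
        return &it->second;
    }

private:
    typedef std::map<std::string, DemoEntry> Table;
    Table entries;
};
```

The integer scan walks from the start of the table on every
call, which illustrates why an offset based scan is the less
efficient of the two forms.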

_5_._3  _C_o_m_m_u_n_i_c_a_t_i_o_n_s _c_l_a_s_s_e_s

There are a number of communications classes in the library,
which provide different ways of multiplexing replies.

The  most  simple is the Comm class, which simply takes each
request, passes it to the  filesystem  and  sends  back  the
reply.  There are more complex comms classes though.

_5_._3_._1  _F_i_l_e _D_e_s_c_r_i_p_t_o_r _D_i_s_p_a_t_c_h_e_r

The  CCoommmmBBaassee  class  (base of all comms classes) provides a
dispatcher which allows  classes  to  register  interest  in
activity  on  file  descriptors.  This is used internally to
get input from the kernel, but can be used by  a  filesystem
to  monitor  any  file descriptor for any reason.  To do it,
simply derive a dispatcher class from  DDiissppaattcchhFFDD  and  call
ssttrruucctt  ddiisspp__ffdd  CCoommmmBBaassee::::aaddddDDiissppaattcchh((iinntt ffdd,, DDiissppaattcchhFFDD **,,
iinntt wwhhaatt)), where what can be one or more of  _D_I_S_P___R,  _D_I_S_P___W
or _D_I_S_P___E, for interest in read ready, write ready or excep-
tions.  When an event occurs,  the  DDiissppaattcchhFFDD::::ddiissppaattcchh((iinntt
ffdd,,  iinntt wwhhaatt)) method is called of the registered class.  If
it returns 0 then it is removed from the dispatch list.   If
it  returns  -1  it  indicates  an error; it is removed, and
CCoommmmBBaassee::::RRuunn(()) returns.  Returning 1 is a normal return.

CCoommmmBBaassee::::RRuunn(()) returns normally  when  there  are  no  more
entries on the dispatch list.
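
The return value protocol (1 to continue, 0 to be removed,
-1 for an error) can be demonstrated with a small select(2)
loop.  This is a sketch only: DemoLoop, DemoDispatchFD and
DemoByteCounter are hypothetical, only read readiness is
watched, and nothing here is claimed to match how CommBase
is actually implemented.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical handler interface, modelled on the description above.
class DemoDispatchFD {
public:
    virtual ~DemoDispatchFD() {}
    // Return 1 to stay registered, 0 to be removed, -1 on error.
    virtual int dispatch(int fd) = 0;
};

// Hypothetical dispatcher that only watches for read readiness.
class DemoLoop {
public:
    void add(int fd, DemoDispatchFD *h) {
        assert(fd >= 0 && fd < FD_SETSIZE && h != 0);
        handlers[fd] = h;
    }

    // Returns 0 once the dispatch list is empty, -1 after an error.
    int run() {
        while (!handlers.empty()) {
            fd_set rd;
            FD_ZERO(&rd);
            int maxfd = -1;
            for (Map::iterator it = handlers.begin();
                 it != handlers.end(); ++it) {
                FD_SET(it->first, &rd);
                if (it->first > maxfd) maxfd = it->first;
            }
            if (select(maxfd + 1, &rd, 0, 0, 0) < 0) {
                if (errno == EINTR) continue;
                return -1;
            }
            for (Map::iterator it = handlers.begin();
                 it != handlers.end(); ) {
                Map::iterator cur = it++;
                if (!FD_ISSET(cur->first, &rd)) continue;
                int r = cur->second->dispatch(cur->first);
                if (r == 1) continue;       // normal return: keep it
                handlers.erase(cur);        // 0 or -1: remove it
                if (r < 0) return -1;
            }
        }
        return 0;
    }

private:
    typedef std::map<int, DemoDispatchFD *> Map;
    Map handlers;
};

// Example handler: counts bytes read from the descriptor until EOF.
class DemoByteCounter : public DemoDispatchFD {
public:
    long total;
    DemoByteCounter() : total(0) {}
    int dispatch(int fd) {
        char buf[64];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) return -1;               // error: stop the loop
        if (n == 0) return 0;               // EOF: unregister
        total += n;
        return 1;
    }
};

// Feed five bytes through a pipe and run the loop until it drains.
long count_pipe_bytes() {
    int p[2];
    if (pipe(p) != 0) return -1;
    bool ok = write(p[1], "hello", 5) == 5;
    close(p[1]);
    DemoByteCounter counter;
    DemoLoop loop;
    loop.add(p[0], &counter);
    int rc = ok ? loop.run() : -1;
    close(p[0]);
    return rc == 0 ? counter.total : -1;
}
```

When the handler sees EOF it returns 0, the list becomes
empty, and run() returns normally, mirroring the behaviour
described for CommBase::Run().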

_5_._3_._2  _D_e_f_e_r_r_i_n_g _R_e_p_l_i_e_s

In normal operation, the filesystem processes one request at
a time, so each operation is replied to before the next is
looked at.  This is a convention of the way the user code
works, and not something the kernel enforces.  The kernel
just sends requests as processes using the filesystem need
them, and each process blocks until the reply to its
particular request arrives.  Therefore, it is possible for
multiple processes to use the filesystem at once.

The DDeeffeerrCCoommmm and DDeeffeerrFFiilleessyyss classes have a method  called
DDeeffeerrRReeppllyy  (the  DeferFilesys once just calls the DeferComm
one to make it accessable to things within the  filesystem).
DeferReply  forks  the  filesystem;  on  the  child  side it
returns 0 and in the parent it returns the pid of the child.
If  the operation method returns -1 then the Filesystem just
goes on to processing the  next  request  from  the  kernel.
When  the child is ready to reply, it can just return in the
normal way.  The call to DeferReply sets  up  the  DeferComm
class in the child process to reply though the parent rather
than going straight to the kernel, in order to make sure the
replies  from multiple processes don't get jumbled up.  When
the reply has been sent back, the child process just  exits.

Because  the child is really a child process, you have to do
all the changes in filesystem state before calling  DeferRe-
ply,  or arrange for some other mechanism for the parent and
children to talk.
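
The fork-and-defer pattern can be imitated with plain
fork(2) and a pipe.  In this sketch demo_defer_reply is a
hypothetical stand-in for DeferReply, and the child writes
its answer down a pipe and exits itself; in the real library
the child simply returns from the operation method and
DeferComm routes the reply through the parent.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for DeferReply(): returns 0 in the child and
// the child's pid in the parent, as described above.
pid_t demo_defer_reply() {
    return fork();
}

// Operation method sketch.  The parent returns -1 at once so that the
// next request can be processed; the child does the slow work, sends
// its reply down reply_fd and exits.
int demo_slow_operation(int reply_fd, int input) {
    pid_t pid = demo_defer_reply();
    if (pid < 0)
        return EAGAIN;                  // fork failed: fail the operation
    if (pid == 0) {
        int answer = input * 2;         // stands in for the slow work
        ssize_t n = write(reply_fd, &answer, sizeof answer);
        _exit(n == static_cast<ssize_t>(sizeof answer) ? 0 : 1);
    }
    return -1;                          // parent: reply is deferred
}

// Plays the part of the parent: start the operation, then collect the
// child's reply from the pipe.
int run_deferred_demo(int input) {
    int p[2];
    if (pipe(p) != 0)
        return -1;
    int rc = demo_slow_operation(p[1], input);
    assert(rc == -1 || rc == EAGAIN);
    close(p[1]);
    int answer = -1;
    if (rc == -1) {
        if (read(p[0], &answer, sizeof answer) !=
            static_cast<ssize_t>(sizeof answer))
            answer = -1;
        waitpid(-1, 0, 0);              // reap the child
    }
    close(p[0]);
    return answer;
}
```

Anything the child changes after the fork is invisible to
the parent, so the only thing that comes back is what was
written to the pipe.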

_5_._3_._3  _M_u_l_t_i_-_t_h_r_e_a_d_e_d _f_i_l_e_s_y_s_t_e_m_s

The TThhrreeaaddCCoommmm class creates a new  lightweight  thread  for
each  request,  using the Rex lwp library (in the lwp direc-
tory).  This allows multiple requests to be  handled  within
the  one  process,  so long as one thread does not block the
whole process in a system call.  The  file  descriptor  dis-
patcher in CommBase is useful for preventing this: see ftpfs
for a complete example of a multithreaded filesystem.
