





     UUsseerrffss -- FFiilleessyysstteemmss IImmpplleemmeenntteedd aass UUsseerr PPrroocceesssseess
           _J_e_r_e_m_y _F_i_t_z_h_a_r_d_i_n_g_e _<_j_e_r_e_m_y_@_s_w_._o_z_._a_u_>
                     Softway Pty. Ltd.



_1_.  _I_n_t_r_o_d_u_c_t_i_o_n

Userfs is a mechanism by which normal user processes can act
as a Linux filesystem.  There are many uses for this,
including:

Prototype filesystems

     Prototype  new  block  allocation  algorithms in a user
     process and debug with gdb before going into  the  com-
     pile-crash-reboot cycle of kernel development.

Infrequent use filesystems

     You  want to mount "FooBaz 0X" filesystems under Linux,
     but you don't want it that often, and you don't need it
     to  be  maximum  speed.   Rather than trying to get the
     kernel  itself  to  understand,  or  write  specialised
     tools, write a filesystem program.

Add capabilities to existing filesystems

     Want  compression, encryption, ACLs?  Have a process to
     mirror an existing file tree, but with  your  own
     extensions and semantics.

Completely virtual filesystems and new interfaces

     Add  a  filesystem-type interface to an existing mecha-
     nism, or a filesystem interface as a new way of  repre-
     senting data.  Sick of FTP?  How about

          $$ ccdd //ffttpp//ttssxx--1111..mmiitt..eedduu//ppuubb//LLiinnuuxx
          $$ ccpp RREEAADDMMEE $$HHOOMMEE

     Or mail?

          $$ ccdd //mmaaiill
          $$ llss
          000011 000022 000033
          $$ ccaatt **//FFrroomm
          FFrroomm:: ssbbgg@@ssooccss..uuttss..eedduu..aauu
          FFrroomm:: lleerrooyy@@ssooccss..uuttss..eedduu..aauu ((LLeerrooyy))
          FFrroomm:: ttlluukkkkaa@@vviinnkkkkuu..hhuutt..ffii
          $$ ccaatt **//SSuubbjjeecctt
          MMoorree tthhiinnggss
          ((nnoonnee))
          TThhaatt uusseerrffss tthhiinngg
          $$








                            - 2 -



You get the idea.


_2_.  _I_n_s_t_a_l_l_a_t_i_o_n

_2_._1  _K_e_r_n_e_l

First of all, remove traces of previous versions of  userfs:
make   sure   there   are   no   userfs   header   files  in
linux/include/linux and no userfs patches  to  any  of  the
kernel source.

Apply the  patch  "userfs.diff"  in  the  normal  way  (it's
against  a  1.1.11 kernel).  Do a "make config; make depend;
make", and install.  It is not necessary to copy  any  files
into the linux source tree; only the patch is required.

Since  this is an _a_l_p_h_a release, you should know what you're
doing, and know how to fix up  simple  compilation  problems
with  the  kernel.  I'd be surprised if it just worked.  One
thing you may have problems with is config.in:  some  of  my
private changes have leaked in there, and you're almost cer-
tainly going to have local changes,  so  it's  unlikely  the
patch  will  be  clean.  The only important change is adding
the CONFIG_USERFS_FS line to the end of the filesystem  sec-
tion.

_2_._2  _N_o_n_-_k_e_r_n_e_l _C_o_d_e

Building  the  rest of the code should be a matter of typing
"make" at the  top  userfs  directory.  This  will  generate
dependencies  and  build  the utilities needed (genser), the
library, the clients using the library and the  kernel  mod-
ule.

This  version is a loadable kernel module.  When you specify
"yes" to the userfs question, it doesn't put the  filesystem
code  into  the  kernel;  it only puts some symbols into the
module symbol table needed at module load time.  I  hope  to
eliminate  the  need  for  any special changes to the kernel
soon.

To install the module you need the modutils  package,  which
should  be  available from your local Linux ftp archive.  It
should be clear from its documentation what you need  to  do
with  userfs.o  to  get it into the kernel.  If you get some
warnings about multiply defined symbols, ignore them.   Only
undefined symbols are a problem.

_2_._3  _B_u_g_s_, _c_o_m_m_e_n_t_s_, _e_t_c

When  you  find  a  bug,  tell  me.  Please send me the code
you're using, the kernel version,  whatever  changes  you've
made  to userfs kernel code, and instructions or a script to
reproduce the bug.  Don't just tell me "it broke."

If you've made changes to the kernel code, please send  them
to me rather than sending them out to the world.  Please send me
comments, ideas for new kernel features, or things that  you
think  would  make  good  filesystems but you can't do right
now.  Also feel free  to  ask  questions  about  either  the
implementation  of  my  code or how to write your own userfs
clients.

Send all mail to Jeremy Fitzhardinge <jeremy@sw.oz.au>.


_3_.  _U_s_i_n_g _c_l_i_e_n_t_s

Clients are generally  mounted  with  the  muserfs  command.
It's  quite  simple  -  it's  a program which makes sure the
mount point is legal for the user to mount  on,  and  mounts
the  given  process  with the user's permissions.  Note that
any user can mount a process, so more checking  is  done  on
the mount point than for a normal mount.  Unless the user is
root, the  mount  point  must  be  owned  by  the  user  and
writable.  mmuusseerrffss has a man page, which is even up to date.

There are only a couple of useful clients at present,  homer
and  arcfs.   Homer  is  written  in  C++,  and uses the C++
library in the lib directory to do most of its work.  All it
does  is  set  up  a  single directory under its mount point
which contains a symbolic link named after each user name in
the  password  file,  each  pointing  to the associated home
directory.  Mounted on /u it makes  a  passable  replacement
for  ~ expansion in a shell except it works for any program.

AArrccffss was written by David Gymer.  It allows you to mount  a
compressed  tar  file as a read-only filesystem, and inspect
it with normal tools.  It's pretty neat.


_4_.  _T_h_e_o_r_y _o_f _o_p_e_r_a_t_i_o_n

The kernel patch and module add a new filesystem type to the
kernel ("userfs").  The filesystem itself  is  very  simple:
all  it  does is take the normal kernel filesystem requests,
wrap them up into well-defined packets, squirt them  down  a
file  descriptor  (presumably  connected  to a process) and
wait for the reply on another file descriptor.

If the filesystem process is on the same machine,  then  the
file  descriptors  are  probably  ordinary  pipes.  However,
userfs just reads and writes on  the  file  descriptors,  so
they  could be anything; files, sockets, devices and so on -
userfs doesn't care.

The following is not a  comprehensive  tutorial  on  writing
filesystems,  a  detailed "how it works" or specification of
the existing code.  It is intended to give some idea of what
I  was  thinking,  and  basic concepts to bear in mind while
poking about in my kernel or user code.

_4_._1  _P_r_i_o_r_i_t_i_e_s

I had a number of goals which I  wanted  satisfied  by  this
thing (from most to least important):

Flexibility

     I want the process to have as much of the  power  of  a
     kernel-resident filesystem as possible, so I  have  kept
     the interfaces as generic and flexible as I could.  This
     has been mostly achieved.

Robustness

     Since I see prototyping and development as a major  use
     for  userfs,  it  seems important to make sure that the
     kernel code can't (at worst) crash or lock  up  if  the
     user code fails.  As it stands, it should be impossible
     for a user process to crash the kernel, but it is  pos-
     sible  for a bad user process to lock up processes try-
     ing to use the filesystem, and it is possible  for  the
     kernel  to  muck  up  reference  counts  and  make  the
     filesystem un-umountable.

     It is also possible for a process to go  strange  while
     it is being mounted, leaving a half-mounted filesystem.
     The mountpoint becomes a nulled out inode, but the ker-
     nel  refuses  to unmount it (because it isn't mounted),
     and refuses to mount on it (because it's busy).

Availability to users and Security

     I'd like any user to be able to write a filesystem pro-
     cess.    Traditionally,  filesystems  are  things  that
     embody the security of Unix,  and  are  therefore  very
     much  superuser-only things.  However, there are only a
     couple of really sensitive features that  shouldn't  be
     able to be controlled by any user: suid executables and
     device nodes.  Since a  trusted  superuser  process  is
     still  required to call the mount system call, and that
     process can set the no-suid and  no-device  flags,  the
     filesystem  code  can't use these as security holes.  I
     can't think of anything else that  needs  special  care
     from  a  security  point  of  view.  However, since the
     filesystem is completely under the control of the  pro-
     cess,  one  can make no assumptions about its contents.
     For example "." and ".." may not  do  expected  things,
     symlinks  may  point to places other than what readlink
     returns.  This makes navigating such filesystems a  new
     and interesting experience.

Efficiency

     Efficiency  is  my  lowest  priority,  but  it is still
     important.  Unfortunately the  other  requirements  (as
     usual)  make  things less efficient.  The most signifi-
     cant inefficiency is the context switches  between  the
     kernel  and the process.  I think the most benefits can
     be gained by reducing the number of these.   There  are
     several approaches to this:

        o If  the process wants a well-defined behaviour for
          an operation, then it should be done in  the  ker-
          nel.   The  best  example  of  this  is permission
          checking - if the process wants normal  unix  per-
          mission  checking  then  it  doesn't need to do it
          itself.  Otherwise it can take all the  permission
          requests from the kernel, and implement other per-
          mission policies.  This is currently  implemented.
          When  the  filesystem is first mounted, the kernel
          asks the process what  requests  it  will  accept.
          From  that  point  the  kernel  will  do  sensible
          default actions  for  requests  that  the  process
          doesn't  want  to  handle rather than sending them
          down the connection.

        o Group requests commonly issued together into  one.
          This  is  hard,  since  the  main kernel tells the
          filesystem code  very  little  about  what  it  is
          doing,  so  it  is  hard  to know what to do next.
          However, there  are  a  couple  of  single  kernel
          requests  that  are implemented in the protocol as
          two or more transactions.  This could be fixed  in
          future.

        o Data  can  be  cached  in the kernel.  This is the
          most tricky, since kernel  caching  or  read-ahead
          limits  the amount of control the process can have
          over the data once read.  I think  this  could  be
          optionally  implemented,  depending on whether the
          process says it is OK to do  caching,  and  if  so
          what kinds.

          Currently  directory readahead is implemented with
          the upp_multireaddir operation.  This  allows  the
          filesystem  process  to  return  as many directory
          entries as it  likes.   These  entries  are  saved
          attached  to  the  directory  inode in the kernel.
          Future readdir  requests  look  in  the  readahead
          buffer before sending an operation to the filesys-
          tem.  If it fails to find the required entry  then
          it   dumps  the  readahead  buffer  and  asks  the
          filesystem process again.  This is a win if  there
          are  lots  of  linear  directory searches (such as
          shell globbing, ls or pwd).

        o A larger than 4k maximum packet size can be  used,
          now that the kernel memory allocator allows larger
          than 4k memory allocations.  However, since  pipes
          are the most common connection between filesystems
          and kernels, and pipes can  hold  at  most  4k  of
          data,  there  would  still  be  a  context  switch
          between filesystem code and kernel  every  4k,  so
          there wouldn't be much gain.

     A  number of people have suggested adding shared memory
     between the kernel and the  filesystem  process.   This
     would be quite limiting and is the option least  likely
     to improve things.  At the moment, the filesystem makes no
     assumptions  about  the  nature of the file descriptors
     for talking to the process.  To implement shared memory
     between  the  kernel and the process would require some
     way of finding the process on the other end of the file
     descriptors  (if  any),  and playing around with memory
     maps.  This still wouldn't cut down on  the  number  of
     context switches at all.
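
The directory readahead scheme described in the list above can be
sketched in user-space C++ as follows.  This is only an illustration of
the look-in-the-buffer-then-refetch logic, not the kernel code: DirEntry,
Fetch and Readahead are hypothetical names, and the fetch function stands
in for one upp_multireaddir round trip to the filesystem process.

```cpp
#include <stddef.h>
#include <string>
#include <vector>

// Hypothetical directory entry: a position in the directory plus a name.
struct DirEntry {
    long        offset;
    std::string name;
};

// Stand-in for one upp_multireaddir round trip: returns as many entries
// as the filesystem process likes, starting at the given offset.
typedef std::vector<DirEntry> (*Fetch)(long offset);

class Readahead {
    std::vector<DirEntry> buf;   // entries saved from the last fetch
    Fetch fetch;
public:
    int round_trips;             // how often the process had to be asked

    explicit Readahead(Fetch f) : fetch(f), round_trips(0) {}

    // Return the entry at offset, or 0 at the end of the directory.
    const DirEntry *readdir(long offset)
    {
        for (size_t i = 0; i < buf.size(); i++)
            if (buf[i].offset == offset)
                return &buf[i];          // served from the buffer
        buf = fetch(offset);             // miss: dump the buffer, ask again
        round_trips++;
        if (buf.empty() || buf[0].offset != offset)
            return 0;
        return &buf[0];
    }
};
```

A linear scan through a directory then costs one round trip per batch of
entries rather than one per entry, which is where the saving in context
switches comes from.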

_4_._2  _P_r_o_t_o_c_o_l

The protocol used is machine independent, using network byte
order and defined type sizes.  The code to do the packetisa-
tion  and  depacketisation  is  generated automatically by a
program, given the description of each packet.  This is  not
fully  portable,  but  it  avoids  byte  order and structure
alignment problems.

A packet to or from the kernel has two parts.  The first  is
a header that contains a sequence number, an operation type,
a packet type, size of the following data,  and  a  protocol
version  number.  The packet type can either be a request, a
reply or an enquiry.  Requests and enquiries are always from
the  kernel  to the process, and the process only ever sends
replies to the kernel.  A reply's header has one extra field
-  an  error  field,  containing  an  error number.  Replies
always have the same sequence number as their  corresponding
request  or  enquiry.   If there was an error performing the
operation the error field is set to  the  error  number  and
there  is no additional data returned.  If there is no error
the error field is set to 0.
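
As an illustration of the header just described, here is a minimal
sketch of encoding and decoding a request header in network byte order.
The struct name, the field names, their order and their 32-bit widths are
all assumptions made for the example; the real layout is whatever the
generated protocol code in the userfs sources defines.

```cpp
#include <arpa/inet.h>   // htonl, ntohl
#include <stdint.h>
#include <string.h>      // memcpy, size_t

// Hypothetical request header.  Names, order and widths are assumptions
// for illustration, not the generated userfs definitions.
struct Header {
    uint32_t seq;      // sequence number
    uint32_t op;       // operation type
    uint32_t type;     // packet type: request, reply or enquiry
    uint32_t size;     // size of the data following the header
    uint32_t version;  // protocol version number
};

const size_t HEADER_BYTES = 5 * sizeof(uint32_t);

// Write each field in network byte order into buf, which must hold at
// least HEADER_BYTES bytes.
void encode_header(const Header &h, unsigned char *buf)
{
    const uint32_t fields[5] = { h.seq, h.op, h.type, h.size, h.version };
    for (int i = 0; i < 5; i++) {
        uint32_t wire = htonl(fields[i]);
        memcpy(buf + i * sizeof wire, &wire, sizeof wire);
    }
}

// Inverse of encode_header.
Header decode_header(const unsigned char *buf)
{
    uint32_t f[5];
    for (int i = 0; i < 5; i++) {
        uint32_t wire;
        memcpy(&wire, buf + i * sizeof wire, sizeof wire);
        f[i] = ntohl(wire);
    }
    Header h = { f[0], f[1], f[2], f[3], f[4] };
    return h;
}
```

Copying field by field, rather than writing the struct out in one go, is
what avoids the byte order and structure alignment problems mentioned
above.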

Following a request or reply packet is the  optional  opera-
tion-specific data.  This is passed through the protocol for
interpretation by the operation routines at each end.

The kernel may have multiple outstanding requests.  In other
words,  the kernel may send a new request before receiving a
reply to a previous one.   This  allows  the  filesystem  to
block one process for a slow operation while other processes
can  use  the  filesystem  for  shorter  operations.    This
improves  performance  on,  for  example, an ftp filesystem,
where one process may  be  using  a  fast  local  link,  and
another  may be using a slow international one, and each has
to wait for its own requests to  be  satisfied.   Of  course
this requires the filesystem process to be written with some
form  of  multi-threading.   If  the  process   just   reads
requests,  acts  on  them  and replies then it can do so and
ignore any kernel requests until it is ready  to  deal  with
them.
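
A filesystem process that does want to answer requests out of order has
to remember which requests are still unanswered and copy the right
sequence number into each reply.  A minimal sketch of that bookkeeping,
assuming hypothetical Reply and Outstanding types that are not part of
userfs or libuserfs:

```cpp
#include <map>
#include <stddef.h>
#include <stdint.h>

// Hypothetical reply record holding only the two fields discussed above.
struct Reply {
    uint32_t seq;    // copied from the request being answered
    int      error;  // 0 on success, otherwise an error number
};

// Requests that have been read from the kernel but not yet answered,
// keyed by sequence number.
class Outstanding {
    std::map<uint32_t, uint32_t> ops;   // seq -> operation type
public:
    void accept(uint32_t seq, uint32_t op) { ops[seq] = op; }

    // Complete a request, in any order.  Returns false if seq is not
    // outstanding (for example, it has already been answered).
    bool finish(uint32_t seq, int error, Reply &out)
    {
        std::map<uint32_t, uint32_t>::iterator it = ops.find(seq);
        if (it == ops.end())
            return false;
        ops.erase(it);
        out.seq = seq;
        out.error = error;
        return true;
    }

    size_t count() const { return ops.size(); }
};
```

With this, a slow ftp transfer can sit in the table while a later, quicker
request is finished and replied to first.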

_4_._3  _H_a_n_d_l_e_s

The  base  element  of a filesystem is an inode.  The kernel
needs to be able to uniquely identify an inode.  Internally,
inodes  are  uniquely numbered within a filesystem, but each
mounted filesystem has  its  own  numbering.   Therefore  an
inode  is  completely  identified  by  an inode number and a
filesystem identifier (or device,  even  though  it  doesn't
mean much for a filesystem which is not on a disk).

A device is what distinguishes mounted filesystems from  one
another, and an inode is what distinguishes files  within  a
filesystem  from each other.  Inode numbers are generated by
each filesystem, and are used by the kernel to refer to spe-
cific  files  to the filesystem specific code.  User process
filesystems are no exception.  Between the  kernel  and  the
filesystem process, files are referred to by using  handles,
which are essentially 32 bit unsigned numbers.  When a  pro-
cess first mentions a file to the kernel, it gives it a han-
dle, which the kernel uses for all later operations  on  the
file.   It  is  the handle which identifies the file, rather
than the name, so it is important to  use  distinct  handles
for  distinct  files,  and never change the handle of a file
once it has been given to the kernel.
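
One way to meet both requirements (distinct handles for distinct files,
and a handle that never changes) is a small table that hands out a fresh
number the first time a file is seen and the same number ever after.  The
sketch below is an assumption about how a client might do this, not code
from libuserfs; HandleTable and its string key are hypothetical.

```cpp
#include <map>
#include <stdint.h>
#include <string>

typedef uint32_t Handle;   // handles are 32 bit unsigned numbers

// Hands out one handle per distinct file and never changes it.  The
// string key is a stand-in for whatever identifies a file inside the
// filesystem process; it is not the name the kernel sees.
class HandleTable {
    std::map<std::string, Handle> known;
    Handle next;
public:
    HandleTable() : next(1) {}

    Handle handle_for(const std::string &file_id)
    {
        std::map<std::string, Handle>::iterator it = known.find(file_id);
        if (it != known.end())
            return it->second;     // same file, same handle as before
        Handle h = next++;
        known[file_id] = h;
        return h;
    }
};
```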

_4_._4  _R_a_n_d_o_m _o_p_e_r_a_t_i_o_n _s_p_e_c_i_f_i_c _a_d_v_i_c_e _a_n_d _b_l_u_r_b

This may eventually accurately describe the whole  protocol,
but for now it's a list of interesting points and things that
have bitten me.

Normally when  writing  a  filesystem  you  should  use  the
library  _l_i_b_u_s_e_r_f_s  (see  below), and use the advice in this
section as a guide on what kind of things should be  put  in
your userfs operation functions, or for idle curiosity.

_4_._4_._1  _M_o_u_n_t_i_n_g

The  mount  is initiated by a user process calling the mount
system call, with the  "userfs"  filesystem  type.   In  the
filesystem  specific  data,  the  process  passes  two  file
descriptor numbers for the kernel  to  read  and  write  to.
These can be any kind of file descriptor at all.  Most  com-
monly they would be  pipes  or  sockets,  but  there  is  no
restriction.   All  the kernel requires is that the one it
uses to talk to the process is writable, and the one it gets
replies from is readable.

The  most  important  request  is mounting.  Most important,
because it is the only  request  that  the  process  has  to
implement  (of  course, not implementing anything else would
be completely useless).  The request itself is not that com-
plex.   All  it  does is return a handle of the inode at the
root of the filesystem.   Most  commonly,  this  will  be  a
directory.   Userfs  does  not  enforce this, but the kernel
itself may.

After the process returns the root handle, the  kernel  will
probe  the  process  to see what operations it is willing to
support.  This is done by sending a series of enquire  pack-
ets  to  the  process.  The process should reply with normal
reply packets, with the errno field either set to 0 if it is
supported  or  ENOSYS if it isn't.  No real operation should
be done, and no additional information should be sent in the
reply.   If  the  process replies ENOSYS to an operation, it
will never receive it again, and the kernel will use a  sen-
sible  default  for it (typically what the kernel would nor-
mally do for an in-kernel filesystem if it  doesn't  support
the  operation).   The  filesystem process should send 0 for
the operations it explicitly supports, and ENOSYS for every-
thing  else,  so the protocol can be extended without having
to modify existing clients.
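
The rule in the last sentence amounts to keeping a list of the operations
you really support and answering ENOSYS for anything not on it, including
operations you have never heard of.  A minimal sketch follows; the Op
values and the Enquirer class are hypothetical, and the real operation
codes come from the userfs protocol headers.

```cpp
#include <errno.h>
#include <set>

// Hypothetical operation codes, for illustration only.
enum Op { OP_MOUNT, OP_IREAD, OP_READ, OP_WRITE, OP_READDIR, OP_LOOKUP };

class Enquirer {
    std::set<int> supported;
public:
    void support(int op) { supported.insert(op); }

    // Value for the error field of the reply to an enquiry: 0 if the
    // operation is explicitly supported, ENOSYS for everything else,
    // including operation codes this client does not know about.
    int enquire(int op) const
    {
        return supported.count(op) ? 0 : ENOSYS;
    }
};
```

Because unknown codes fall through to ENOSYS, a client written this way
keeps working if operations are later added to the protocol.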


_4_._4_._2  _R_e_a_d_i_n_g _I_n_o_d_e_s

A pretty common (perhaps most common) thing for the filesys-
tem  to  be  doing is reading inodes.  For the process, this
involves filling out a  structure  much  like  the  kernel's
inode  structure  and the stat structure.  For this version,
the important thing is to make sure  the  nlinks  field  is
non-zero.   If  it is 0 then the kernel will never "put" the
inode, and it will make the filesystem un-umountable.

When the kernel wants an inode from the filesystem, it  uses
the uupppp__iirreeaadd protocol request to fetch it.  This happens if
something in the kernel asks for the  inode,  but  it  isn't
already in the kernel inode table.  Therefore, once the ker-
nel has asked the filesystem for an inode, it will  not  ask
for it again while anything in the kernel is using it.

Once  nothing  in  the kernel is using the inode, the kernel
will issue an upp_iput operation, which may be  preceded  by
an  upp_iwrite if the inode was modified in use.  A filesys-
tem need not implement these operations if there is no  need
to do so.

4.4.3  Open and Close

Reading and putting inodes are the basic operations: regard-
less of what an inode is being used for it will be read  and
put.   The  uupppp__ooppeenn  and  uupppp__cclloossee operations specifically
correspond to the ooppeenn(2) and cclloossee(2) system  calls.   Nor-
mally  a filesystem doesn't need to perform any special han-
dling for these operations, and would not normally implement
them, except if it wants to know the identity of the process
doing the operations.  When a program issues an open  system
call for a file on the user filesystem, the kernel will send
a _u_p_p___o_p_e_n operation for the file, which  includes  complete
identifcation  for  the process which issued the open.  When
the filesystem replies it returns a _c_r_e_d_e_n_t_i_a_l_s _t_o_k_e_n_.  From
then on, that credentials token is sent to the filesystem in
all operations which correspond to a system call which takes
a    file    descriptor    as    an    argument,   such   as
rreeaadd,wwrriittee,rreeaaddddiirr,llsseeeekk and so on.

This may seem a bit complex: why not just send  the  uid  of
the process with the operations?  Well, the credentials of a
process are quite complex,  since  they  include  the  real,
saved  and  effective  uids and gids of the process, and all
the auxiliary groups.  Sending this with each request  would
be quite an overhead.  The idea is that all the info is sent
on an open, and the filesystem process can associate it with
a token internally, and only use the token in correspondence
with the kernel.
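
Inside the filesystem process this can be as simple as a table from token
to the saved credentials.  The sketch below shows the idea; Creds, Token
and CredTable are hypothetical names, the 32-bit token width is an
assumption, and dropping the entry on close is one possible policy rather
than something the protocol is known to require.

```cpp
#include <map>
#include <stdint.h>
#include <sys/types.h>   // uid_t, gid_t
#include <vector>

// Hypothetical credentials record holding the items listed in the text.
struct Creds {
    uid_t ruid, euid, suid;
    gid_t rgid, egid, sgid;
    std::vector<gid_t> groups;   // auxiliary groups
};

typedef uint32_t Token;   // the width of the token is an assumption

class CredTable {
    std::map<Token, Creds> table;
    Token next;
public:
    CredTable() : next(1) {}

    // On open: remember the credentials and return a token for the reply.
    Token remember(const Creds &c)
    {
        Token t = next++;
        table[t] = c;
        return t;
    }

    // On later operations, which carry only the token.  Returns 0 if
    // the token is unknown.
    const Creds *lookup(Token t) const
    {
        std::map<Token, Creds>::const_iterator it = table.find(t);
        return it == table.end() ? 0 : &it->second;
    }

    // On close: the token is no longer needed.
    void forget(Token t) { table.erase(t); }
};
```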

Also note that the credentials are associated with  an  open
file  descriptor,  not the process performing the operation.
Mostly a process will deal with file descriptors it has cre-
ated itself, but it's quite possible for it to inherit  file
descriptors from another process with  a  different  set  of
credentials.  In this case the filesystem knows the original
process's credentials, but not those of  the  process  which
is performing the operation.

_4_._4_._4  _H_a_n_d_l_e _M_a_n_a_g_e_m_e_n_t

The handle of an inode is the only way the kernel  and  the
filesystem can talk about a file.  An inode  may  have  more
than  one  name, or no names at all, so file names are not a
good way of keeping track of a file.   Use  inodes  in  your
filesystem  code  to keep track of files, even if you have a
simple 1:1 name to file mapping.

Handles must also be consistent.  Of course you must  always
keep  the  handles of files currently in use consistent, but
you must also keep them consistent between uses.  If a  pro-
cess  opens a file once, closes it and then reopens it, then
it will expect it to have the same inode  number  if  it  is
supposed to be the same file (which is how processes using a
user filesystem will see the file handles).

Also, if you ever refer to a handle  in  communication  with
the kernel, you must be prepared for the kernel to ask about
it.  For example, if the kernel reads a directory  with  the
upp_readdir  or  upp_multireaddir  operations, each entry in
the reply will have a name and a handle.  Each of those han-
dles  must  be the handle of the file if the kernel looks at
the file more closely.  If you make them all the  same,  for
example,  then  a  program would be entitled to believe that
all the names in the directory refer to one actual file.

_4_._4_._5  _D_e_a_l_i_n_g _w_i_t_h _m_u_s_e_r_f_s

Writing a client which can be handled  by  muserfs  is  very
easy.   The important thing to remember is that the filesys-
tem process can basically ignore muserfs, and ignore  issues
like how to quit and so on.

A  userfs filesystem process should only terminate under one
condition: it gets an EOF (a read of 0 bytes) from the  ker-
nel on the file descriptor  it  is  reading  operation  requests
from.  Muserfs runs the filesystem process with most signals
ignored, so that muserfs can handle them itself.   When  the
muserfs process is sent a SIGINT or SIGTERM it unmounts  the
filesystem mount point with the umount(8) command  (used  so
that /etc/mtab is updated  properly).   This  causes  the
kernel  to  send  the  filesystem  process  a upp_umount
operation.  The kernel will close  its  end  of  the  file
descriptors, and the process is expected to do the same,  if
only by exiting.  Therefore, when trying to unmount a userfs
filesystem, do not kill the filesystem process directly, and
do not kill muserfs with SIGKILL.  Either way you should  be
able to unmount with umount as root.


_5_.  _U_s_i_n_g _l_i_b_u_s_e_r_f_s

_l_i_b_u_s_e_r_f_s is a C++ library designed to make writing filesys-
tem clients easier.  It is designed so all the  work  common
to almost all filesystems is encapsulated into a few generic
classes, which can be used  as  base  classes  for  specific
filesystem functions.

_5_._1  _B_a_s_i_c _C_l_a_s_s_e_s

The most basic classes, CCoommmm, FFiilleessyysstteemm and IInnooddee implement
the basic communication with the kernel and stub methods for
each operation.

The Comm class reads from the kernel and decodes the headers
of the operation packets, and passes the  remainder  to  the
Filesystem class.  The Filesystem performs the operation and
returns an unencoded return header and the encoded  body  of
the  reply,  if  any.   All  this is not exposed to the code
above the library.

Filesystem takes each operation and  dispatches  it  to  the
appropriate  place.   The  Filesystem class directly handles
the operations which are global  to  the  whole  filesystem,
such as mounting or unmounting.  For operations which pertain
to a particular Inode (such as reading, or looking up a name
in a directory), Filesystem looks up the Inode in its  table
and dispatches the operation to it.

The Inode class has all its  methods  implemented  as  stubs
which fail with the "not implemented" error code.   It  also
has members for the standard inode properties of mode, type,
size, ownership, links, timestamps and so on.

These  classes  are completely useless on their own, so they
must be used as base classes for other classes which actually
do something.  libuserfs has more specific, but still gener-
ally useful classes.

SSiimmpplleeIInnooddee implements a simple  inode  with  some  normally
expected  behaviour.  It has a constructor which initializes
the inode properties to sensible values, and  methods  which
implement  simple  defaults  for the open, close and permis-
sions check operations.

DDiirrIInnooddee,, derived from SimpleInode, implements all the oper-
ations needed for a directory, including linking and unlink-
ing inodes to/from names, rename, and directory scanning and
lookup.  It takes very little extra code to implemement sim-
ple directory behaviour.

_5_._2  _W_r_i_t_i_n_g _y_o_u_r _o_w_n _f_i_l_e_s_y_s_t_e_m _c_l_a_s_s_e_s

A complete filesystem has two parts: a collection of inodes,
one  for  each  file,  and  the filesystem structure itself,
which holds all the inodes together.  Each inode  represents
a file in the filesystem, regardless of type.  There is only
one inode per file, even if the file appears multiple  times
under different names.

_5_._2_._1  _A_r_g_u_m_e_n_t_s _a_n_d _r_e_t_u_r_n _v_a_l_u_e_s _o_f _o_p_e_r_a_t_i_o_n _m_e_t_h_o_d_s

Each method with the name do_something in the Filesystem and
Inode classes corresponds to an operation in the userfs pro-
tocol.   As  a result, they all have similar argument struc-
tures.  All such methods have  const  up_preamble  &pre  and
upp_repl &repl, which are references to the operation request
and reply packet headers.  Mostly there  is  no  reason  for
operation  methods  to  use them, because their contents are
dealt with in lower levels of  the  library,  but  they  are
there if you want them.

Each  userfs  protocol  operation may have arguments, return
values, both or neither, and the method for  that  operation
will have corresponding arguments.  For an operation named x
the method argument with the operation arguments  will  have
the  type const upp_x_s, and the return values argument will
have the type upp_x_r.  For example, the  up_read  operation
will correspond to the Inode method

     int Inode::do_read(const up_preamble &pre, upp_repl &repl,
                        const upp_read_s &args, upp_read_r &ret);

The  contents  of  the  structures,  along with encoding and
decoding functions, are  machine  generated,  and  therefore
have a consistent set of rules.  Mostly it's  quite  simple,
with normal base types directly corresponding to C  and  C++
types.   However,  variable  sized types need to have both a
pointer to the data and the size of the  data  encoded  into
them.  Memory for the data is allocated with the C++ new and
delete operators, through the alloc  method  of  a  variable
sized object.  The memory  is  automatically  freed  by  the
method's caller.  For example, if a return value of a method
contains a member called name representing  a  filename,  it
would be set with the following sequence (assuming ourname is
a normal 0 terminated string):

     iinntt nnaammeelleenn == ssttrrlleenn((oouurrnnaammee));;
     rreett..nnaammee..aalllloocc((nnaammeelleenn));;                   //// AAllllooccaattee mmeemmoorryy
     rreett..nnaammee..nneelleemm == nnaammeelleenn;;                  //// SSeett nnaammee lleennggtthh
     mmeemmccppyy((&&rreett..nnaammee..eelleemmss,, oouurrnnaammee,, nnaammeelleenn));; //// SSeett nnaammee ccoonntteennttss
     //// ......

Note  that strings are never 0 terminated; the length of the
returned string is exactly the number of characters  in  the
string.

If  the  operation the method is performing fails, it should
return the appropriate error code,  or  0  if  it  succeeds.
Don't  return -1 unless you mean to - it has special meaning
(see below, in "Deferring Replies").

_5_._2_._2  _D_e_r_i_v_i_n_g _f_r_o_m _F_i_l_e_s_y_s_t_e_m

A class derived from Filesystem must implement a  number  of
methods to make the filesystem viable:

_E_n_q_u_i_r_e is  called when the kernel wants to find what opera-
     tions your filesystem supports.  For all the operations
     that  any  inode  will  implement,  return 0 and return
     ENOSYS for the rest.

_d_o___m_o_u_n_t takes no arguments and returns the handle  for  the
     inode  for  the root directory (that is, the top direc-
     tory of your filesystem).  The kernel immediately  does
     a ddoo__iirreeaadd operation using this handle.
You  can also implement _d_o___s_t_a_t_f_s which allows the kernel to
get space and inode usage statistics, such as when  "df"  is
executed,  and  _d_o___u_m_o_u_n_t  so  the  filesystem  is  formally








                           - 13 -



informed when it is unmounted (normally it just gets an  EOF
from the kernel, and Comm::Run returns).
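The rule Enquire follows (0 for supported operations, ENOSYS
for everything else) might be sketched as below.  The Op
enumeration and the EnquireTable class are hypothetical; the
real operation codes and the real signature of
Filesystem::Enquire come from the userfs headers and are not
reproduced here.

```cpp
#include <cassert>
#include <cerrno>
#include <set>
#include <utility>

// Hypothetical operation codes; the real protocol defines its own.
enum class Op { iread, iwrite, read, write, lookup, readdir, symlink };

// Answers "is this operation supported?" in the way the text
// describes: 0 for supported operations, ENOSYS for the rest.
class EnquireTable {
    std::set<Op> supported_;
public:
    explicit EnquireTable(std::set<Op> ops) : supported_(std::move(ops)) {}

    int enquire(Op op) const {
        return supported_.count(op) ? 0 : ENOSYS;
    }
};
```

A read-only filesystem would list operations such as iread,
read, lookup and readdir, and leave out write and symlink.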

_5_._2_._3  _D_e_r_i_v_i_n_g _f_r_o_m _I_n_o_d_e

Most  of  the  work of the filesystem is done in the inodes.
All inode classes must be derived from Inode, and  generally
there will be a number of different Inode based classes.

It  is  probably  better to use SimpleInode as a base rather
than plain Inode, because it implements simple defaults  for
some  methods,  which  would  otherwise  fail.   If Filesys-
tem::Enquire says that the filesystem supports a  particular
operation,  then  any  inode  should be prepared to get that
operation from the kernel.

Similarly, unless you are doing something special,  deriving
directories from DirInode saves a lot of work.

Only do_iread need be implemented, but obviously the
filesystem will do nothing interesting unless  other  opera-
tions  are implemented.  do_iread returns the details of the
inode.  Note that the Filesystem class calls the do_iread of
the  Inode  when the operation comes from the kernel, so the
inode must exist by the time the kernel asks  for  it.   The
constructor  for  Inode automatically registers the inode in
the Filesystem's inode  table;  conversely,  the  destructor
removes it.
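The registration behaviour described above is an instance of
tying table membership to object lifetime.  A minimal sketch,
assuming a table keyed by an integer handle: MiniFilesystem
and MiniInode are hypothetical names, not the userfs classes,
and the real Inode constructor may take different arguments.

```cpp
#include <cassert>
#include <map>

class MiniInode;

// Hypothetical stand-in for the Filesystem's inode table.
class MiniFilesystem {
    std::map<long, MiniInode *> table_;
public:
    void add(long handle, MiniInode *ino) { table_[handle] = ino; }
    void remove(long handle) { table_.erase(handle); }

    // Returns the inode registered under handle, or nullptr.
    MiniInode *find(long handle) const {
        auto it = table_.find(handle);
        return it == table_.end() ? nullptr : it->second;
    }
};

// The constructor registers the inode and the destructor removes it,
// so a request for a handle can only reach a live object.
class MiniInode {
    MiniFilesystem &fs_;
    long handle_;
public:
    MiniInode(MiniFilesystem &fs, long handle) : fs_(fs), handle_(handle) {
        fs_.add(handle_, this);
    }
    virtual ~MiniInode() { fs_.remove(handle_); }

    MiniInode(const MiniInode &) = delete;
    MiniInode &operator=(const MiniInode &) = delete;

    long handle() const { return handle_; }
};
```

This also shows why the inode must be constructed before its
handle is given to the kernel: until then a lookup in the
table finds nothing.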

Here are some other useful methods for an Inode; the
descriptions are brief and general, and don't necessarily
mention all the arguments and return values.  Those not
mentioned can be ignored.

_d_o___i_w_r_i_t_e is, obviously, the opposite of do_iread.  It  sim-
     ply sets the various Inode values.

_d_o___i_p_u_t is  called  when  the  kernel is no longer using the
     inode.  That is, the inode is no longer open, the  cur-
     rent  or  root  directory  of a process, being executed
     from or being mapped from.  If an inode is iput and has
     no  names  (has  no name to inode mapping in any direc-
     tory) it can be destroyed.

_d_o___r_e_a_d allows data to be read from the file.  The arguments
     are  the  offset in the file to start reading from, and
     the number of bytes desiried.  The method may return as
     many  bytes up to that number as it likes, including 0,
     which means EOF.

_d_o___w_r_i_t_e does the converse; a block of data and an offset is
     passed  in,  and the method returns the number of bytes
     actually written.









                           - 14 -



_d_o___l_o_o_k_u_p translates a name into an inode  reference.   This
     is  typically  implemented for directories; if the name
     exists in the directory the method  should  return  the
     handle of the inode, or fail with ENOENT.

_d_o___d_i_r_r_e_a_d returns  the  next  directory entry at the passed
     offset.  It returns the name and inode of the next file
     in  the  directory, and the size of the entry returned.
     This is added by the kernel to the  current  offset  in
     the  directory to form the offset of the next directory
     entry for the next call.  Since the  directory  entries
     don't correspond to real file storage as in other, more
     conventional filesystems,  a  directory  entry  can  be
     regarded as having an offset of 1.

     If the end of the directory has been reached, it should
     return a new offset of 0.

_d_o___m_u_l_t_i_r_e_a_d_d_i_r is similar to do_readdir, but can return any
     number  of  directory  entries,  which  are cached in a
     readahead buffer in the kernel.  If a program asks  for
     a  directory  entry  for  an  inode  which has a cached
     directory entry then the entry will  come  from  within
     the  kernel  rather than asking the filesystem process.
     This operation can return as few as 1 entry (and so  is
     like  do_readdir),  or  as many as will fit in a return
     packet (up to 4k  or  so  of  entries).   Returning  no
     entries  means  the  end  of  the  directory  has  been
     reached.  Returning multiple entries improves the  per-
     formance of directory scans, most frequently done by ls
     and pwd.

     Look at the implementation of DirInode::do_multireaddir
     for details of how this should be dealt with.

_d_o___c_r_e_a_t_e does  all  file  creation,  whether it be a normal
     file, a directory, a fifo file or a device  node.   The
     mode  contains  the type of the file in same way as the
     stat structure member sstt__mmooddee..

_d_o___u_n_l_i_n_k is the opposite, and is used for unlinking (remov-
     ing a name to inode mapping) files and directories.  If
     an inode is not in use and has no links then it can  be
     destroyed and its handle can be reused.

_d_o___s_y_m_l_i_n_k is used to create new symlink inodes.  It returns
     the handle of the new inode.

_d_o___r_e_a_d_l_i_n_k returns the pathname which a  symbolic  link  is
     pointing to.

_d_o___f_o_l_l_o_w_l_i_n_k returns  the  pathname  of the file a symbolic
     link is really referring  to.   If  Filesystem::Enquire
     says  the  filesystem  does not support this operation,








                           - 15 -



     the readlink operation is used instead.

_d_o___o_p_e_n is called when a file is  actually  opened.   It  is
     only  necessary to implement this if it is important to
     know whether a file is being opened as opposed to being
     used  in  any  other  way.   This  operation passes the
     filesystem the complete authentication  credentials  of
     the  process doing the open, so that the filesystem can
     do extended security checking or change  the  behaviour
     of the file depending on the user.

     This  method  can return a credential token, which is a
     magic number used by the filesystem process to refer to
     the  set of credentials passed by the kernel.  The ker-
     nel attaches this credentials token to each each opera-
     tion  generated  by system calls on the file descriptor
     generated by the open (read(), write(),  readdir()  and
     close()).   The  credentials  token is part of the file
     descriptor, so is inhereited unchanged if the  descrip-
     tor  is  passed to another process, even if it has dif-
     ferent credentials.

     When a file is opened, a new file table entry  for  the
     inode  is  created.  That file table entry has a single
     file descriptor referring to it.  More file descriptors
     can  be  made to refer to the file table entry with the
     dduupp(2) system call, and can be removed with cclloossee(2).

_d_o___c_l_o_s_e is called when the last file descriptor for a  file
     table  entry  is closed.  The only argument for this is
     the credentials token for that  file  table  entry,  so
     that the filesystem can free all references to it.

_d_o___p_e_r_m_i_s_s_i_o_n is called when the filesystem says it wants to
     do permissions checking.  This is called a lot, and can
     cause  many  more operations to pass between the kernel
     and filesystem process.  If  the  filesystem  does  not
     implement it the normal unix user/group/others checking
     is performed.

_d_o___r_e_n_a_m_e moves a file from  one  directory  to  a  new  one
     (though it may be the same).
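The offset and count contract of do_read and do_write in the
list above can be illustrated with an in-memory file.  This is
a sketch only: MemFile and its method signatures are
hypothetical and much simpler than the real operation methods,
which also take the protocol's request and reply structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical in-memory file showing the read/write contract.
class MemFile {
    std::string data_;
public:
    // Copy up to count bytes starting at offset into buf.  Returns
    // the number of bytes copied; 0 means EOF.
    std::size_t read(std::size_t offset, std::size_t count, char *buf) const {
        if (offset >= data_.size()) return 0;
        std::size_t n = std::min(count, data_.size() - offset);
        std::memcpy(buf, data_.data() + offset, n);
        return n;
    }

    // Store count bytes at offset, growing the file with zero bytes
    // if needed.  Returns the number of bytes actually written.
    std::size_t write(std::size_t offset, const char *buf, std::size_t count) {
        if (count == 0) return 0;
        if (offset + count > data_.size()) data_.resize(offset + count, '\0');
        std::memcpy(&data_[offset], buf, count);
        return count;
    }
};
```

A read that starts before the end of the file but asks for
more than is left returns a short count; only a read at or
past the end returns 0.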

_5_._2_._4  _D_e_r_i_v_i_n_g _f_r_o_m _D_i_r_I_n_o_d_e

DirInode implements a number of userfs operation methods for
directories, such as readdir, multireaddir and  lookup.   It
also  automatically constructs directories with "." and ".."
entries pointing to the appropriate places.

DirInode deals with strings a lot, and rather than using the
normal char * it uses the libg++ String class for all string
arguments to its own methods (but not, of course, for the
userfs protocol operation methods).

DirInode expects a pointer to the parent directory, which is
also a class derived from DirInode.  If the directory is at
the top of the filesystem's tree, the pointer passed should
be NULL.  The protected member parent points to the parent
inode, or to this for the top one; it should never be NULL.

DirInode  keeps  a  list of files in the directory, but does
not allow that list to be directly visible.  The only opera-
tions  for manipulating the directory contents for a derived
class are:

iinntt lliinnkk((ccoonnsstt SSttrriinngg nnaammee,, IInnooddee **)) which links a new  name
     into the directory, updating all the reference and link
     counts;

iinntt uunnlliinnkk((ccoonnsstt SSttrriinngg nnaammee)) which does the opposite;

DDiirrEEnnttrryy **llooookkuupp((ccoonnsstt SSttrriinngg nnaammee)) which returns  a  direc-
     tory entry if it finds the file, or NULL otherwise; and

DDiirrEEnnttrryy **ssccaann((PPiixx &&ppooss)) which returns the  directory  entry
     at  _p_o_s_,  updating  it in the process, or NULL if there
     are no more entries.

DDiirrEEnnttrryy **ssccaann((iinntt &&ppooss)) is the  same,  except  it  uses  an
     integer offset, which is less efficient.
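The four operations above can be modelled with a standard
container.  The sketch below is not DirInode: MiniDir and
Entry are hypothetical, std::string stands in for the libg++
String class, an inode is reduced to a number, and the error
codes returned by link and unlink are assumptions.  Only the
integer form of scan is shown.

```cpp
#include <cassert>
#include <cerrno>
#include <iterator>
#include <map>
#include <string>

// Hypothetical directory entry: a name and an inode number.
struct Entry {
    std::string name;
    long inode;
};

// Hypothetical directory that keeps its entry list private and
// exposes only link, unlink, lookup and scan.
class MiniDir {
    std::map<std::string, Entry> entries_;
public:
    // Add a name; EEXIST (an assumed choice) if it is already there.
    int link(const std::string &name, long inode) {
        if (entries_.count(name)) return EEXIST;
        entries_[name] = Entry{name, inode};
        return 0;
    }

    // Remove a name; ENOENT (an assumed choice) if it is missing.
    int unlink(const std::string &name) {
        return entries_.erase(name) ? 0 : ENOENT;
    }

    // Return the entry for name, or nullptr if it is not found.
    const Entry *lookup(const std::string &name) const {
        auto it = entries_.find(name);
        return it == entries_.end() ? nullptr : &it->second;
    }

    // Return the entry at position pos and advance pos, or nullptr
    // when there are no more entries.  Walking to pos is linear,
    // which is why an integer offset is the less efficient form.
    const Entry *scan(int &pos) const {
        if (pos < 0 || pos >= static_cast<int>(entries_.size())) return nullptr;
        auto it = entries_.begin();
        std::advance(it, pos);
        ++pos;
        return &it->second;
    }
};
```

A derived class would typically call scan in a loop from its
readdir method and lookup from its lookup method.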

_5_._2_._5  _D_e_f_e_r_r_i_n_g _R_e_p_l_i_e_s

In normal operation, the filesystem processes one request at
a time, so each operation is replied to before the next is
looked at.  This is a convention of the way the user code
works, and not something the kernel enforces.  The kernel
just sends requests as processes using the filesystem need
them, and each process blocks until its particular request
is replied to.  Therefore, it is possible for multiple
processes to use the filesystem at once.

The Comm and Filesystem classes have a method called
DeferReply (the Filesystem one just calls the Comm one to
make it accessible to things within the filesystem).
DeferReply forks the filesystem; on the child side it
returns 0 and in the parent it returns the pid of the child.
If the operation method then returns -1 in the parent, the
Filesystem just goes on to processing the next request from
the kernel.  When the child is ready to reply, it can just
return in the normal way.  The call to DeferReply sets up
the Comm class in the child process to reply through the
parent rather than going straight to the kernel, in order to
make sure the replies from multiple processes don't get
jumbled up.  When the reply has been sent back, the child
process just exits.
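The fork-and-reply-through-the-parent pattern can be sketched
with plain POSIX calls.  This is not the DeferReply
implementation: the names Deferred, defer, collect and
slow_square are hypothetical, and a pipe is assumed as the
channel between child and parent.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical record of one deferred request.
struct Deferred {
    pid_t child;     // pid of the forked worker
    int   reply_fd;  // read end of the pipe carrying its reply
};

// Fork a child to run work(arg).  The child sends its result to the
// parent through a pipe and exits; the parent returns at once.
inline bool defer(int (*work)(int), int arg, Deferred &d) {
    int fds[2];
    if (pipe(fds) != 0) return false;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return false; }
    if (pid == 0) {                       // child side
        close(fds[0]);
        int result = work(arg);
        ssize_t w = write(fds[1], &result, sizeof result);
        _exit(w == static_cast<ssize_t>(sizeof result) ? 0 : 1);
    }
    close(fds[1]);                        // parent side
    d.child = pid;
    d.reply_fd = fds[0];
    return true;
}

// Read the child's reply and reap the child.
inline bool collect(Deferred &d, int &reply) {
    ssize_t n = read(d.reply_fd, &reply, sizeof reply);
    close(d.reply_fd);
    int status = 0;
    if (waitpid(d.child, &status, 0) != d.child) return false;
    return n == static_cast<ssize_t>(sizeof reply)
        && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Placeholder for a slow operation.
inline int slow_square(int x) { return x * x; }
```

Between defer and collect the parent is free to serve other
requests.  Anything the child changes in its own memory after
the fork is invisible to the parent; only the value written to
the pipe comes back.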

Because the child is really a separate process, you have to
make all the changes to filesystem state before calling
DeferReply, or arrange some other mechanism for the parent
and children to talk.