Subject: Novell E-Net Shell Loading Problem (OS Problem)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Technical Description and Resolution

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! NOTE: This problem has been fixed in later releases !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

NOTE: This problem is actually an operating system bug. The
probability that this problem will ever be encountered in the field
is extremely low. We encountered the problem only because we were
extensively testing every possible configuration. It is somewhat
unlikely that the configuration on which we found the problem would
actually be used by NetWare users.

PROBLEM:

Under specific circumstances, the Novell E-Net shell does not
properly load about 15% of the time. This occurs only under the
following conditions. The file server is an IBM PC XT running
NetWare 86 v2.0a with a Novell E-Net interface card. The network
consists of at least 3 workstations, all with Novell E-Net interface
cards. All but one of the workstations are logged in and are running
disk intensive programs. The last workstation then attempts to load
the shell, but about 15% of the time it gets the following error
message before getting fully connected to the file server:

    Network Error on Server SERVERNAME:Error reading from network. Abort or Retry?

Attempting to retry only causes the error to be redisplayed. The
workstation must then be rebooted before attempting to reload the
shell. Once the shell is successfully loaded, the system will run
indefinitely without any errors.

CAUSE:

This error is caused by the way that NetWare buffers incoming
packets, coupled with the very high network speed of E-Net and the
extreme slowness of the IBM PC XT hard disk. The problem occurs when
the workstation that is attempting to load the shell gets out of
synchronization with the file server.
This occurs just as the shell is attempting to request initial
service from the server. This is the most critical point of
communication between the workstation and the file server. From then
on until the workstation is rebooted, communication between the
workstation and the file server is essentially deadlocked. The
workstation keeps sending requests to the server, but the server
keeps ignoring them because their sequence numbers do not match the
one it is expecting.

This error condition has nothing to do with the Novell E-Net boards
except that the condition was aggravated by the network's high
speed. Potentially, this error could occur with any very high-speed
network attached to a file server that has an extremely slow hard
disk.

TECHNICAL EXPLANATION:

When the shell is first loaded on a workstation, it sends an
"Allocate Slot Request Packet" to the file server requesting it to
open a slot for the workstation. The handling of this packet is
critical because several parameters are initialized at this time.

If the file server is doing considerable processing when packets
arrive, they are placed in a LIFO (last-in, first-out) buffer known
as the Turbo Receive Buffer. The "Allocate Slot Request Packet" is
placed in the buffer along with other packets until the operating
system gets around to processing them. If the file server is
extremely busy, the workstation shell times out. Thinking that the
request packet may have gotten lost, it sends a retry "Allocate Slot
Request Packet". This retry packet is also received and stored in
the Turbo Receive Buffer.

Now the operating system finally completes its other tasks and
starts to process the incoming packets. Because the buffer is a
LIFO, the retry "Allocate Slot Request Packet" is processed first.
The slot parameters are initialized, including the packet sequence
number. The packet sequence number is the number of the next packet
in the sequence of communications between the file server and that
specific workstation.
A reply is generated and sent back to the workstation, and the
packet sequence number stored in the file server is incremented. The
workstation then sends its next request packet, again incrementing
the packet sequence number. The file server buffers the incoming
packet, eventually processes it, and sends back a reply packet.
Again the packet sequence number is incremented.

Finally, the file server gets around to processing the original
"Allocate Slot Request Packet" that had been buried at the bottom of
the stack. This packet causes the file server to reinitialize the
slot parameters for that particular workstation, including the
packet sequence number. The file server then sends out a reply to
this packet, which the workstation ignores because it carries the
wrong sequence number.

Now the file server will no longer accept and process packets from
the workstation because the sequence numbers are out of
synchronization. The workstation is attempting to send valid packets
with valid sequence numbers to the file server, but since the file
server's sequence number counter has been reinitialized, none of
these packets are recognized and they are discarded. For example,
the file server is expecting packet number two (since the sequence
number was reinitialized) but the workstation is attempting to send
packet number four or higher. The workstation can retry sending the
new packets to the file server forever and the server will never
process them. Thus the communication between the file server and the
workstation is deadlocked until the workstation is rebooted and the
shell sends a new "Allocate Slot Request Packet."

If the shell successfully loads and connects to the file server, it
means that the "Allocate Slot Request Packet" has been serviced
properly without duplication. The system will then continue to
operate correctly indefinitely. Packets may still be processed out
of order because of the LIFO reordering.
However, this has no effect on the processing because the sequence
numbers are still synchronized between the file server and the
workstation.

SOLUTION:

Since the problem described above is directly linked to the
operating system, the best way to eliminate the problem would be to
modify the operating system code. This could be an update
consideration in future releases of NetWare.

Since the error is noncritical and recoverable, no immediate
solution is being sought. This decision is based upon the following
facts. The error occurs only under a very specific set of rare
circumstances. Even under these circumstances it occurs only about
15% of the time, and recovery is simple: reboot the workstation and
reload the shell. Once the shell is successfully loaded, no further
problems are experienced.

TIC: date=3-30-87, ref#=031887.008, status=RESOLVED
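
ILLUSTRATION:

The sequence of events under TECHNICAL EXPLANATION can be sketched
as a small simulation. This is only a toy model, not NetWare code:
the names Packet, Server, send_and_wait and simulate are
hypothetical, and the numbering (the allocate request counts as
packet 1, so a freshly initialized slot expects packet 2) is an
assumption chosen to match the "expecting two, sending four" example
above. The FIFO run is included purely as a contrast to show that
the processing order is what strands the duplicate request; it is
not a statement about how later releases fixed the problem.

```python
from collections import deque
from dataclasses import dataclass

ALLOC = "allocate-slot"   # hypothetical label for the Allocate Slot Request
REQUEST = "request"       # hypothetical label for any later request


@dataclass
class Packet:
    kind: str
    seq: int


class Server:
    """Toy file server with one receive buffer and one workstation slot."""

    def __init__(self, lifo=True):
        self.buffer = deque()
        self.lifo = lifo
        self.expected_seq = None  # no slot allocated yet

    def receive(self, pkt):
        self.buffer.append(pkt)

    def process_one(self):
        """Handle one buffered packet; return (packet, accepted)."""
        pkt = self.buffer.pop() if self.lifo else self.buffer.popleft()
        if pkt.kind == ALLOC:
            # (Re)initialize the slot, including the sequence number.
            self.expected_seq = pkt.seq + 1
            return pkt, True
        if pkt.seq == self.expected_seq:
            self.expected_seq += 1
            return pkt, True
        return pkt, False  # wrong sequence number: packet is discarded


def send_and_wait(server, pkt):
    """Send pkt and let the server work until it has handled pkt."""
    server.receive(pkt)
    while True:
        handled, accepted = server.process_one()
        if handled is pkt:
            return accepted


def simulate(lifo):
    server = Server(lifo)
    server.receive(Packet(ALLOC, 1))   # original request, server is busy
    server.receive(Packet(ALLOC, 1))   # retry sent after the shell times out
    server.process_one()               # server answers one of the two
    for seq in (2, 3):                 # workstation carries on normally
        send_and_wait(server, Packet(REQUEST, seq))
    while server.buffer:               # server finally drains leftovers
        server.process_one()
    delivered = False
    for _ in range(3):                 # workstation retries packet 4
        if send_and_wait(server, Packet(REQUEST, 4)):
            delivered = True
            break
    return delivered, server.expected_seq


if __name__ == "__main__":
    print(simulate(lifo=True))    # (False, 2): server expects 2, deadlock
    print(simulate(lifo=False))   # (True, 5): duplicate handled harmlessly
```

In the LIFO run the original allocate request stays buried while
newer packets keep arriving on top of it, and it resets the slot
only after packets 2 and 3 have been accepted. In the FIFO run the
duplicate is handled before any later request, so the reset is
harmless.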