Bugs found while compiling and running XINU.

MASTER copy of bugs, dated 29. May 2001

Document written by:
   Tom Evans

Contact info:
   InitialSurnameAt tennyson.com.au
   (to prevent spam robots from reading my email address)

NOTE Version E3 has not been checked at this stage (apart from bugs
T.18-25).

BOOKS
-----

The XINU TCP code is documented in "Internetworking with TCP/IP Volume II.

The sources corresponding to the first edition are available as

    ftp://ftp.cs.purdue.edu/pub/dls/xinu7.9.tar.Z

Code approximating the second edition is available as

    ftp://ftp.cs.purdue.edu/pub/Xinu/XINU-PENTIUM.TAR.Z

New sources corresponding to the next edition of the book are available as:

    ftp://ftp.cs.purdue.edu/pub/dls/v2e3ALPHA_tar.gz

I will refer to these as the "E1" (Edition 1), "E2" and "E3" versions.

The second version is the one to use for anyone working with this
code. If porting to a MC68xxx, start with the E2 code and replace the
assembly and driver code with that from the E1 version. Then replace
as much as possible from the E3 version.

VERSION DIFFERENCES
-------------------

There are two minor versions of the E1 code. The "xinu7.9" sources
correspond to the first edition with modifications documented in the
"v2.errata" file. The original files are available in the "BOOK.OLD"
directory.

Datestamps on the source files indicate that some of the errata items
were added between late 1991 and April 1993.

There are major differences between E1 and E2. Specifically the Urgent Data
handling is different (simpler and better) in E2, and OSPF and Multicasting
have been added. There are probably other changes.

E1 is a Sun-3 port with drivers for Sun3 console, LANCE ethernet and
contains 68000 assembly code.

E2 is a 80x86 port with drivers for PC hardware supporting Intel
EtherExpress and SMC Ultra ethernet cards, and contains 80386 assembly.

There is no release of code exactly matching the second edition of the
book. The "XINU-PENTIUM" code is close, but has 14 TCP files with newer
datestamps than the publishing date of the book. The newer files are
tcpsync.c, tcpxmit.c, tcp_in.c, tcprmss.c, tcpackit.c, tcpdemux.c,
tcplisten.c, tcpreset.c, tcpsend.c, tcpcksum.c, tqdump.c, tcpinp.c,
tcpstat.c and tcpbind.c. Differences are usually minor. For instance,
tcpcksum() is passed an extra parameter in the newer code.

E3 is an ANSI version of the E2 code with bugfixes.

OTHER VERSIONS
--------------
There are also SPARC, M68K, PDP11, VAX and MAC code sources in
"ftp.cs.purdue.edu/pub/Xinu".  The SPARC version might contain virtual
memory extensions.  The M68K file is huge, but only because it includes
all object and binary files!  The sources are far older than the 7.9
release, and don't even contain TCP/IP. Stay away from them.

BUGS IN TCP/IP
--------------
Bugs are listed with the file they appear in, the version of code they
apply to and an indication of the severity.

----------
T.1.
File:     tcpd/tcpread.c
Version:  E1 and E2
Symptom:  tcpread() calls complete when they shouldn't
Severity: Minor

tcpread.c contains:

    } else if (len > ptcb->tcb_rbcount &&
               ptcb->tcb_flags & TCBF_BUFFER &&
>>>>           (ptcb->tcb_flags & TCBF_PUSH|TCBF_RDONE) == 0) {
        signal(ptcb->tcb_mutex);
        goto retry;

Operator precedence puts & ahead of |, so the above will NEVER evaluate
to TRUE and tcpread will never block until a read buffer is filled.

----------
T.2.
File:     ip/ipfhcopy.c
Version:  E1 and E2
Symptom:  May crash fragmenting IP datagrams. May generate bad
          IP fragments.
Severity: Major

ipfhcopy.c contains the code:

    if (otype & IPO_COPY) {
        blkcopy(&pepto->ep_data[hlen],
        pepfrom->ep_data[i-1], olen);
        hlen += olen;

That doesn't copy the DATA, but copies from where the data is pointing
TO (memory locations -128 to +127 or 255), it should be "&pepfrom...".

----------
T.3.
File:     tcp/tfcoalesce.c
Version:  E1 and E2
Symptom:  May crash or corrupt memory
Severity: Major

tfcoalesce contains the code:

    struct tcpfrag *tf;
    ...
    if (ptcb->tcb_rnext == ptcb->tcb_finseq)
        goto alldone;
    ...
alldone:
    do
        freemem(tf, sizeof(struct tcpfrag));
    while ((tf = (struct tcpfrag *) tcpdeq(ptcb->tcb_rsegq)) != NULL);

If the first test passes (the most common case) it then goes and calls
freemem() on an uninitialised pointer. It should initialise tf to NULL
and only free it if it isn't NULL.

----------
T.4.
File:     tcp/tfinsert.c
Version:  E1 and E2
Symptom:  TCP Sequence number sign problem, may drop session
Severity: Major

tfinsert contains the code:

    if (enq(ptcb->tcb_rsegq, tf, -tf->tf_seq) < 0)

This inserts the item on the queue using the following in enq():

    /* start at tail and move towards head, as long as key is greater */
    while(i >= 0 && key > qp->q_key[i])

This can fail when the sequence number changes sign and there
is already an item on the queue. The new item is likely to be put in
the wrong order in the queue, leading to failure to reassemble or
worse.

This bug could be fixed by changing the way enq handles keys, but this
might break other code that makes assumptions about enq. Otherwise the
segmentation code could be rewritten to do its own queuing or it could
put items on the queue using offsets from a particular sequence number the
way the E1 Urgent code does.

Probably the best way to fix this is to change enq() to use the same
sequence-number comparison that the TCP code uses. Thus change:

   while(i >= 0 && key > qp->q_key[i]) { --i; }
to
   while(i >= 0 && (key - qp->q_key[i]) > 0) { --i; }

----------
T.5.
File:     tcpd/tcpread.c
Version:  E1 and E2
Symptom:  None, unless major changes made during porting
Severity: None

tcpread contains the code:

        if (state != TCPS_ESTABLISHED && state != TCPS_CLOSEWAIT)
                return SYSERR;

tcpwrite contains the code:

        if (state != TCPS_ESTABLISHED && state != TCPS_CLOSEWAIT)
                return SYSERR;

At first glance these appear to be wrong, as it is usually legal in TCP
to write but not to read in CLOSEWAIT. The code is correct as the read is
required to return the status (that the state machine is in CLOSEWAIT). It
is not permitted to read in FINWAIT as there is no Xinu equivalent to
the Unix shutdown(), and so it isn't possible to legally be in these
states.

Anybody adding a shutdown() call to Xinu will have to change these tests.

----------
T.6.
File:     tcp/tcpsynrcvd.c
Version:  E1 and E2
Symptom:  Could cause problems when multiple requests made on listener
Severity: Minor

tcpsynrcvd contains the code:

    if (pcount(pptcb->tcb_listenq) >= pptcb->tcb_lqsize) {
        TcpAttemptFails++;
        signal(pptcb->tcb_mutex);
        return tcbdealloc(ptcb);
    }

When the listen queue is full, the received TCP is deallocated, instead
of the session being CLOSED. The Close is never signaled through to
the other end. Thus it requires the other end to send data in order
to get RST segments. If the client only listens for responses (for
instance calling a daytimed) it will wait "for ever".

Changing the code to the following fixes the problem.

 if (tcplenq(pptcb->tcb_listenq) >= pptcb->tcb_lqsize) {
         TcpAttemptFails++;
         tcpreset(pep);
         signal(pptcb->tcb_mutex);
         return tcptcbdealloc(ptcb);
      }

Except it causes another bug. See T.24.

----------
T.7.
File:     tcp/tcpabort
Version:  E1 and E2
Symptom:  FINWAIT1 and CLOSING locking up.
Severity: Major

If there's no response to the FIN sent in FINWAIT1 or CLOSING,
tcpabort is called and the socket remains in these states for ever.
The same happens if RST or SYN are received in these states. This
case (not getting ACK to final FIN) is only handled properly in
the LASTACK state.  This can be fixed in tcpabort(). In the case
of the above two states the socket is forced to TCPS_TIMEWAIT and
tcpwait() is called. The following code fixes the bug:

Original code:

   tcpkilltimers(ptcb);
   ptcb->tcb_flags |= TCBF_RDONE|TCBF_SDONE;

Changed to:

   ptcb->tcb_flags |= TCBF_RDONE|TCBF_SDONE;
   tcpkilltimers(ptcb);
   if ((ptcb->tcb_state == TCPS_FINWAIT1) ||
       (ptcb->tcb_state == TCPS_CLOSING)) {
      ptcb->tcb_state = TCPS_TIMEWAIT;
      tcptcbfreebufs(ptcb);
      tcpwaittime(ptcb);
   }
   ptcb->tcb_flags |= TCBF_RDONE|TCBF_SDONE;

----------
T.8.
This bug seems to have gone missing. I don't know how this happened.

----------
T.9.
File:     tcp/tcpacked.c
Version:  E2 only
Symptom:  TCP can't send data at all
Severity: Major

Discovered by fwmiller@cs.umd.edu (Frank W. Miller) and sent to
comp.os.xinu on 22nd August, 1997:

tcpacked contains the following code:

    if ((ptcb->tcb_flags & TCBF_SUPOK) &&
        SEQCMP(ptcp->tcp_ack, ptcb->tcb_supseq) >= 0)
    ptcb->tcb_sbstart = (ptcb->tcb_sbstart+acked) % ptcb->tcb_sbsize;

As written, it only advances the tcb_sbstart pointer (current starting
point in the send buffer) when the send urgent pointer is valid.
This looks like an error caused by a misapplied patch. The lack of
indentation on the third line indicates that the previous line has
been deleted.

The missing line (insert after line starting "SEQCMP" above) is:

         ptcb->tcb_flags &= ~TCBF_SUPOK;

Without this line, if the SUPOK flags gets set it never gets cleared.

----------
T.10.
File:     tcp/tcpsndlen.c
Version:  E1 and E2. E1 is far worse.
Symptom:  Low throughput (10-30kbytes/sec) when far-end flow-controlled
Severity: Major. Disobeys RFC-1122.

In the E2 code, the calculation for the length of data to send is as
follows:

    if (rexmit || (ptcb->tcb_code & TCPF_SYN)
        *poff = 0;
    else
        *poff = ptcb->tcb_snext - ptcb->tcp_suna;
    datalen = ptcb->tcb_scount - *poff;
    datalen = min(datalen, ptcb->tcb_swindow);
    datalen = min(datalen, ptcb->tcb_smss;

The amount of data that a TCP is allowed to send is best stated
in RFC1122, section "4.2.3.4  When to Send Data" as:

    The "useable window" [TCP:5] is:
        U = SND.UNA + SND.WND - SND.NXT
    i.e., the offered window less the amount of data sent
    but not acknowledged.

The above equation is easier to understand if written as:
    U = SND.WND - (SND.NXT - SND.UNA)

Xinu sends the minimum of the data available to send and the CURRENT
send window. It should calculate the OFFERED send window, which is
ptcb->tcb_swindow - (ptcb->tcb_snext - ptcb->tcp_suna). Under some
circumstances it can overrun the receiver and send beyond the offered
window.

Changing the second last line to the following improves this:

       datalen = min(datalen, ptcb->tcb_swindow - *poff);

The above code will send zero bytes of data into a closed window when in
PERSIST state.

In the E1 code the same code is as follows:

    if (rexmt || ptcb->tcb_sudq == EMPTY) {
        if (rexmt || (ptcb->tcb_code & TCPF_SYN))
            *poff = 0;
        else
            *poff = ptcb->tcb_snext - ptcb->tcb_suna;
        datalen = ptcb->tcb_sbcount - *poff;
        /* remove urgent data holes */
        if (!rexmt) {
                datalen = tcpshskip(ptcb, datalen, poff);
                datalen = min(datalen, ptcb->tcb_swindow);
        }
        return min(datalen, ptcb->tcb_smss);
    }

In the case where the remote window is full, and TCP is in TCBF_PERSIST
state, it is calling tcpsend which calls tcpsndlen with rexmit TRUE.
The E2 code returns a datalen of zero and poff of zero. The E1 code
returns tcp_smss which is either 1440 or 576.

It thus sends 1440 bytes (or however much is pending if less) into a closed
window which is against RFC-1122's recommendations. The remote drops the
data and transmission is stalled until the sender's retransmit timer goes
off. This drops throughput by a factor of 40.

The fix for this is to move the line setting datalen against tcp_swindow
down one line so it is outside the if statement, and to change it as shown
for the E2 code case to add the "offered window" check.

----------
T.11.
File:     tcp/tcppersist.c
Version:  E1 and E2
Symptom:  Stunningly low throughput (3 bytes/sec) when near-end
          flow-controlled
Severity: Major

tcppersist contains the code:

   if (event != PERSIST)
      return OK;  /* ignore everything else */

The above means that when in PERSIST state, sending one packet every
minute, it is also limiting itself to sending only one ACK per minute.

When a segment arrives that require an ACK to be sent, tcpdodat() sets
TCBF_NEEDOUT, and this causes tcpdata() to call tcpkick(), which puts
a SEND event on the TCP output port.
This is fine unless the transmitter is in PERSIST mode, in which case
the above code ignores the SEND. If the PERSIST timer is wound up to
its maximum value, this means that only one ACK gets sent per minute,
and only one segment is able to be received per minute. This also does
wonders to the sender's RTT estimation (it is now 1 minute).

Instead of a megabyte/second I've measured 200 bytes/MINUTE. Allowing SEND
in tcppersist() fixes this, as long as the persist timer is only increased
on PERSIST events. This is not a complete fix as it doesn't handle
RETRANSMIT timeouts properly.

----------
T.12.
File:     tcpd/tcpwr.c
Version:  E1 and E2
Symptom:  Throughput lower than necessary. May give "silly windows
          syndrome" with other TCP's that don't have receiver-side
          avoidance code.
Severity: Minor

tcpwr is documented in the E3 book at 17.7.3 as performing "silly
windows avoidance". The code performs:

    if (isurg || ptcb->snext == ptcb->tcb_suna)
        tcpkick(ptcb);

This is "Congestion Control" as documented by Nagle in RFC-896 but
doesn't include SWS as documented in RFC-1122 (see next bug). The code
will run faster if it calls tcpkick when it can send a full-sized
segment as recommended by RFC-1122.

----------
T.13.
File:     tcp/tcpsend.c
Version:  E1 and E2
Symptom:  May give "silly windows syndrome" with other TCP's that don't
          have receiver-side avoidance code.
Severity: Minor (disobeys "MUST" item in RFC-1122)

The transmitting code doesn't perform Sender SWS-Avoidance, as
is required by RFC-1122. It will send tinygrams whenever the receiver
window opens. It thus depends on the receiver performing receiver
SWW avoidance.

----------
T.14.
File:     tcp/tcpgetdata.c
Version:  E1 and E2
Symptom:  Serious drop in throughput, down to 2 kbytes/sec when
          receiver is flow-controlling sender.
Severity: Minor (disobeys "MUST" item in RFC-1122)

The following code in tcpgetdata sends an ACK when the window opens:

    /*
     * open the receive window, if it's closed and we've made
     * enough space to fit a segment.
     */
    if (SEQCMP(ptcb->tcb_cwin, ptcb->tcb_rnext) <= 0 &&
        tcprwindow(ptcb)) {
        ptcb->tcb_flags |= TCBF_NEEDOUT;
       tcpkick(ptcb);
    }

It only sends one ACK when the window opens FROM ZERO. When
interoperating with BSD-derived code (including Win-NT), BSD's
SWS-avoidance code stops it from sending small segments into a closing
window until 5 SECONDS have elapsed. Since Xinu doesn't send any ACKs
in this case throughput drops to 2k bytes/second.

The fix for this involves changing tcprwindow() so that it can be told
not to update ptcb->tcp_cwin and to have the above code changed to
send ACKs whenever the advertised (in the previous ACK) window increases
by 2*MSS or BUFSIZE/2.

----------
T.15.
File:     tcp/tcprwindow.c
Version:  E1 and E2
Symptom:  Minor decrease in performance.
Severity: Minor. Change makes code match RFC-1122.

tcprwindow contains the code:

        /*
         *      Receiver-Side Silly Window Syndrome Avoidance:
         *  Never shrink an already-advertised window, but wait for at
         *  least 1/4 receiver buffer and 1 max-sized segment before
         *  opening a zero window.
         */
        if (window*4 < ptcb->tcb_rbsize || window < ptcb->tcb_rmss)
                window = 0;

If the above is changed from "||" to "&&" (and the comment changed from
"buffer and" to "buffer or") it matches RFC1122 and sends window updates
earlier.

----------
T.16.
File:     tcp/tcpsend.c
Version:  E1 and E2
Symptom:  Code can't send sensible Urgent Data.
Severity: Minor.

tcpsend contains the code:

        if (ptcb->tcb_flags & TCBF_SUPOK) {
                unsigned short up = ptcb->tcb_supseq - ptcp->tcp_seq;

                if (up >= 0) {
                        ptcp->tcp_urgptr = up + 1;      /* 1 past end */
                        ptcp->tcp_code |= TCPF_URG;
                } else
                        ptcp->tcp_urgptr = 0;
        } else
                ptcp->tcp_urgptr = 0;

"unsigned short up" can never be less than zero, so if urgent data is
ever sent the pointer will always be set to something, even when not
legal. Together with a previous bug which never cleared the TCBFSUPOK flag,
once set it always reports urgent data, as far as 65k ahead of the current
sequence number. The variable must be long and must be compared against
the correct limits for a 16-bit pointer.

----------
T.17.
File:     tcp/tfinsert.c
Version:  E1 and E2
Symptom:  Compiler warnings
Severity: Minor

tfinsert is defined as:

    int  tfinsert(ptcb, seq, datalen, gotfin)

tfinsert is defined with four parameters, 4th not used, only called with
three.

----------
T.18.
File:     tcp/tcpopts.c
Version:  E1 and E2, fixed in E3
Symptom:  Lockup with Windows 98
Severity: Major

Tcpopts will go into an infinite loop if any TCP options other than
the ones defined in RFC793 are received (options 0, 1 and 2). Options
documented in RFC1700 now go up to 18, and Windows 98 seems to use
Option 4 (SACK Permitted). This is fixed in E3 by the addition of the
code between "default" and "break" below:

    default:
        popt++; /* skip option code */
        if (*popt > 0 && *popt <= popend - popt - 1)
                popt += *popt - 1;
        else
                popt = popend;  /* bogus option length */
        break;

----------
T.19.
File:     tcp/tcpfin1.c and tcpclosing.c
Version:  E1, E2 and E3
Symptom:  TCP locks up in FINWAIT2/PERSIST, can corrupt files
Severity: Major

The symptom is that TCP sockets sometimes lock up in FINWAIT2 and PERSIST
states. This is "impossible" as PERSIST means there is more data to
transmit and FINWAIT2 means that all data including the FIN has been
acknowledged. I have had FTP sessions doing this, and also transferring
corrupted (short) files while not locking up.

Summarising the "expected sequence of events":

   tcpclose() sets TCBF_SNDFIN and transitions to FINWAIT1.
   tcpsend() sees TCBF_SNDFIN and sets TCPF_FIN.
   tcpfin1() calls tcpacked() which clears both TCBF_SNDFIN and TCPF_FIN.
   tcpfin1() waits until TCPF_FIN is cleared and transitions to FINWAIT2.
   tcpclosing() also waits on TCPF_FIN and transitions to TIMEWAIT.

The above sequence assumes that tcpclose() always calls tcpsend() (via
tcpkick(), psend(), signal(), tcpout() and tcposwitch()). It also
assumes that tcpsend() will immediately set TCPF_FIN.

The first assumption can be broken, but only by changing the process
priorities or by changing the implementation (running the TCP code
on a different operating system). It may also fail if the stack is
using delayed acks.

The second assumption fails when TCP is sending into a closing (or closed)
window and doesn't send the FIN or set TCPF_FIN. This is how it gets
into FINWAIT2 *and* PERSIST.

tcpsend() only sets TCPF_FIN if the following test passes:

   if ((ptcb->tcb_flags & TCBF_SNDFIN) &&
       SEQCMP(ptcp->tcp_seq+datalen, ptcb->tcb_slast) == 0)
      ptcb->tcb_code |= TCPF_FIN;

The fix is to change the test in tcpfin1() and tcpclosing() to test
for both flags. There are three places:

tcpclosing():

   if (((ptcb->tcb_code & TCPF_FIN) == 0) &&
>>>    ((ptcb->tcb_flags & TCBF_SNDFIN) == 0)) {
      ptcb->tcb_state = TCPS_TIMEWAIT;
      tcptcbfreebufs(ptcb);
      tcpsignal(ptcb->tcb_ocsem);      /* wake closer */
      tcpwaittime(ptcb);
   }

tcpfin1():

   if (ptcb->tcb_flags & TCBF_RDONE) {     /* FIN received from remote */
      if ((ptcb->tcb_code & TCPF_FIN) ||   /* Our FIN not ACKed*/
>>>       (ptcb->tcb_flags & TCBF_SNDFIN)) {
         ptcb->tcb_state = TCPS_CLOSING;
      }
      else {
         ptcb->tcb_state = TCPS_TIMEWAIT;
         tcptcbfreebufs(ptcb);
         tcpsignal(ptcb->tcb_ocsem);   /* wake closer */
         tcpwaittime(ptcb);
      }
   } else if (((ptcb->tcb_code & TCPF_FIN) == 0) &&
>>>           ((ptcb->tcb_flags & TCBF_SNDFIN) == 0)) {
      tcpsignal(ptcb->tcb_ocsem);      /* wake closer */
      ptcb->tcb_state = TCPS_FINWAIT2;
   }

----------
T.20.
File:     netwrite.c
Version:  E1, E2 and E3
Symptom:  Buffer leak in IP
Severity: Minor (unless CLASSD addresses are being generated)

The netwrite() function is meant to dispose of all buffers sent
to it. If it (or any of the routines it calls) fails to deliver a
buffer it returns SYSERR. In the following case it seems to get
this wrong:

        if (IP_CLASSD(pep->ep_nexthop)) {
                restore(ps);
                return SYSERR;
        }

The above code should be replaced by:

        if (IP_CLASSD(pep->ep_nexthop)) {
                freebuf(pep);
                restore(ps);
                return SYSERR;
        }

Note that "Class D" addresses are used by IGMP. It looks from the
above code that IGMP may be "present but unsupported", as it looks
as if all Multicast packets sent through netwrite() will be dropped.

----------
T.21.
File:     tcp/tcpdemux.c
Version:  E1 and E2
Symptom:  TCP won't work with source TCP ports over 32767
Severity: Major

The port in a TCP packet is declared to be unsigned, but the
stored port number in the TCP TCB is signed, as follows:

   tcp.h:  unsigned short  tcp_sport; /* source port       */
   tcp.h:  unsigned short  tcp_dport; /* destination port  */
   tcb.h:  short   tcb_rport;         /* remote TCP port   */
   tcb.h:  short   tcb_lport;         /* local TCP port    */

The code in tcpdemux.c that demultiplexes incoming TCP packets
to the correct socket is:

   if (ptcp->tcp_dport == tcbtab[tcbn].tcb_lport &&
       ptcp->tcp_sport == tcbtab[tcbn].tcb_rport &&
       blkequ(pip->ip_src, tcbtab[tcbn].tcb_rip, IP_ALEN) &&
       blkequ(pip->ip_dst, tcbtab[tcbn].tcb_lip, IP_ALEN)) {
       break;
   }

If the source port is negative the above comparison fails, and TCP
responds with a RESET segment for each packet after the SYN packet.
Ciscos tend to use the high source port numbers.

Changing the definitions for tcb_lport and tcb_rport in tcb.h
fixes the bug. This has already been done in the E3 code.

----------
T.22.
File:     ip/ipreass.c
Version:  All
Symptom:  IP Reassembly doesn't follow RFC791
Severity: Minor

Spotted by Charles Esson <charlese@cvs.com.au>.

RFC791 calls for the Datagram ID used during reassembly to be the
Source and Destination IP Addresses, the Protocol field and the ID.
Xinu ignores the Protocol and the Destination IP address. This is
unlikely to cause problems in practice.

----------
T.23.
File:     ip/ipfhcopy.c
Version:  All
Symptom:  IP Fragmentation can corrupt header length
Severity: Major

Spotted by Charles Esson <charlese@cvs.com.au>.

The ipfhcopy() code conditionally copies options from one IP packet
to another (pepfrom to pepto). It never updates the IP Header Length
field in the packet, and neither does any other code in Xinu. Thus if
an option isn't copied the packet will be corrupted as the data in the
packet starts where the options finished, but the IP header still has
the old (longer) length in it.

----------
T.24.
File:     tcp/tcpsynrcvd.c
Version:  All
Symptom:  TCP rejects connections to LISTEN socket under load.
Severity: Major

Sockets opened as "listeners" have a connection queue, with a default
depth of 5. Other TCPs (Linux, BSD) simply ignore incoming SYN requests
when the queue is full. Xinu will accept an "unlimited" number of SYNs,
returning SYN-ACKs and transitioning the sockets to the SYNRECVD state.
Sockets are not put on the queue at this stage. It is only when the
next ACK arrives that tcpsynrcvd() attempts the enqueue the socket,
transition it to ESTABLISHED and signal the user to ACCEPT the
connection.

If the queue is full, the socket is closed and a RST segment is sent.
This closes sockets that the remote end considers to be "fully open".

This can best be fixed by moving this check from tcpsynrcvd to
tcplisten, which on receipt of a SYN has to manually count all
of its "child" sockets in TCPSYNRCVD state and add the queue depth.
If the queue size is exceeded (for all the currently queued sockets
plus the ones in TCPSYNRVCD state which will be queued later), the
packet is ignored so the client will retry.

This bug was caused by the "fix" in T.6, but prior to T.6 it did
something worse.

----------
T.25.
File:     tcp/tcpkick.c
Version:  All
Symptom:  TCP Delayed ACK code broken
Severity: Major

Delayed ACKs are off be default in the XINU code. This disobeys RFC
1122. If enabled, TCBF_DELACK delays all ACKS by 1/5 second, and
prevents ACKS from being sent for every second segment. This would
limit throughput to one window-full for every 1/5 second (or 128
kbytes/second). TCBF_DELACK also delays all outbound DATA packets
for 1/5 second as well.

----------
T.26.
File:     tcp/tcpostate.c
Version:  All
Symptom:  Sockets lock up in CLOSING state
Severity: Major

The tcpostate() function contains the following code:

        if (ptcb->tcb_sbcount == 0) {
                ptcb->tcb_ostate = TCPO_IDLE;
                return OK;
        }

This sends the transmitter IDLE if all the data has been acked.
It does not check to see if the FIN is waiting to be acked. Thus
a socket that sends data, closes, sends a FIN, receives a FIN and
then had the data (but not the FIN) acked will end up in
CLOSING state but with its transmitter IDLE. It remains there
forever as it never retransmits the FIN and the TIMEWAIT timer
isn't running either.

The test should be:

        if ((ptcb->tcb_sbcount == 0)  &&
            ((ptcb->tcb_code & TCPF_FIN) == 0)) {

----------
T.27.
File:     tcp/tcpiss.c
Version:  All
Symptom:  Initial sequence number selection bad.
Severity: Minor

RFC 793 recommends that the initial sequence number be generated
from a 32-bit clock incrementing every 4 microseconds.

RFC 1948 recommends cryptographic generateion of these
numbers to defend against sequence number attacks.

Xinu starts with "clktime" and increments by 904 for each
connection. It should use the clock for every connection and
should also add some randomness.

----------

BUGS IN PARTS OTHER THAN TCP/IP
-------------------------------
X.1.
File:     shell/x_snmp.c
Version:  E1 and E2
Symptom:  SNMP code may fail to detect EOF or may fail on DELETE
Severity: Minor

x_snmp.c contains the code:

    char ch;
    if ((ch = getc(stdin)) == EOF)

Variable "ch" should be int. It may not match EOF or may match on DELETE.

----------
X.2.
File:     rfs/mount.c
Version:  E1 and E2
Symptom:  May crash on some input
Severity: Minor

mount.c contains the code:

    if (prefix == NULL)
        prefix == NULLSTR;
    if (replace == NULL)
        replace == NULLSTR;

The second sets of "==" should be "=".

----------
X.3.
File:     shell/x_ifstat.c
Version:  E1
Symptom:  Compiler warning
Severity: Minor

x_ifstat.c contains the variable:

        char    *str[80];

That's meant to be (and is used as) "char str[80]".

XINU TASK SWITCHING AND INTERRUPTS PROBLEMS
-------------------------------------------

This documents what the E1 (Sun3) code does. I don't know if the
PENTIUM (E2 dated) code is the same, but it looks like it is (only perhaps
worse).

It looks like the Xinu scheduling system only allows 10 "events" to
run per second. This may explain why the some of the TCP throughput
bugs weren't found as they were concealed by scheduling problems.

## Note I'm not 100% sure about the way the code works. I've already
## corrected these notes twice after people have pointed out
## misunderstandings. if I've got something wrong PLEASE tell me.

Interrupt service routines usually operate as follows:

    1. A hardware interrupt occurs,
    2. The Interrupt Service Routine runs, and then
    3. Reads data from hardware and stores in a structure shared with
       the rest of the driver, and then
    4. Signals the operating system to schedule the driver code, which
    5. Runs at non-interrupt time, handles the data and signals the
       operating system to schedule the user code that wants the data.

Xinu supports the above scheme by using signals to schedule the other
parts of the code. This will work as expected AS LONG AS the interrupt
service routine process priority is HIGHER than the driver priority. If it
is the same or lower, the signal will call-through (will immediately run)
the mainline driver code, resulting in that code running DURING the
interrupt-service routine. Worse, when the context switches, the saved
status register is restored and this ENABLES HARDWARE INTERRUPTS again.
This is from a call in the middle of an interrupt service routine. This
can't possibly work properly.

Xinu calls ready() with RESCHYES from signal(). Thus signal() will
switch to the signaled process if it is the same or higher priority.
The Ethernet interrupt service routines set themselves to NETIPRI which
matches the network priority NETPRI. Thus ethinter() calls ethdemux()
which calls ni_in() calls ip_in() which calls send() which calls
ready(RESCHYES) which runs (say) TCP or IP.

Both TCP and IP are created with IPPRI and TCPPRI, which are also 100,
so the IP and TCP code will ALSO run in the interrupt context. When
does this stop? The user code runs under the shell, which is run at
SHCMDPRI (20).

Thus the Ethernet interrupt service routine has run code all the way
to TCP, which may then transmit a frame back through TCP, IP and the
Ethernet code, all under the Ethernet receive interrupt context.

When does the user code run? It can't get scheduled from the idle
task as idle never gives up the CPU.

User code has to wait until the next non-clock interrupt runs, or
until the clock interrupt preemption code notices nothing has happened
for a whole second (1/10 second on PENTIUM code) and calls resched().
If a process is woken up by the timer (through wakeup()) then resched()
is run by this and things will fly again.

Any other interrupt (serial, ethernet, keyboard) will "accidentally"
cause user code to run as part of their signaling to the other half
of the drivers.

It is likely that user code, TCP, IP and Ethernet driver code will run
under a serial port context, blocking all following serial port
interrupts until control returns to the interrupt routine (it of all
interrupt routines doesn't reenable hardware interrupts until it exits).

No instructions can ever run that aren't run under an interrupt context
because of the above priorities and the following code in nulluser:

   while (TRUE) {          /* run forever without actually */
       pause();            /* executing instructions */
   }

The above behavior could be fixed if all interrupt service routines
changed their runtime priorities to something very high (say 1000)
and the idle loop (nulluser) was changed to:

   while (TRUE) {          /* run forever without actually */
      pause();             /* executing instructions */
      disable(ps);
      resched();
      restore(ps);
   }

With this change the interrupt code would run until the first send(),
signal() or resched() and then return to nulluser() where the above
resched() would immediately run all the appropriate processes.

This would minimise the time spent in interrupts, and spent with
interrupts being locked out by other interrupts, and would result in
most of the code being switched from the idle context.

NOTE I used to suggest calling chprio() to change the priority of
the interrupt service routine, but chprio() calls resched() and that
has unwanted side effects. Directly modifying "proctab[currpid].pprio"
in the interrupt routine seems to work:

> Subject:      Re: resched() analysis...
> Date:         Fri, 04 Sep 1998 07:55:49 -0500
> From:         Tom McDermott <mcdermot@NOSPAM.aud.alcatel.com>
> Organization: Alcatel Network Systems
> Newsgroups:   comp.os.xinu
>
> Have implemented the code:
>
> while(TRUE) {
>   disable(ps);
>   resched();
>   restore(ps);
> }
>
>     and that does the trick!  That, along with the changing of
> priorities, using direct manipulation of the proctab structure has the
> PING performance where it should be -- about 2-3 milliseconds.  Seems
> stable, no lock-ups or crashes after one-two hours of testing.

It looks like the Pentium Ethernet code fails to change the priority
at all. This gives the same effective result as in the SUN3 case.

There's no code in the Pentium source that changes the priority.

The Pentium idle process is simpler than the SUN one. It is:

    while (TRUE)
        /* empty */;

It executes the above loop all the time it isn't servicing interrupts.
As with the SUN3 one it should be calling resched(), preferably after
calling pause().

The following excerpts illustrate the problems with the scheduling system
that led to the above change recommendations. Both were caused by the
interrupt service routines running at a low (zero) priority and then
calling through to user code that blocked, leaving the ready queue empty.

Tom McDermott [mcdermot@aud.alcatel.com] (Mon 24/08/98 10:45 PM) stated
(concerning adding the resched() in the idle task):
> I tried this, but it introduces a problem in that after some thousands
> of passes through the resched loop, resched encounters a pass when the
> readylist is empty - this of course instantly crashes XINU.
> My guess is that there is a bug somewhere else in XINU that allows
> the condition of an empty readlist, and the pause(); kludge seems to
> solve it.  Any experience with this?

This seems to be the same as...

Frank Miller (fwmiller@cs.umd.edu, http://www.cs.umd.edu/~fwmiller) has
spotted/suffered a problem where an interrupt service routine interrupts
the IDLE task and calls through to code that BLOCKS. There is now no
routines on the READY list and the code crashes unless special code is
added to cope with this case.

----------

CODING PRACTICES
----------------
These are practices that GCC complains about. These aren't bugs, but
this is how I found most of the real ones. All of these are in the E1
sources.

tcpd/tcpcon.c:42:
    if (error = ptcb->tcb_error)

tcpd/tcpmopen.c:34:
    if (error = tcpcon(ptcb))

tcp/tcprmss.c:29:
    ptcp->tcp_offset = ((hlen<<2) & 0xf0) | ptcp->tcp_offset & 0xf;

The code is OK, but it is necessary to add brackets to shut gcc up...

tcp/tcprtt.c
    rrt = tmclear(tcps_oport, MKEVENT(RETRANSMIT, ptcb-&tcbtab[0]));

The definition of MKEVENT in h/tcpfsm.c:
    #define MKEVENT(timer, tcb)  ((tcb<<3) | (timer & TMASK))

Gcc complains about the resulting "ptcb - &tcbtab[0] << 3"

The definition of MKEVENT needs ()s around parameters.

tcpok.c

   rv = (ptcp->tcp_seq - ptcb->tcb_rnext) >= 0 &&
        (ptcp->tcp_seq - wlast) <= 0;
   rv |= (slast - ptcb->tcb_rnext) >= 0 && (slast - wlast) <= 0;

The above is correct, but opaque. Note the use of "|=" to combine two
boolean conditions.