Bugs found while compiling and running XINU. MASTER copy of bugs, dated 29. May 2001 Document written by: Tom Evans Contact info: InitialSurnameAt tennyson.com.au (to prevent spam robots from reading my email address) NOTE Version E3 has not been checked at this stage (apart from bugs T.18-25). BOOKS ----- The XINU TCP code is documented in "Internetworking with TCP/IP Volume II. The sources corresponding to the first edition are available as ftp://ftp.cs.purdue.edu/pub/dls/xinu7.9.tar.Z Code approximating the second edition is available as ftp://ftp.cs.purdue.edu/pub/Xinu/XINU-PENTIUM.TAR.Z New sources corresponding to the next edition of the book are available as: ftp://ftp.cs.purdue.edu/pub/dls/v2e3ALPHA_tar.gz I will refer to these as the "E1" (Edition 1), "E2" and "E3" versions. The second version is the one to use for anyone working with this code. If porting to a MC68xxx, start with the E2 code and replace the assembly and driver code with that from the E1 version. Then replace as much as possible from the E3 version. VERSION DIFFERENCES ------------------- There are two minor versions of the E1 code. The "xinu7.9" sources correspond to the first edition with modifications documented in the "v2.errata" file. The original files are available in the "BOOK.OLD" directory. Datestamps on the source files indicate that some of the errata items were added between late 1991 and April 1993. There are major differences between E1 and E2. Specifically the Urgent Data handling is different (simpler and better) in E2, and OSPF and Multicasting have been added. There are probably other changes. E1 is a Sun-3 port with drivers for Sun3 console, LANCE ethernet and contains 68000 assembly code. E2 is a 80x86 port with drivers for PC hardware supporting Intel EtherExpress and SMC Ultra ethernet cards, and contains 80386 assembly. There is no release of code exactly matching the second edition of the book. The "XINU-PENTIUM" code is close, but has 14 TCP files with newer datestamps than the publishing date of the book. The newer files are tcpsync.c, tcpxmit.c, tcp_in.c, tcprmss.c, tcpackit.c, tcpdemux.c, tcplisten.c, tcpreset.c, tcpsend.c, tcpcksum.c, tqdump.c, tcpinp.c, tcpstat.c and tcpbind.c. Differences are usually minor. For instance, tcpcksum() is passed an extra parameter in the newer code. E3 is an ANSI version of the E2 code with bugfixes. OTHER VERSIONS -------------- There are also SPARC, M68K, PDP11, VAX and MAC code sources in "ftp.cs.purdue.edu/pub/Xinu". The SPARC version might contain virtual memory extensions. The M68K file is huge, but only because it includes all object and binary files! The sources are far older than the 7.9 release, and don't even contain TCP/IP. Stay away from them. BUGS IN TCP/IP -------------- Bugs are listed with the file they appear in, the version of code they apply to and an indication of the severity. ---------- T.1. File: tcpd/tcpread.c Version: E1 and E2 Symptom: tcpread() calls complete when they shouldn't Severity: Minor tcpread.c contains: } else if (len > ptcb->tcb_rbcount && ptcb->tcb_flags & TCBF_BUFFER && >>>> (ptcb->tcb_flags & TCBF_PUSH|TCBF_RDONE) == 0) { signal(ptcb->tcb_mutex); goto retry; Operator precedence puts & ahead of |, so the above will NEVER evaluate to TRUE and tcpread will never block until a read buffer is filled. ---------- T.2. File: ip/ipfhcopy.c Version: E1 and E2 Symptom: May crash fragmenting IP datagrams. May generate bad IP fragments. Severity: Major ipfhcopy.c contains the code: if (otype & IPO_COPY) { blkcopy(&pepto->ep_data[hlen], pepfrom->ep_data[i-1], olen); hlen += olen; That doesn't copy the DATA, but copies from where the data is pointing TO (memory locations -128 to +127 or 255), it should be "&pepfrom...". ---------- T.3. File: tcp/tfcoalesce.c Version: E1 and E2 Symptom: May crash or corrupt memory Severity: Major tfcoalesce contains the code: struct tcpfrag *tf; ... if (ptcb->tcb_rnext == ptcb->tcb_finseq) goto alldone; ... alldone: do freemem(tf, sizeof(struct tcpfrag)); while ((tf = (struct tcpfrag *) tcpdeq(ptcb->tcb_rsegq)) != NULL); If the first test passes (the most common case) it then goes and calls freemem() on an uninitialised pointer. It should initialise tf to NULL and only free it if it isn't NULL. ---------- T.4. File: tcp/tfinsert.c Version: E1 and E2 Symptom: TCP Sequence number sign problem, may drop session Severity: Major tfinsert contains the code: if (enq(ptcb->tcb_rsegq, tf, -tf->tf_seq) < 0) This inserts the item on the queue using the following in enq(): /* start at tail and move towards head, as long as key is greater */ while(i >= 0 && key > qp->q_key[i]) This can fail when the sequence number changes sign and there is already an item on the queue. The new item is likely to be put in the wrong order in the queue, leading to failure to reassemble or worse. This bug could be fixed by changing the way enq handles keys, but this might break other code that makes assumptions about enq. Otherwise the segmentation code could be rewritten to do its own queuing or it could put items on the queue using offsets from a particular sequence number the way the E1 Urgent code does. Probably the best way to fix this is to change enq() to use the same sequence-number comparison that the TCP code uses. Thus change: while(i >= 0 && key > qp->q_key[i]) { --i; } to while(i >= 0 && (key - qp->q_key[i]) > 0) { --i; } ---------- T.5. File: tcpd/tcpread.c Version: E1 and E2 Symptom: None, unless major changes made during porting Severity: None tcpread contains the code: if (state != TCPS_ESTABLISHED && state != TCPS_CLOSEWAIT) return SYSERR; tcpwrite contains the code: if (state != TCPS_ESTABLISHED && state != TCPS_CLOSEWAIT) return SYSERR; At first glance these appear to be wrong, as it is usually legal in TCP to write but not to read in CLOSEWAIT. The code is correct as the read is required to return the status (that the state machine is in CLOSEWAIT). It is not permitted to read in FINWAIT as there is no Xinu equivalent to the Unix shutdown(), and so it isn't possible to legally be in these states. Anybody adding a shutdown() call to Xinu will have to change these tests. ---------- T.6. File: tcp/tcpsynrcvd.c Version: E1 and E2 Symptom: Could cause problems when multiple requests made on listener Severity: Minor tcpsynrcvd contains the code: if (pcount(pptcb->tcb_listenq) >= pptcb->tcb_lqsize) { TcpAttemptFails++; signal(pptcb->tcb_mutex); return tcbdealloc(ptcb); } When the listen queue is full, the received TCP is deallocated, instead of the session being CLOSED. The Close is never signaled through to the other end. Thus it requires the other end to send data in order to get RST segments. If the client only listens for responses (for instance calling a daytimed) it will wait "for ever". Changing the code to the following fixes the problem. if (tcplenq(pptcb->tcb_listenq) >= pptcb->tcb_lqsize) { TcpAttemptFails++; tcpreset(pep); signal(pptcb->tcb_mutex); return tcptcbdealloc(ptcb); } Except it causes another bug. See T.24. ---------- T.7. File: tcp/tcpabort Version: E1 and E2 Symptom: FINWAIT1 and CLOSING locking up. Severity: Major If there's no response to the FIN sent in FINWAIT1 or CLOSING, tcpabort is called and the socket remains in these states for ever. The same happens if RST or SYN are received in these states. This case (not getting ACK to final FIN) is only handled properly in the LASTACK state. This can be fixed in tcpabort(). In the case of the above two states the socket is forced to TCPS_TIMEWAIT and tcpwait() is called. The following code fixes the bug: Original code: tcpkilltimers(ptcb); ptcb->tcb_flags |= TCBF_RDONE|TCBF_SDONE; Changed to: ptcb->tcb_flags |= TCBF_RDONE|TCBF_SDONE; tcpkilltimers(ptcb); if ((ptcb->tcb_state == TCPS_FINWAIT1) || (ptcb->tcb_state == TCPS_CLOSING)) { ptcb->tcb_state = TCPS_TIMEWAIT; tcptcbfreebufs(ptcb); tcpwaittime(ptcb); } ptcb->tcb_flags |= TCBF_RDONE|TCBF_SDONE; ---------- T.8. This bug seems to have gone missing. I don't know how this happened. ---------- T.9. File: tcp/tcpacked.c Version: E2 only Symptom: TCP can't send data at all Severity: Major Discovered by fwmiller@cs.umd.edu (Frank W. Miller) and sent to comp.os.xinu on 22nd August, 1997: tcpacked contains the following code: if ((ptcb->tcb_flags & TCBF_SUPOK) && SEQCMP(ptcp->tcp_ack, ptcb->tcb_supseq) >= 0) ptcb->tcb_sbstart = (ptcb->tcb_sbstart+acked) % ptcb->tcb_sbsize; As written, it only advances the tcb_sbstart pointer (current starting point in the send buffer) when the send urgent pointer is valid. This looks like an error caused by a misapplied patch. The lack of indentation on the third line indicates that the previous line has been deleted. The missing line (insert after line starting "SEQCMP" above) is: ptcb->tcb_flags &= ~TCBF_SUPOK; Without this line, if the SUPOK flags gets set it never gets cleared. ---------- T.10. File: tcp/tcpsndlen.c Version: E1 and E2. E1 is far worse. Symptom: Low throughput (10-30kbytes/sec) when far-end flow-controlled Severity: Major. Disobeys RFC-1122. In the E2 code, the calculation for the length of data to send is as follows: if (rexmit || (ptcb->tcb_code & TCPF_SYN) *poff = 0; else *poff = ptcb->tcb_snext - ptcb->tcp_suna; datalen = ptcb->tcb_scount - *poff; datalen = min(datalen, ptcb->tcb_swindow); datalen = min(datalen, ptcb->tcb_smss; The amount of data that a TCP is allowed to send is best stated in RFC1122, section "4.2.3.4 When to Send Data" as: The "useable window" [TCP:5] is: U = SND.UNA + SND.WND - SND.NXT i.e., the offered window less the amount of data sent but not acknowledged. The above equation is easier to understand if written as: U = SND.WND - (SND.NXT - SND.UNA) Xinu sends the minimum of the data available to send and the CURRENT send window. It should calculate the OFFERED send window, which is ptcb->tcb_swindow - (ptcb->tcb_snext - ptcb->tcp_suna). Under some circumstances it can overrun the receiver and send beyond the offered window. Changing the second last line to the following improves this: datalen = min(datalen, ptcb->tcb_swindow - *poff); The above code will send zero bytes of data into a closed window when in PERSIST state. In the E1 code the same code is as follows: if (rexmt || ptcb->tcb_sudq == EMPTY) { if (rexmt || (ptcb->tcb_code & TCPF_SYN)) *poff = 0; else *poff = ptcb->tcb_snext - ptcb->tcb_suna; datalen = ptcb->tcb_sbcount - *poff; /* remove urgent data holes */ if (!rexmt) { datalen = tcpshskip(ptcb, datalen, poff); datalen = min(datalen, ptcb->tcb_swindow); } return min(datalen, ptcb->tcb_smss); } In the case where the remote window is full, and TCP is in TCBF_PERSIST state, it is calling tcpsend which calls tcpsndlen with rexmit TRUE. The E2 code returns a datalen of zero and poff of zero. The E1 code returns tcp_smss which is either 1440 or 576. It thus sends 1440 bytes (or however much is pending if less) into a closed window which is against RFC-1122's recommendations. The remote drops the data and transmission is stalled until the sender's retransmit timer goes off. This drops throughput by a factor of 40. The fix for this is to move the line setting datalen against tcp_swindow down one line so it is outside the if statement, and to change it as shown for the E2 code case to add the "offered window" check. ---------- T.11. File: tcp/tcppersist.c Version: E1 and E2 Symptom: Stunningly low throughput (3 bytes/sec) when near-end flow-controlled Severity: Major tcppersist contains the code: if (event != PERSIST) return OK; /* ignore everything else */ The above means that when in PERSIST state, sending one packet every minute, it is also limiting itself to sending only one ACK per minute. When a segment arrives that require an ACK to be sent, tcpdodat() sets TCBF_NEEDOUT, and this causes tcpdata() to call tcpkick(), which puts a SEND event on the TCP output port. This is fine unless the transmitter is in PERSIST mode, in which case the above code ignores the SEND. If the PERSIST timer is wound up to its maximum value, this means that only one ACK gets sent per minute, and only one segment is able to be received per minute. This also does wonders to the sender's RTT estimation (it is now 1 minute). Instead of a megabyte/second I've measured 200 bytes/MINUTE. Allowing SEND in tcppersist() fixes this, as long as the persist timer is only increased on PERSIST events. This is not a complete fix as it doesn't handle RETRANSMIT timeouts properly. ---------- T.12. File: tcpd/tcpwr.c Version: E1 and E2 Symptom: Throughput lower than necessary. May give "silly windows syndrome" with other TCP's that don't have receiver-side avoidance code. Severity: Minor tcpwr is documented in the E3 book at 17.7.3 as performing "silly windows avoidance". The code performs: if (isurg || ptcb->snext == ptcb->tcb_suna) tcpkick(ptcb); This is "Congestion Control" as documented by Nagle in RFC-896 but doesn't include SWS as documented in RFC-1122 (see next bug). The code will run faster if it calls tcpkick when it can send a full-sized segment as recommended by RFC-1122. ---------- T.13. File: tcp/tcpsend.c Version: E1 and E2 Symptom: May give "silly windows syndrome" with other TCP's that don't have receiver-side avoidance code. Severity: Minor (disobeys "MUST" item in RFC-1122) The transmitting code doesn't perform Sender SWS-Avoidance, as is required by RFC-1122. It will send tinygrams whenever the receiver window opens. It thus depends on the receiver performing receiver SWW avoidance. ---------- T.14. File: tcp/tcpgetdata.c Version: E1 and E2 Symptom: Serious drop in throughput, down to 2 kbytes/sec when receiver is flow-controlling sender. Severity: Minor (disobeys "MUST" item in RFC-1122) The following code in tcpgetdata sends an ACK when the window opens: /* * open the receive window, if it's closed and we've made * enough space to fit a segment. */ if (SEQCMP(ptcb->tcb_cwin, ptcb->tcb_rnext) <= 0 && tcprwindow(ptcb)) { ptcb->tcb_flags |= TCBF_NEEDOUT; tcpkick(ptcb); } It only sends one ACK when the window opens FROM ZERO. When interoperating with BSD-derived code (including Win-NT), BSD's SWS-avoidance code stops it from sending small segments into a closing window until 5 SECONDS have elapsed. Since Xinu doesn't send any ACKs in this case throughput drops to 2k bytes/second. The fix for this involves changing tcprwindow() so that it can be told not to update ptcb->tcp_cwin and to have the above code changed to send ACKs whenever the advertised (in the previous ACK) window increases by 2*MSS or BUFSIZE/2. ---------- T.15. File: tcp/tcprwindow.c Version: E1 and E2 Symptom: Minor decrease in performance. Severity: Minor. Change makes code match RFC-1122. tcprwindow contains the code: /* * Receiver-Side Silly Window Syndrome Avoidance: * Never shrink an already-advertised window, but wait for at * least 1/4 receiver buffer and 1 max-sized segment before * opening a zero window. */ if (window*4 < ptcb->tcb_rbsize || window < ptcb->tcb_rmss) window = 0; If the above is changed from "||" to "&&" (and the comment changed from "buffer and" to "buffer or") it matches RFC1122 and sends window updates earlier. ---------- T.16. File: tcp/tcpsend.c Version: E1 and E2 Symptom: Code can't send sensible Urgent Data. Severity: Minor. tcpsend contains the code: if (ptcb->tcb_flags & TCBF_SUPOK) { unsigned short up = ptcb->tcb_supseq - ptcp->tcp_seq; if (up >= 0) { ptcp->tcp_urgptr = up + 1; /* 1 past end */ ptcp->tcp_code |= TCPF_URG; } else ptcp->tcp_urgptr = 0; } else ptcp->tcp_urgptr = 0; "unsigned short up" can never be less than zero, so if urgent data is ever sent the pointer will always be set to something, even when not legal. Together with a previous bug which never cleared the TCBFSUPOK flag, once set it always reports urgent data, as far as 65k ahead of the current sequence number. The variable must be long and must be compared against the correct limits for a 16-bit pointer. ---------- T.17. File: tcp/tfinsert.c Version: E1 and E2 Symptom: Compiler warnings Severity: Minor tfinsert is defined as: int tfinsert(ptcb, seq, datalen, gotfin) tfinsert is defined with four parameters, 4th not used, only called with three. ---------- T.18. File: tcp/tcpopts.c Version: E1 and E2, fixed in E3 Symptom: Lockup with Windows 98 Severity: Major Tcpopts will go into an infinite loop if any TCP options other than the ones defined in RFC793 are received (options 0, 1 and 2). Options documented in RFC1700 now go up to 18, and Windows 98 seems to use Option 4 (SACK Permitted). This is fixed in E3 by the addition of the code between "default" and "break" below: default: popt++; /* skip option code */ if (*popt > 0 && *popt <= popend - popt - 1) popt += *popt - 1; else popt = popend; /* bogus option length */ break; ---------- T.19. File: tcp/tcpfin1.c and tcpclosing.c Version: E1, E2 and E3 Symptom: TCP locks up in FINWAIT2/PERSIST, can corrupt files Severity: Major The symptom is that TCP sockets sometimes lock up in FINWAIT2 and PERSIST states. This is "impossible" as PERSIST means there is more data to transmit and FINWAIT2 means that all data including the FIN has been acknowledged. I have had FTP sessions doing this, and also transferring corrupted (short) files while not locking up. Summarising the "expected sequence of events": tcpclose() sets TCBF_SNDFIN and transitions to FINWAIT1. tcpsend() sees TCBF_SNDFIN and sets TCPF_FIN. tcpfin1() calls tcpacked() which clears both TCBF_SNDFIN and TCPF_FIN. tcpfin1() waits until TCPF_FIN is cleared and transitions to FINWAIT2. tcpclosing() also waits on TCPF_FIN and transitions to TIMEWAIT. The above sequence assumes that tcpclose() always calls tcpsend() (via tcpkick(), psend(), signal(), tcpout() and tcposwitch()). It also assumes that tcpsend() will immediately set TCPF_FIN. The first assumption can be broken, but only by changing the process priorities or by changing the implementation (running the TCP code on a different operating system). It may also fail if the stack is using delayed acks. The second assumption fails when TCP is sending into a closing (or closed) window and doesn't send the FIN or set TCPF_FIN. This is how it gets into FINWAIT2 *and* PERSIST. tcpsend() only sets TCPF_FIN if the following test passes: if ((ptcb->tcb_flags & TCBF_SNDFIN) && SEQCMP(ptcp->tcp_seq+datalen, ptcb->tcb_slast) == 0) ptcb->tcb_code |= TCPF_FIN; The fix is to change the test in tcpfin1() and tcpclosing() to test for both flags. There are three places: tcpclosing(): if (((ptcb->tcb_code & TCPF_FIN) == 0) && >>> ((ptcb->tcb_flags & TCBF_SNDFIN) == 0)) { ptcb->tcb_state = TCPS_TIMEWAIT; tcptcbfreebufs(ptcb); tcpsignal(ptcb->tcb_ocsem); /* wake closer */ tcpwaittime(ptcb); } tcpfin1(): if (ptcb->tcb_flags & TCBF_RDONE) { /* FIN received from remote */ if ((ptcb->tcb_code & TCPF_FIN) || /* Our FIN not ACKed*/ >>> (ptcb->tcb_flags & TCBF_SNDFIN)) { ptcb->tcb_state = TCPS_CLOSING; } else { ptcb->tcb_state = TCPS_TIMEWAIT; tcptcbfreebufs(ptcb); tcpsignal(ptcb->tcb_ocsem); /* wake closer */ tcpwaittime(ptcb); } } else if (((ptcb->tcb_code & TCPF_FIN) == 0) && >>> ((ptcb->tcb_flags & TCBF_SNDFIN) == 0)) { tcpsignal(ptcb->tcb_ocsem); /* wake closer */ ptcb->tcb_state = TCPS_FINWAIT2; } ---------- T.20. File: netwrite.c Version: E1, E2 and E3 Symptom: Buffer leak in IP Severity: Minor (unless CLASSD addresses are being generated) The netwrite() function is meant to dispose of all buffers sent to it. If it (or any of the routines it calls) fails to deliver a buffer it returns SYSERR. In the following case it seems to get this wrong: if (IP_CLASSD(pep->ep_nexthop)) { restore(ps); return SYSERR; } The above code should be replaced by: if (IP_CLASSD(pep->ep_nexthop)) { freebuf(pep); restore(ps); return SYSERR; } Note that "Class D" addresses are used by IGMP. It looks from the above code that IGMP may be "present but unsupported", as it looks as if all Multicast packets sent through netwrite() will be dropped. ---------- T.21. File: tcp/tcpdemux.c Version: E1 and E2 Symptom: TCP won't work with source TCP ports over 32767 Severity: Major The port in a TCP packet is declared to be unsigned, but the stored port number in the TCP TCB is signed, as follows: tcp.h: unsigned short tcp_sport; /* source port */ tcp.h: unsigned short tcp_dport; /* destination port */ tcb.h: short tcb_rport; /* remote TCP port */ tcb.h: short tcb_lport; /* local TCP port */ The code in tcpdemux.c that demultiplexes incoming TCP packets to the correct socket is: if (ptcp->tcp_dport == tcbtab[tcbn].tcb_lport && ptcp->tcp_sport == tcbtab[tcbn].tcb_rport && blkequ(pip->ip_src, tcbtab[tcbn].tcb_rip, IP_ALEN) && blkequ(pip->ip_dst, tcbtab[tcbn].tcb_lip, IP_ALEN)) { break; } If the source port is negative the above comparison fails, and TCP responds with a RESET segment for each packet after the SYN packet. Ciscos tend to use the high source port numbers. Changing the definitions for tcb_lport and tcb_rport in tcb.h fixes the bug. This has already been done in the E3 code. ---------- T.22. File: ip/ipreass.c Version: All Symptom: IP Reassembly doesn't follow RFC791 Severity: Minor Spotted by Charles Esson . RFC791 calls for the Datagram ID used during reassembly to be the Source and Destination IP Addresses, the Protocol field and the ID. Xinu ignores the Protocol and the Destination IP address. This is unlikely to cause problems in practice. ---------- T.23. File: ip/ipfhcopy.c Version: All Symptom: IP Fragmentation can corrupt header length Severity: Major Spotted by Charles Esson . The ipfhcopy() code conditionally copies options from one IP packet to another (pepfrom to pepto). It never updates the IP Header Length field in the packet, and neither does any other code in Xinu. Thus if an option isn't copied the packet will be corrupted as the data in the packet starts where the options finished, but the IP header still has the old (longer) length in it. ---------- T.24. File: tcp/tcpsynrcvd.c Version: All Symptom: TCP rejects connections to LISTEN socket under load. Severity: Major Sockets opened as "listeners" have a connection queue, with a default depth of 5. Other TCPs (Linux, BSD) simply ignore incoming SYN requests when the queue is full. Xinu will accept an "unlimited" number of SYNs, returning SYN-ACKs and transitioning the sockets to the SYNRECVD state. Sockets are not put on the queue at this stage. It is only when the next ACK arrives that tcpsynrcvd() attempts the enqueue the socket, transition it to ESTABLISHED and signal the user to ACCEPT the connection. If the queue is full, the socket is closed and a RST segment is sent. This closes sockets that the remote end considers to be "fully open". This can best be fixed by moving this check from tcpsynrcvd to tcplisten, which on receipt of a SYN has to manually count all of its "child" sockets in TCPSYNRCVD state and add the queue depth. If the queue size is exceeded (for all the currently queued sockets plus the ones in TCPSYNRVCD state which will be queued later), the packet is ignored so the client will retry. This bug was caused by the "fix" in T.6, but prior to T.6 it did something worse. ---------- T.25. File: tcp/tcpkick.c Version: All Symptom: TCP Delayed ACK code broken Severity: Major Delayed ACKs are off be default in the XINU code. This disobeys RFC 1122. If enabled, TCBF_DELACK delays all ACKS by 1/5 second, and prevents ACKS from being sent for every second segment. This would limit throughput to one window-full for every 1/5 second (or 128 kbytes/second). TCBF_DELACK also delays all outbound DATA packets for 1/5 second as well. ---------- T.26. File: tcp/tcpostate.c Version: All Symptom: Sockets lock up in CLOSING state Severity: Major The tcpostate() function contains the following code: if (ptcb->tcb_sbcount == 0) { ptcb->tcb_ostate = TCPO_IDLE; return OK; } This sends the transmitter IDLE if all the data has been acked. It does not check to see if the FIN is waiting to be acked. Thus a socket that sends data, closes, sends a FIN, receives a FIN and then had the data (but not the FIN) acked will end up in CLOSING state but with its transmitter IDLE. It remains there forever as it never retransmits the FIN and the TIMEWAIT timer isn't running either. The test should be: if ((ptcb->tcb_sbcount == 0) && ((ptcb->tcb_code & TCPF_FIN) == 0)) { ---------- T.27. File: tcp/tcpiss.c Version: All Symptom: Initial sequence number selection bad. Severity: Minor RFC 793 recommends that the initial sequence number be generated from a 32-bit clock incrementing every 4 microseconds. RFC 1948 recommends cryptographic generateion of these numbers to defend against sequence number attacks. Xinu starts with "clktime" and increments by 904 for each connection. It should use the clock for every connection and should also add some randomness. ---------- BUGS IN PARTS OTHER THAN TCP/IP ------------------------------- X.1. File: shell/x_snmp.c Version: E1 and E2 Symptom: SNMP code may fail to detect EOF or may fail on DELETE Severity: Minor x_snmp.c contains the code: char ch; if ((ch = getc(stdin)) == EOF) Variable "ch" should be int. It may not match EOF or may match on DELETE. ---------- X.2. File: rfs/mount.c Version: E1 and E2 Symptom: May crash on some input Severity: Minor mount.c contains the code: if (prefix == NULL) prefix == NULLSTR; if (replace == NULL) replace == NULLSTR; The second sets of "==" should be "=". ---------- X.3. File: shell/x_ifstat.c Version: E1 Symptom: Compiler warning Severity: Minor x_ifstat.c contains the variable: char *str[80]; That's meant to be (and is used as) "char str[80]". XINU TASK SWITCHING AND INTERRUPTS PROBLEMS ------------------------------------------- This documents what the E1 (Sun3) code does. I don't know if the PENTIUM (E2 dated) code is the same, but it looks like it is (only perhaps worse). It looks like the Xinu scheduling system only allows 10 "events" to run per second. This may explain why the some of the TCP throughput bugs weren't found as they were concealed by scheduling problems. ## Note I'm not 100% sure about the way the code works. I've already ## corrected these notes twice after people have pointed out ## misunderstandings. if I've got something wrong PLEASE tell me. Interrupt service routines usually operate as follows: 1. A hardware interrupt occurs, 2. The Interrupt Service Routine runs, and then 3. Reads data from hardware and stores in a structure shared with the rest of the driver, and then 4. Signals the operating system to schedule the driver code, which 5. Runs at non-interrupt time, handles the data and signals the operating system to schedule the user code that wants the data. Xinu supports the above scheme by using signals to schedule the other parts of the code. This will work as expected AS LONG AS the interrupt service routine process priority is HIGHER than the driver priority. If it is the same or lower, the signal will call-through (will immediately run) the mainline driver code, resulting in that code running DURING the interrupt-service routine. Worse, when the context switches, the saved status register is restored and this ENABLES HARDWARE INTERRUPTS again. This is from a call in the middle of an interrupt service routine. This can't possibly work properly. Xinu calls ready() with RESCHYES from signal(). Thus signal() will switch to the signaled process if it is the same or higher priority. The Ethernet interrupt service routines set themselves to NETIPRI which matches the network priority NETPRI. Thus ethinter() calls ethdemux() which calls ni_in() calls ip_in() which calls send() which calls ready(RESCHYES) which runs (say) TCP or IP. Both TCP and IP are created with IPPRI and TCPPRI, which are also 100, so the IP and TCP code will ALSO run in the interrupt context. When does this stop? The user code runs under the shell, which is run at SHCMDPRI (20). Thus the Ethernet interrupt service routine has run code all the way to TCP, which may then transmit a frame back through TCP, IP and the Ethernet code, all under the Ethernet receive interrupt context. When does the user code run? It can't get scheduled from the idle task as idle never gives up the CPU. User code has to wait until the next non-clock interrupt runs, or until the clock interrupt preemption code notices nothing has happened for a whole second (1/10 second on PENTIUM code) and calls resched(). If a process is woken up by the timer (through wakeup()) then resched() is run by this and things will fly again. Any other interrupt (serial, ethernet, keyboard) will "accidentally" cause user code to run as part of their signaling to the other half of the drivers. It is likely that user code, TCP, IP and Ethernet driver code will run under a serial port context, blocking all following serial port interrupts until control returns to the interrupt routine (it of all interrupt routines doesn't reenable hardware interrupts until it exits). No instructions can ever run that aren't run under an interrupt context because of the above priorities and the following code in nulluser: while (TRUE) { /* run forever without actually */ pause(); /* executing instructions */ } The above behavior could be fixed if all interrupt service routines changed their runtime priorities to something very high (say 1000) and the idle loop (nulluser) was changed to: while (TRUE) { /* run forever without actually */ pause(); /* executing instructions */ disable(ps); resched(); restore(ps); } With this change the interrupt code would run until the first send(), signal() or resched() and then return to nulluser() where the above resched() would immediately run all the appropriate processes. This would minimise the time spent in interrupts, and spent with interrupts being locked out by other interrupts, and would result in most of the code being switched from the idle context. NOTE I used to suggest calling chprio() to change the priority of the interrupt service routine, but chprio() calls resched() and that has unwanted side effects. Directly modifying "proctab[currpid].pprio" in the interrupt routine seems to work: > Subject: Re: resched() analysis... > Date: Fri, 04 Sep 1998 07:55:49 -0500 > From: Tom McDermott > Organization: Alcatel Network Systems > Newsgroups: comp.os.xinu > > Have implemented the code: > > while(TRUE) { > disable(ps); > resched(); > restore(ps); > } > > and that does the trick! That, along with the changing of > priorities, using direct manipulation of the proctab structure has the > PING performance where it should be -- about 2-3 milliseconds. Seems > stable, no lock-ups or crashes after one-two hours of testing. It looks like the Pentium Ethernet code fails to change the priority at all. This gives the same effective result as in the SUN3 case. There's no code in the Pentium source that changes the priority. The Pentium idle process is simpler than the SUN one. It is: while (TRUE) /* empty */; It executes the above loop all the time it isn't servicing interrupts. As with the SUN3 one it should be calling resched(), preferably after calling pause(). The following excerpts illustrate the problems with the scheduling system that led to the above change recommendations. Both were caused by the interrupt service routines running at a low (zero) priority and then calling through to user code that blocked, leaving the ready queue empty. Tom McDermott [mcdermot@aud.alcatel.com] (Mon 24/08/98 10:45 PM) stated (concerning adding the resched() in the idle task): > I tried this, but it introduces a problem in that after some thousands > of passes through the resched loop, resched encounters a pass when the > readylist is empty - this of course instantly crashes XINU. > My guess is that there is a bug somewhere else in XINU that allows > the condition of an empty readlist, and the pause(); kludge seems to > solve it. Any experience with this? This seems to be the same as... Frank Miller (fwmiller@cs.umd.edu, http://www.cs.umd.edu/~fwmiller) has spotted/suffered a problem where an interrupt service routine interrupts the IDLE task and calls through to code that BLOCKS. There is now no routines on the READY list and the code crashes unless special code is added to cope with this case. ---------- CODING PRACTICES ---------------- These are practices that GCC complains about. These aren't bugs, but this is how I found most of the real ones. All of these are in the E1 sources. tcpd/tcpcon.c:42: if (error = ptcb->tcb_error) tcpd/tcpmopen.c:34: if (error = tcpcon(ptcb)) tcp/tcprmss.c:29: ptcp->tcp_offset = ((hlen<<2) & 0xf0) | ptcp->tcp_offset & 0xf; The code is OK, but it is necessary to add brackets to shut gcc up... tcp/tcprtt.c rrt = tmclear(tcps_oport, MKEVENT(RETRANSMIT, ptcb-&tcbtab[0])); The definition of MKEVENT in h/tcpfsm.c: #define MKEVENT(timer, tcb) ((tcb<<3) | (timer & TMASK)) Gcc complains about the resulting "ptcb - &tcbtab[0] << 3" The definition of MKEVENT needs ()s around parameters. tcpok.c rv = (ptcp->tcp_seq - ptcb->tcb_rnext) >= 0 && (ptcp->tcp_seq - wlast) <= 0; rv |= (slast - ptcb->tcb_rnext) >= 0 && (slast - wlast) <= 0; The above is correct, but opaque. Note the use of "|=" to combine two boolean conditions.