Next: Loopback STREAMS Driver Up: Experimental Results Previous: Experimental Results   Contents

DLPI, TLI, and Sockets over Fast Ethernet

Figure 3.1: This figure shows average round-trip time for TLI, Sockets, and DLPI on Solaris over Fast Ethernet.

Figure 3.1 shows a comparison of round-trip latencies for different message sizes using three different STREAMS-based communication APIs (DLPI, TLI, and Sockets) on Solaris over Fast Ethernet. These APIs use different protocol stacks as shown in Figure 3.2.

Figure 3.2: This figure depicts the stacks used in the Solaris DLPI and TLI tests.

In these tests, DLPI is used without any network or transport modules. The user application interfaces with the Stream head via system calls, and the Stream head communicates directly with the Ethernet driver. Both TLI and Sockets use UDP/IP protocol stacks. There has been some debate in the community regarding the relative efficiency of TLI and Sockets [27]. Some have claimed that TLI stacks are faster because Sockets are implemented on top of STREAMS, but this argument does not hold, since both are implemented on top of STREAMS [27]. The results in Figure 3.1 indicate that TLI and Sockets give equal performance, which implies that their respective modules (TIMOD, SOCKMOD) and libraries (XTI/TLI and socklib) add equivalent overhead. The DLPI results in Figure 3.1 provide a baseline for the best performance that can be achieved with the STREAMS subsystem, since there are no intermediate protocol modules between the Stream head and the STREAMS Ethernet driver. The DLPI tests used the putmsg() and getmsg() system calls. On average, round-trip times with DLPI were 38 microseconds lower than with Sockets or TLI.
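The round-trip measurement itself can be illustrated with a small ping-pong loop. The following is only a sketch of the general technique, not the test program used for these results: it uses the Sockets API over UDP on the loopback interface rather than Fast Ethernet, and the function name, message sizes, and iteration count are assumptions chosen for illustration.

```python
import socket
import time


def udp_round_trip_us(msg_size, iterations=200):
    """Return the average UDP round-trip time in microseconds.

    Illustrative only: both endpoints live in one process and talk
    over the loopback interface, so absolute numbers are not
    comparable to measurements taken between two hosts.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the kernel pick free ports for both endpoints.
        server.bind(("127.0.0.1", 0))
        client.bind(("127.0.0.1", 0))
        # Timeouts keep the sketch from hanging if a datagram is lost.
        server.settimeout(2.0)
        client.settimeout(2.0)
        server_addr = server.getsockname()
        payload = b"x" * msg_size

        start = time.perf_counter()
        for _ in range(iterations):
            client.sendto(payload, server_addr)       # request
            data, peer = server.recvfrom(65535)       # server receives
            server.sendto(data, peer)                 # server echoes
            echoed, _addr = client.recvfrom(65535)    # reply arrives
            if len(echoed) != msg_size:
                raise RuntimeError("echoed message has unexpected size")
        elapsed = time.perf_counter() - start
    finally:
        server.close()
        client.close()
    # One iteration is one full round trip.
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    for size in (64, 512, 1024):
        print(size, "bytes:", round(udp_round_trip_us(size), 1), "us")
```

Averaging over many iterations, as done here, smooths out timer granularity and scheduling noise in any single exchange.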

Figure 3.3 shows the throughput and CPU usage obtained using DLPI and TLI over Fast Ethernet. Once again, DLPI incurs less overhead and achieves better throughput overall than TLI. The stripped-down DLPI stack also uses less CPU, as seen in Figure 3.3. Around the 500-byte message size, sender CPU usage for DLPI and TLI differs markedly. The lower CPU usage of DLPI can be attributed to the fact that it does not use the UDP and IP STREAMS modules and also bypasses the TLI system libraries. In addition, because both DLPI and TLI (using UDP) are connectionless, there is no flow-control mechanism in place, and many packets are dropped by the receiver. This explains why sender CPU usage is much higher than receiver CPU usage in both types of tests.
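Because packets are dropped without flow control, throughput seen by the sender and by the receiver can differ. The sketch below shows one way to derive both figures, plus the loss fraction, from message counts. The function names and the sample counts are hypothetical and are not taken from the measurements in Figure 3.3.

```python
def throughput_mbps(messages, msg_size_bytes, elapsed_s):
    """Throughput in megabits per second for a run of fixed-size messages."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return messages * msg_size_bytes * 8 / elapsed_s / 1e6


def loss_fraction(sent, received):
    """Fraction of sent messages that never reached the receiver."""
    if sent <= 0:
        raise ValueError("at least one message must be sent")
    return (sent - received) / sent


if __name__ == "__main__":
    # Hypothetical run: 500-byte messages for 5 seconds.
    sent, received, size, seconds = 100_000, 90_000, 500, 5.0
    print("sender rate  :", throughput_mbps(sent, size, seconds), "Mbit/s")
    print("receiver rate:", throughput_mbps(received, size, seconds), "Mbit/s")
    print("loss fraction:", loss_fraction(sent, received))
```

Reporting the receiver-side rate alongside the sender-side rate makes the effect of dropped packets visible instead of hiding it in a single throughput number.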

Figure 3.3: This figure shows the throughput achieved for TLI, Sockets, and DLPI on Solaris over Fast Ethernet. CPU usage is also shown.

As stated previously, Mentat, Inc. has written a STREAMS implementation from scratch. It should give some idea of how a ground-up STREAMS implementation performs. Mentat provides a port for the Windows NT environment called Mentat Portable Streams (MPS) [18]. This allows one to completely bypass the native Winsock stack and run network code in the NT environment on a STREAMS-based stack. Unfortunately, the MPS product does not provide a STREAMS-based transport or network provider (TCP/UDP/IP modules). Therefore, a direct comparison between DLPI/TLI and Sockets is impossible. UDP/IP tests were run using the Sockets API anyway, for reference, but it should be noted that these tests used the native Winsock stack. TLI faced the same problem of lacking a transport and network provider. However, MPS provides a specialized TLI/XTI module which converts STREAMS messages sent via TLI/XTI to DLPI messages. Figure 3.4 shows the STREAMS-based stack on Windows NT using MPS. User applications can use the TLI/XTI libraries or write code that conforms to the DLPI standard. A specialized DLPI "shim" module sits directly above the network device driver and converts DLPI messages to the Windows Network Driver Interface Specification (NDIS). Note that the transport provider (TPI Module) is shaded because it is not included with MPS.

Figure 3.4: This figure depicts the stacks used in the Windows NT DLPI and TLI tests.

Figure 3.5: This figure shows average round-trip time for TLI, Sockets, and DLPI on Windows NT over Fast Ethernet.

The TLI and DLPI tests were performed on Windows NT with the stacks shown in Figure 3.4. The results are presented in Figure 3.5. It is interesting to see how Mentat Portable Streams compares to the Solaris stack; however, a direct comparison cannot be made between NT and Solaris for two reasons: they were tested on different hardware, and the protocol stacks are not completely identical. Of interest, however, is the overhead incurred by adding one additional processing module to the stack, namely the XTI/TLI module (Figure 3.4), which is responsible for converting XTI/TLI to the DLPI standard. The average overhead incurred by this additional module is around 11 microseconds. Only a general, weak comparison can be made between the DLPI test results for Solaris and NT, since these DLPI STREAMS-based stacks are similar but not identical. The NT DLPI stack includes an additional processing module not present in Solaris (the DLPI "shim" module), so we expect a bit more overhead. This is indeed the case, and Solaris performs somewhat better. The Windows Sockets tests were included for completeness, but offer little information in the context of STREAMS, since they were done using the native Winsock APIs. It is interesting to note that Windows Sockets yield a fixed latency of around 330 microseconds for the message sizes tested over UDP/IP. Most likely, this is a result of the implementation's design, as it is consistent with similar tests done using TCP/IP for various message sizes. More specifically, Winsock exhibits stepwise latency behavior.
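An average per-module overhead such as the figure quoted above can be obtained by comparing two latency curves at matching message sizes. The following sketch shows that arithmetic under stated assumptions: the function name is hypothetical, and the latency values are invented for illustration and are not the measurements plotted in Figure 3.5.

```python
from statistics import mean


def average_overhead_us(baseline_rtt_us, layered_rtt_us):
    """Mean latency difference (layered minus baseline) in microseconds.

    Both arguments map message size in bytes to average round-trip
    time in microseconds. Only sizes present in both runs are compared.
    """
    sizes = sorted(set(baseline_rtt_us) & set(layered_rtt_us))
    if not sizes:
        raise ValueError("the two runs share no message sizes")
    return mean(layered_rtt_us[s] - baseline_rtt_us[s] for s in sizes)


if __name__ == "__main__":
    # Invented example values, NOT the measured results.
    dlpi_rtt = {64: 200.0, 256: 240.0, 1024: 400.0}
    tli_rtt = {64: 210.0, 256: 252.0, 1024: 411.0}
    print("average overhead:", average_overhead_us(dlpi_rtt, tli_rtt), "us")
```

Comparing only message sizes that appear in both runs avoids attributing a size-dependent latency change to the extra module.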


Super-User 2001-05-07