What is SCI (Scalable Coherent Interface)?
What is SCI (Scalable Coherent Interface)?
The Scable Coherent Interface (SCI) is effectively a processor memory and I/O bus, high performance switch, local area network (LAN), and optical network. It is an information sharing and information communication system that provides distributed directory based cache coherency for a global shared memory model. It uses electrical or fiber optic point-to-point unidirectional cables of various widths. A single address space is used to specify data as well as its source and destination when being transported. Performance ranges from 200 Mbyte/s (CMOS) to 1000 Mbyte/s (BiCMOS) over distances of tens of meters for electrical cables and kilometers for serial fibers.
SCI has been designed to be interfacable to common buses such as PCI, VME, Futurebus, Fastbus, etc., and to I/O connections such as ATM or Fiber Channel. It works in complex multivendor systems that grow incrementally, a harder problem that interconnecting processors inside a single product like MPP does. Its cache coherence scheme is comprehensive and robust, independent of the interconnect type or configuration, and can be handled entirely in hardware, providing distributed shared memory with transparent caching that improves performance by hiding the cost of remote data access, and eliminates the need for costly software cache maintanence.
The Dolphin PCI-SCI is shown below:

The usual approach of moving data through I/O-channel or network-style paths, requires assembling an appropiate communication packet in software, pointing the interface hardware to it, and initiating the I/O operation, usually by calling a subroutine. When the data arrive at the destination, hardware stores them in a memory buffer and alerts the processor by an interrupt when a packet is complete or the buffers are full. Software then moves the data to a waiting user buffer and, finally, the user applications examines the packet to find the desired data. Typically, this process results in latencies that are tens to thousands of time slower than SCI.
BANDWIDTH COMPARISION : A single 1 Gbyte/s (8000 Mbit/s) SCI link has about the same bandwidth as 64 of today's 155Mbit/s ATM links, or 32 of todays of today's Fiber Channel 256 Mbit/s links, or 800 Ethernets, or 80 100baseT Ethernets.
For applications that don't need SCI's full speed, the standard protocols also support a ring-style connection that shares the link bandwidth among some number of devices, avoiding the cost of a switch. The Dolphin Wulfkit interconnect typically uses this configuration. This is discussed further in Microway Application Note 16 . Either an individual SCI device or an entire ring can be connected to an SCI switch port, giving a user the ability ot trade off cost versus performance over a braod range. Rings can also be bridged to other rings to form meshes and higher dimensionality fabrics.
Since SCI eliminates the need for runtime layers of software protocol-paradigm translation, it reduces the delay of interprocessor comunication by an emormous factor compared to even the latest interconnect technologies that are based on the previous generation of networking and I/O channel protocols. A remote communication in SCI takes place as just a part of a simple load or store opcode execution in a processor. Typically, the remote address results in a cache miss, which causes the cache controller to address remote memory via SCI to get the data, and in a very short time, in the order of a hundreds of processor cycles, the remote data is fetched to cache and the processor continues execution.
When two tasks share data using SCI, the data remains stored in ordinary variables, with ordinary memory addresses, at all times. Thus, processor instructions like Load and Store suffice to access data for doing computation with it. Load and store are highly optimized in all processors, and the underlying SCI transport mechanism is transparent to the user, performing all the network protocol effectively as a fraction of one instruction.
SCI hides the latency of accessing remote data while other techniques we know of for hiding this latency involve using caches to keep copies of data near its users. Since cache entries are copies of data, they are duplicates and can become outdated when the data changes in value. Only SCI handles this problem, with SCI's distributed cache coherence mechanism. This mechanism adds only a bit or so to the command field in SCI's packets. This does not affect unshared cache entries, but when shared cache entries are modified, the coherence protocol quickly locates all the other copies so they can be discarded, and fresh copies can be fetched if those processors still want the data. The distributed protocol does not add to the traffic at main memory, and it so happens to increase the robustness of a system.
The RISC-style protocols SCI uses are simple in nature, which allow SCI links to deliver much higher performance for any given chip technology. SCI is very efficient with large blocks of data wherein hundreds of processors access data from a common shared memory space. Since a built-in DMA engine is used which has a high setup overhead, the interface is only appropriate for large chunks of data. Alternatively, each processor will have its own memory space, and 32 Kbyte block moves can be performed from the sender to a preallocated buffer in the receiver. This, however, results in a lot of handshaking and in lots of network traffic. Suppose smaller data transfers need to be performed (say 128 bytes), then the data from the sender can be deposited into the final destination and will result in a lot less handshaking traffic.
However, SCI is most useful for clustering over local area distances or less. It is least suitable over long distances. Each interface chip can only handle a certain number of concurrently active packets, which are in flight awaiting confirmed delivery to the destination or an intermediate bridge queue. When that number is less than the number of packets that can be in flight in the cables, efficiency drops. This could be compensated by using several SCI chips in series, to provide more queue storage, but for large distances it makes sense to use a wide area networking approach, such as ATM.
As discussed in Microway Application Note 13 Scyld Beowulf Clusters offer a unique way for process migration from the master to the nodes, and also offer a wide array of commands and GUI utilities to configure and monitor the cluster. They also allow for headless nodes that can be booted up via floopies. However the backplane for communication remains fast-ethernet, gigabit ethernet, or myrinet. Thus all the bandwidth limitations due to these interconnect technologies will kick in. As we shall see later (Microway Working Note 16 ) much of what Scyld Beowulf provides is also achievable via NFS and batch queing. However one advantage of Sycld Beowulf clusters is BeoMPI, a MPI version that is optimized for Scyld Beowulf configurations with MppRun a parallel job creation package. In summary as opposed to multiple system images that a standard Beowulf cluster has the Scyld implementation provides a single system image (with the option of having several front ends) using BProc.
In the end the type of configuration best suited for a given problem will depend on the final implementation. If communication bandwidth is a issue then using alternatives like SCI may help tremendously. As with all message passing paradigms the need for synchronous communication is a must. Thus the overheads due to synchronization and scheduling must not be ignored. It may just be the case that some applications that may involve highly asynchronous communications, these overheads can cause a bottleneck, resulting in great demands on the communication bandwidth. In such cases although there may not large amount of data transfer to deal with standard communication protocols like TCP/IP may not help. In such cases again the use of SCI will greatly speed things up. While both SCI and Sycld implementations promise scalability the final descision will depend on the performance of the end user application compiled with BeoMPI or Scali MPI when run on the respective architechture.
A cache is a block of memory that holds frequently used data or data that is waiting for another process to use it. When a process needs information, it first checks the cache. If the information is already in the cache, performance is improved. If the information is not in the cache, it is retrieved from alternate storage and placed in the cache where it might be accessed again. There are several types of caches and applications that use caches:
Processor cache: A processor cache is a block of memory that is part of the processor itself and has very fast access.
Disk cache: A disk cache is located in a computer’s RAM memory. It holds blocks of information from disk rather than whole files. n Client/server cache In a client/server system, large chunks of data are “shipped” to a cache in the client workstation. Data must be synchronized between the client and server to ensure consistency.
Remote cache: Remote users benefit from cached information since it reduces information exchanges across slow links.
Distributed directory caching: Some distributed file systems cache directory information in users’ workstations to improve access.
Intermediate server cache: In a distributed client/server environment, information may be cached from a back-end server to a workgroup server to improve access for local users who access the same information.
Web server/proxy server cache: Web servers cache often-accessed pages to improve access for Internet users. A proxy server caches information for internal users who visit the Internet. The proxy server is situated at the Internet gateway, and it handles all Internet requests for users. It also caches pages that have been accessed for others to use.
In some distributed file systems (such as the Andrew File System), client workstations maintain a cache on a local hard disk (rather than in RAM memory) for information requested from servers. This cache can become quite large, which introduces consistency problems. There are two methods that help overcome this problem. One model has the client constantly checking with the server to see if information has changed, but this adds a great deal of overhead. Another model uses a call-back approach, in which the server informs clients when information they have in their cache is changed by someone else.
What is Mutliprocessor Cache Coherence? The introduction of caches causes a cache coherence problem for I/O operations, since the view of memory obtained through the cache could be different from the view of memory obtained through I/O subsystem. The same problem exists in the case of multiprocessors, because the view of memory held by two different processors is through their individual caches. The figure below illustrates the problem and shows how two different processors can have two different values for the same location.

This is generally referred to as the cache-coherence problem. The protocols to maintain coherence for multiple processors are called cache coherence protocols. The key to implementing a cache-coherence protocol is tracking the state of any sharing of a data block. There are two classes of protocols, which use different techniques to track the sharing status, in use:
Directory Based: The sharing status of a block of physical memory is kept in just one location, called the directory. A directory keeps the state of every block that may be cached. Information in the directory includes which caches have copies of the block, whether it is dirty, and so on. Exisiting directory implementations associate an entry in the directory with each memory block. In typical protocols, the amount of information is propotional to the product of the number of memory blocks and the number of processors. This is not a problem for machines with less than about a hundred processors, because the directory overhead will be tolerable. For large machines, we need methods to allow the directory structure to be efficiently scaled. The methods that have been proposed either try to keep information for fewer blocks (e.g., only those in caches rather than all memory blocks) or try to keep fewer bits per entry. To prevent the directory from becoming the bottleneck, directory entries can be distributed along with the memory, so that different directory accesses can go to different locations, just as different memory requests go to different memories. A distributed directory retains the characteristic that the sharing status of a block is always in a single known location. This property is what allows the coherence protocol to avoid broadcast.
Snooping: Every cache that has a copy of the data from a block of physical memory also has a copy to the sharing status of the block, and no centralized state is kept. The caches are usually on a shared-memory bus, and all cache controllers monitor or snoop in the bus to determine whether or not they have a copy of a block that is requested on the bus.
Both the snooping or directory schemes uses either the write invalidate protocol (the most common protocol in use) or the write broadcast (write update) protocol. The former method ensures that a processor has exclusive access to a data item before it writes that item. It does this by invlidating other copies on a write. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other copies of the item are invalidated. In the latter case updates to all the cached copies of a data item is done when that item is written. To keep the bandwidth requirements of this protocol under control it is useful to track whether or not a word in cache is shared - that is contained in other caches. If it is not, then there is no need to broadcast or update any other caches.
BACK TO MICROWAY APPLICATION NOTE INDEX