Installing Scyld Beowulf on a Cluster of Computers - An Example
Scyld Computing Corporation release is based on the Beowulf Distributed Process Space (BPROC) that implements a cluster configuration using fast-ethernet interconnects, and single master and several slave nodes that may or may not be headless. Of the several features that come with the Scyld Beowulf Scalable Computing Release is the fact that the installation of BPROC and a couple of cluster monitoring utilities is automated. They also provide an optimized version of the Beowulf MPI (Message Passing Interface) based on MPICH, called BeoMPI. There is also a parallel job creation utility called MppRun. This is built against the BPROC system and is used to run programs that have not been parallelized internally with the BPROC Machine Information Calls, Process Migration Calls, and System Management Calls. The only changes made in BeoMPI to MPICH are those that take advantage of the BPROC system, and thus porting an existing MPI application based on MPICH is simplified. Some additional features in the Scyld Beowulf implementation include performance counter support and integration of additional nodes in a running cluster. The present release is packaged with Linux (Red Hat 6.2) and is for both x86 and Alpha platforms.
Standard Beowulf clusters still look like a network of workstations (NOW’s). Once logged into the front end of the cluster, the only way to start processes on other nodes in the system is via rsh. MPI and PVM hide this detail from the user but it's still there when either of them starts up. Cleaning up after jobs is often made tedious by this as well. (Especially when the jobs are misbehaving.) The bproc distributed PID space (bproc) addresses these issues by providing a mechanism to start processes on remote nodes without ever logging into another node and by making all the remote processes visible in the process table of the cluster's front end node. The hope is that this will completely eliminate the need for people to be able to login on the nodes of a cluster. BPROC introduces a distributed process ID (PID) space. This allows a node to run processes which appear in its process tree even though the processes are physically present on other nodes. The remote processes also appear to be part of the PID space of the front end node and not the node which they are running on. The node which is distributing its pid space is called the master and other nodes running processes for the master are the slaves. Each PID space has exactly one master and zero or more slaves. Each PID space corresponds to a real PID space on some machine. Therefore each machine can be the master of only one PID space. A single machine can be a slave in more than one PID space. Remote processes on are represented on the master node by "ghost" processes. These are kernel threads like any other kernel thread on the system (i.e. nfsiod, kswapd, etc). They have no memory space, open files, or file system context but they can wake up for signals or other events and do things in kernel space. Using these threads, the existing signal handling code and process management code remains unchanged. Ghosts perform these basic functions: (i) Signals they receive are forwarded to the real processes they represent. Since they are kernel threads, even SIGKILL and SIGSTOP can be caught and forwarded without destroying or stopping the ghost. (ii) When the remote process exits it will forward its exit code back to the ghost and the ghost will also exit with the same code. This allows other processes to wait() and receive meaningful exit status for remote processes. (iii) When a remote process wants to fork, it will need to obtain a PID for the new child process from the master node. This is obtained by asking the ghost process to fork and return the PID of the new child ghost process. (This also keeps the parent-child relationships in sync.) (iv) When a remote process waits on a child, the ghost will do the same. This prevents accumulation of ghost zombies and keeps the process trees in sync. The PID masquerading modifications make it possible for a process to appear as though a process exists in a different PID space. Processes are still part of the single PID space that we're used to, but the PID related syscalls (getpid, getppid, kill, fork, wait) have been modified to treat their arguments differently and to give different responses for processes that have been tagged as masqueraded. PID masquerading also introduces a user space daemon to control some of the PID related operations normally done by the kernel. Each daemon will define a new "PID space" and can create new processes in that space. Operations such as new PID allocation are handled by this daemon. (In the case of a masqueraded process forking a new masqueraded PID will be needed for the child process. This request gets sent out to the user space daemon which will forward it to the master node. The ghost process there will fork and return the child PID it gets back to the slave on the node. The child's new masqueraded PID is set to that PID.) The PID related syscalls will only operate on other masqeraded processes that are in the same PID space (that is they're under control of the same daemon.) Other non-masqueraded processes on the local system become effectively invisible as a result. Signals (kill(2)) that cannot be delivered locally are bounced out to the user space daemon for delivery. This is used in conjunction with ghosts to make it appear as though a piece of the master node's PID space has been moved onto the slave node. When creating remote processes, there are really 2 processes created, a ghost on the master to represent the remote process and the real process on the slave node. The (masqueraded) process on the slave gets the same PID as the ghost on the master node. The daemon controlling the masqueraded process space in forwards requests from the real process back to the master node. This way any operations the real process performs (fork, kill, wait, etc) will performed in the context of the master node's process space. Most requests will be satisfied by the ghost thread. There are two basic ways to start a process in this scheme. The simpler one is the rexec (remote execute) which takes the same arguments as execve plus a node number. This inteface also has roughly the same semantics as the execve system call. This doesn't involve transfering much data, but it does require that all binaries and any libraries they require be installed on remote nodes. The other interface which bproc provides is a "move" or "rfork" interface. This works by saving a process' memory region and recreating it on the remote node. This has the advantage that it can transport the binary and anything mmap'ed (such as the dynmaically linked libraries) to the remote node. This could allow a great reduction in the size of the software required to be installed on a node.
Programs that use the BProc library should contain the line #include
<sys/bproc.h> and be linked against the BProc library by
adding -lbproc to the linker command line.
Machine Information Calls: The BProc library provides the following interfaces for finding information about the configuration of the machine. These interfaces may be used from any node on the cluster.
int bproc_numnodes(void) Returns
the number of nodes in the system. This is the number of slave nodes (not
including the front-end machine). The nodes are numbered 0 through N-1. This
function returns -1 on error.
int bproc_currnode(void)This
call returns the node number on which a process is currently running. -1
indicates that the process is running on the front-end machine.
int bproc_nodestatus(int
node) This function is for use on SLAVE NODES ONLY - not the
front-end machine, since it is always `up'. Returns the status of node number
given node. This function returns -1 on error and errno will be set
appropriately. The value returned is one of the following:
bproc_node_down
bproc_node_unavailable
bproc_node_error
bproc_node_up
int bproc_nodeaddr(int node,
struct sockaddr *addr, int *size)This call
saves the IP address of node in the sockaddr pointed to by
addr. The size parameter should be initialized to
indicate the amount of space pointed to by addr. On return it
contains the actual size of the addr returned (in bytes). This
function returns 0 on success and -1 on failure.
int bproc_masteraddr(struct sockaddr
*addr, int *size)This call is equivalent
to bproc_nodeaddr(-1, addr,
size) Process Migration Calls:
int bproc_rexec(int node, char
*cmd, char **argv, char
**envp)This call has semantics similar to execve. It replaces the current process with
a new one. The new process is created on node and the local process
becomes the ghost representing it. All arguments are interpreted on the remote
machine. The binary and all libraries it needs must be present on the remote
machine. Currently, if remote process creation is successful but exec fails, the
process will just exit with status 1. If remote process creation fails, the
function will return -1.
int bproc_move(int node)
This call will move the current process to the remote node number given
by node. Returns 0 on success, -1 on failure.
int bproc_rfork(int
node) The semantics of this function are designed to
mimic fork except that the child
process created will end up on the node given by the node argument.
The process forks a child and that child performs a bproc_move to
move itself to the remote node. Combining these two operations in a system
call, prevents zombies and SIGCHLD's in the case that the fork is successful
but the move is not. On success, this function returns the process ID of the
new child process to the parent and zero to the child. On failure it returns
-1.
int bproc_execmove(int node,
char *cmd, char **argv, char
**envp)This function allows execution of local
binaries on remote nodes. BProc will start the binary on the current node and
then move it to a remote node, before the binary gets running. NOTE:
This migration mechanism will move the binary image but not any dynamically
loaded libraries that the application might need. Therefore any libraries that
the application uses must be present on the remote system. System Management Calls: The system management calls are made by
programs like bpctl to control the
machine state. These calls are privledged and not useful to normal applications.
int bproc_slave_chroot(int
node, char *path) This call requests the
slave daemon to perform a chroot. This
call returns 0 on success and -1 on failure.
int bproc_setnodestatus(int
node, int status) This call sets the
status of a node. See bproc_nodestatus for information regarding
permissible node states. It is not possible to change the status of a node
which is marked as down.
The process migration using BRPOC's VMADump typically takes a few nano seconds. This results in ease of deploying the jobs to all the nodes, without having to distribute the executibles and libraries. True a script could be written to do this, but again the Scyld implementation of BPROC or BPROC with bWatch also offers a unique way to monitor the nodes. Finally while this is not recommended the nodes could be headless being booted via floopy. Once the RARP requests are completed each node is "up" and available for computing. Future designs may use this feature to create extremely compact clusters with a very large number of CPU's assuming of course the CPU's are as cheap as memory.
Installing Scyld Beowulf on a Cluster of Computers - An Example
Scyld Beowulf is installed on a cluster of computers with the following configuration:
1 Master 3 Slave Nodes: Dual Pentium PIII 800 MHz CPU (i440 GX), 256 MHz PC100 SDRAM, 10 GB ATA 66 Hard Drive, ATI 8MB PCI Video (on Master only), 48 X CDROM (on Master only), 1.44 MB Floppy Drive, 100 Mbs/Sec Intel Ether Express Pro Fast-Ethernet, 3COM HUB.
The CDROM used for the install was the standard SCYLD BEOWULF SCALABLE COMPUTING BOOT/INSTALLATION CD Version 27Bz-7 available from http://linuxcentral.com/. The master must have two NIC's (Network Interface Cards) and the slave at least one. In this case all the hard drives were not partitioned.
First the Master is configured followed by the Slaves. Insert the CDROM in the CDROM drive and boot up (Remember to set the BIOS to boot first from the CDROM, then FLOPPY, then IDE.). Type "install" at the prompt. After booting you are presented with an option (the standard Red Hat Linux 6.2 interface modified with Scyld Logos - BPROC is actually a series of patches to the Linux kernel) of selecting either the standard Scyld Beowulf or custom installation. Select the standard Scyld Beowulf installation. Then you will be prompted if you want to manually partition the drive. This option was not selected. The you will need to specify the network parameters for eth0 (external internet card) and eth1 (internal slave card) for the master. The IP address of eth1 was set at 10.0.0.1 and the range of the slave IP adresses were to be 10.0.0.2 to 10.0.0.4. After this the standard installation completes and you finally reboot. Remember to create atleast one user account with the root account. On reboot login as "root" and type "startx". This launches GNOME, with the Scyld Beowulf Manual, BeoSetup, and Scyld Beowulf Ststus Monitor Windows. BeoSetup and the Status Monitor Windows are shown below.

At this stage the first window is empty and the second has red crosses instead of green ticks. At this time one can create the slave node boot up floopy disks. To do this first select Settings in the BeoSetup window and select Preferences. Here check for the IP range of the slave nodes. This should be as indicated above. Then select File and Create BeoBoot File option. This creates the boot image. Next select File and Create Node Boot Floopy option. Install a blank floppy disk in the floopy disk drive, and create three such ones in sequence. Remember the first floppy is Node 0, then Node 1, and finally Node 2. Insert each of these in the slave and boot up the slaves in the order Node 0, Node 1, Node 2. After booting the slaves the BeoSetup window will show under Unknown addresses the IP addresses of the unkown nodes. Select each of these in turn and drag and drop them into the Configured Nodes box. After this you will see that the red crosses change to ticks. Now shot down the master and the slave nodes starting with the Node 2, Node 1, and Node 0. Then boot up the master and after it has booted, and you have logged in as root, and started GNOME, boot up the slaves starting with Node 0. You will see the screens as shown above. Now you can run the test program from the terminal window by typing at the prompt > NP=8 mpi-mandel. You should see the screen below.

This completes the Scyld-Beowulf installation.
BProc requires a number of kernel modifications
and modules to be installed. Building BProc from scratch means building a kernel
that includes the BProc modifications. Apply the bproc patch to
your kernel. When configuring the new kernel, select "Yes" to `Beowulf
Distributed Process Space'. See the documentation included with the Linux
kernel for more information about configuring and compiling Linux kernels. After
patching the kernel, it is possible to build the rest of the BProc package by
running `make' in the top level bproc directory. The
Makefile presumes that the kernel tree to build against resides in
`/usr/src/linux'. If this is not accurate, provide make
with the `LINUX=/path/to/linux' argument. First, install
the BProc kernel modules. There are three modules which must be loaded in the
following order: ksyscall.o, vmadump.o and bproc.o. After running
depmod, `modprobe bproc' should load them all. These
modules must be loaded on both the front-end machine and the slave nodes. If
using pre-built kernel packages, run the following to install all the programs
and modules to their proper locations.
> make install
> depmod -a
> modprobe
bproc
Note: BProc daemons require `/dev/bproc' to communicate with
the kernel layer. This is a character device with major number 10, minor number
226.
The master daemon, bpmaster is the central part of BProc system.
It runs on the front-end machine. Once it is running, the slave nodes run the
slave daemon, bpslave to connect to the front-end machine.
bpmaster runs on the front-end machine and handles all the details
of running BProc.
> bpmaster
down Nodes are `down' when they
are NOT connected to the BProc master daemon. It is impossible to do anything
to nodes via BProc when they are in this state.
unavailable When a node is `unavailable', it is connected to the
BProc master but has not yet been tagged as ready for users. While in
this state, the system makes no guarantees about the state of the node.
Unavailable usually means that the node is in some transitional state. The
node may be booting and setting up or it may be shutting down. It is also
possible that the system administrator has manually set the node state to
unavailable to indicate that it should not be used for some other reason.
Nodes also remain unavailable after booting if their hard disks have not been
partitioned. up When a node is tagged as `up', it is available for use by
users.
error A node that does not boot successfully is marked with
`error'. `error' may also indicate an unpartitioned slave node hard disk.
reboot It is possible to `reboot' a node.
halt It is possible to `halt' a node (suspend processing, but
not power off).
pwroff It is possible to remotely power off a node using
`pwroff'.
bpctl program. VMADump is the system used by BProc to take a running process and copy it to
a remote node. VMADump saves or restores a process's memory space to or from a
stream. In the case of BProc, the stream is a TCP socket to the remote machine.
VMADump implements an optimization which greatly reduces the size of the memory
space. Most programs on the system are dynamically linked. At run time, they
will use mmap to get copies of various libraries in their memory
spaces. Since they are demand paged, the entire library is always mapped even if
most of it will never be used. These regions must be included when copying a
process's memory space and again when the process is restored. This is expensive
since the C library dwarfs most programs in size. Here is an example memory
space for the program sleep. This is taken directly from
`/proc/pid/maps'.
08048000-08049000 r-xp 00000000 03:01 288816 /bin/sleep 08049000-0804a000 rw-p 00000000 03:01 288816 /bin/sleep 40000000-40012000 r-xp 00000000 03:01 911381 /lib/ld-2.1.2.so 40012000-40013000 rw-p 00012000 03:01 911381 /lib/ld-2.1.2.so 40017000-40102000 r-xp 00000000 03:01 911434 /lib/libc-2.1.2.so 40102000-40106000 rw-p 000ea000 03:01 911434 /lib/libc-2.1.2.so 40106000-4010a000 rw-p 00000000 00:00 0 bfffe000-c0000000 rwxp fffff000 00:00 0
The total size of the memory space for this trivial program is 1089536 bytes.
All but 32K of that comes from shared libraries - VMADump takes advantage this.
Instead of storing the data contained in each of these regions, it stores a
reference to the regions. When the image is restored, that files will be
mmaped to the same memory locations. In order for this optimization
to work, VMADump must know which files it can expect to find in the location
where they are restored. VMADump has a list of files which it presumes are
present on remote systems. The vmadlib utility exists to manage
this list. Note that VMADump will correctly handle regions mapped with
MAP_PRIVATE, which have been written. VMADump does not specially
handle shared memory regions. A copy of the data within the region will be
included in the dump. No attempt to re-share the region will be made at
restoration time. The process will get a private copy. VMADump does not save or
restore any information about file descriptors. VMADump will only dump a single
thread of a multi-threaded program. There is currently no way to dump a
multi-threaded program in a single dump.
Basic usage information for the binaries that are included with BProc.
bpmaster
bpmaster -hbpmaster -vbpmaster [ -d ]
[ -c c_file ] [ -m m_file ]
bpmaster is the BProc master daemon. It runs on the front-end machine of a
cluster running BProc. It listens on a TCP port and accepts connections from
slave daemons. Configuration information comes from the Beowulf configuration
file. The BProc master daemon reads interface,
iprange, bprocport, allowinsecureports
and logfacility.
Options:
bpslave
bpslave -hbpslave [ -l log ] [ -r ] [ -d ]
[ -m m_file ] [ -v ] masterhostname
port
bpslave is the BProc slave daemon. It runs on slave nodes in a cluster and connects to the front-end machine (masterhostname) to accept jobs through masterport (port).
Options:
bpstat
bpstat -hbpstat -vbpstat
-nbpstat -ubpstat [ -a nodenum
]bpstat [ -s nodenum ]bpstat
-mbpstat -pbpstat -P
bpstat displays various pieces of status information about cluster nodes. The display is formatted in columns specifying node number, node address, node status, user access and group access. This program also includes a number of options intended to be useful for scripts.
Options:
iprange) not the number of nodes that are
up.
bpctl
bpctl -hbpctl -vbpctl -M [ -a ]
commandsbpctl -S nodenum [ -a ] [ -r
dir ] [ -s state ] commands
bpctl is bproc control. Used to apply commands to referenced nodes.
Options:
chroot() to dir.
After doing this, all processes started on a node via BProc will see
dir as their root directory. This command is only usable on slave
nodes.
bpsh
bpsh -hbpsh -vbpsh [-n]
nodenum commandbpsh -a [-n]
commandbpsh -A [-n] command
bpsh is a rsh replacement. It runs command on
nodenum. nodenum may be a single node number, a comma
delimited list of nodes, -a for all nodes which are up or -A for all nodes which
are not down. (Nodes which are not down may be in the error or unavailable
states.)
bpsh will forward standard input, standard output and standard error for the remote processes. Standard output and error are forwarded subject to the flags below. Standard input will be forwarded to the remote processes. If there is more than one remote process, standard input will be duplicated for every remote node.
For a single remote process, the exit status of bpsh will be the exit status of the remote process. Note that this can include non-normal exit status such as being killed by a signal.
When there are many remote processes, bpsh exits with the highest exit status of all the remote processes.
Options:
bpcp
bpcp -hbpcp -vbpcp [ -p ]
f1 f2bpcp [ -r ] [ -p ] f1 ...
fn dir
bpcp copies files between machines. Each file (f1...fn)or directory argument (dir) is either a remote file name of the form node:path, or a local file name (containing no `:' characters).
Options:
vmalib
vmadlib -cvmadlib -a [ libs ...
]vmadlib -d [ libs ... ]vmadlib
-l
This program is a utility to manage the VMADump in-kernel library list.
Options:
bWatch:
After BPROC is installed the installation of bWatch a monitoring utility can be done. Download bWatch-1.0.2.tar. Issue the comand gunzip -c bWatch-1.0.2.tar | tar xovf -. Then in the bWatch-1.0.2 directory check that the path to "wish" is correct. This is usually /usr/bin/ or /usr/local/bin/. This is important since bWatch is a Tcl/Tk script and will not install or run withour "wish". Then type make and make install. After this in the $HOME directory of the root and usr edit the .bWatchrc.tcl so that the entry set listOfhost is as shown below:
set listOfHost {master
node2
node3
node4}
Note if you are using rsh then the .rhosts file in the root and user must use the same names.
Then if you run bWatch.tcl you will see the windows as shown below under GNOME.


This completes the installation of BPROC with bWatch.