----------------------------------------------------------------------------
                          The Florida SunFlash

                      4.x Diskless Boot Procedure

SunFLASH Vol 23 #16                                           November 1990
----------------------------------------------------------------------------
I have received a number of technical articles from Tony Brooks, a Sun
engineer.  Some articles were written by Tony and others he has collected
from other engineers.  I will be posting these articles to SunFlash over
the next few weeks.  I hope that you enjoy them and find them useful.  If
you have any comments on them, please send them to me and I'll give them
to Tony.  -johnj
----------------------------------------------------------------------------

The following convention is used throughout this document:

'ethers', 'hosts', and 'bootparams' refer to /etc/ethers, /etc/hosts, and
/etc/bootparams if YP is not running on the server; if YP is running, they
refer to the corresponding YP maps.

To further explain the boot process, I will use the following as an example :

         Hostname   IP_address      Hex_IP     Ethernet_address
--------------------------------------------------------------
Server:  batman     129.145.30.15   81911E0F   8:0:20:6:d4:f5
Client:  penguin    129.145.30.27   81911E1B   8:0:20:7:6:b9

========================================================================

There are 3 distinct procedures which take place during a 4.x diskless
client bootup.

   A. The RARP Process (Reverse Address Resolution Protocol)

      During this process the client broadcasts its 48 bit Ethernet
      address to its local network.  Any system that is running rarpd and
      has the client's Ethernet address in 'ethers' will take the hostname
      found in 'ethers' and look that hostname up in 'hosts'.  If the host
      is found, that system returns the IP address associated with that
      host to the client.  The client will then say
      "Using IP address xxxx = xxxx"
      (Except on sun4c clients, which are silent at this point)

   B.
      The TFTP Process (Trivial File Transfer Protocol)

      The client now uses the hexadecimal representation of its IP
      address to issue a tftp request across the net.  The server must
      have a symbolic link in its /tftpboot directory, named with that
      same hexadecimal number, pointing to boot.`arch -k`, where
      `arch -k` is one of {sun3, sun4, sun4c, or sun3x}, i.e.:

      lrwxrwxrwx  1 root  wheel  10 Sep 26 13:31 81911E1B -> boot.sun4c

      The server must also have the following link in the /tftpboot
      directory:

      lrwxrwxrwx  1 root  wheel   1 Jul 21  1989 tftpboot -> .

      The server must also have tftp uncommented in both /etc/inetd.conf
      and /etc/services (or in the YP 'services' map if running YP).

      Once the client successfully finds the boot file on the server, it
      downloads it into its local memory.  The boot PROM then executes
      the boot file just downloaded, which sends an RPC bootparam request
      out to the network.

   C. The Bootparam Process

      Any system on the same net as the client which is running
      rpc.bootparamd, and which also has 'bootparams' info for the
      client, will tell the client which system and path to NFS mount its
      root and swap file systems from.  The client will then attempt to
      mount its root and swap file systems from the server named in
      'bootparams'.

      The server must have the following /etc/exports entries for the
      client's root and swap, and must have run 'exportfs -a' :

      /export/root/penguin  -root=penguin,access=penguin
      /export/swap/penguin  -root=penguin,access=penguin

      (This can be verified by running exportfs with no options)

1. What the client console should look like when booting :
-----------------------------------------------------------
   >b le()
   Boot: le(0,0,0)
   Using IP Address 129.145.30.27 = 81911E1B
   Booting from tftp server at 129.145.30.15 = 81911E0F
           action here:  Spinning propeller .... -/|\-/|\|/-\| ,
          ( ** or ** incrementing numbers in a box on Sun4c clients )
   Downloaded xxxxx bytes from tftp server.
   Using IP Address 129.145.30.27 = 81911E1B
   hostname: penguin
   domainname: gotham
   server name 'batman'
   rootpathname '/export/root/penguin'
   root on batman:/export/root/penguin fstype nfs
   Boot: vmunix
   Size: #####+#####+#####
   ......
   ......
   ......
-----------------------------------------------------------

2. The Client<->Server Dialog

   A. What can go wrong during the RARP stage ?
-----------------------------------------------------------
  - Any blank lines, or trailing spaces on lines, in the 'ethers' file
    will cause RARP to fail.
-----------------------------------------------------------
  - Any leading 0s between colons in 'ethers' will cause RARP to fail.

    Correct  :  8:0:20:7:6:b9       penguin
    Incorrect:  8:00:20:07:06:b9    penguin
-----------------------------------------------------------
  - Uppercase hostnames for clients will cause RARP to fail.
    The lookup in 'ethers' succeeds, but gethostbyname() converts
    uppercase to lowercase, then looks up the lowercase name in 'hosts'
    and can't find it.  If running YP, the makedbm for
    hosts.{byname,byaddr} can be modified in /var/yp/Makefile to use the
    '-l' option to convert uppercase to lowercase.
-----------------------------------------------------------
  - If nit, pf, nbuf, or clone are commented out of the server's kernel,
    rarpd will fail to run.  The following lines must be included in the
    server's kernel configuration:

    pseudo-device   snit            # streams NIT
    pseudo-device   pf              # packet filter
    pseudo-device   nbuf            # NIT buffering module
    pseudo-device   clone
-----------------------------------------------------------

   B. What can go wrong during the TFTP stage ?
-----------------------------------------------------------
  - 'tftp timeout' :
    This is a common error when tftp is commented out of either
    /etc/inetd.conf or /etc/services.  This error can also occur if the
    hexadecimal representation of the client's IP address is missing or
    incorrect in /tftpboot, or if the tftpboot -> . link is missing in
    /tftpboot.
    Lowercase hexadecimal characters in the name of the boot file link
    will also cause this failure, i.e.:
    incorrect -> 81911e1b  instead of  correct -> 81911E1B.
-----------------------------------------------------------
  - 'file not found' :
    tftp is not able to find the boot.`arch -k` file.  This is common
    with 4/60 and 3/80 clients, when setup_exec has not been run for
    sun4c or sun3x respectively.
-----------------------------------------------------------
  - panic ......... :
    If setup_client was run with the wrong architecture (i.e., sun4
    instead of sun4c), the client will probably panic when the boot PROM
    tries to execute the boot file.
-----------------------------------------------------------

   C. What can go wrong during the Bootparam stage ?
-----------------------------------------------------------
  - bad dialog with bootparam server :
    This error is common if there is a 3rd party system on the same
    network as the client, and that 3rd party machine is also running
    RPC.  The 3rd party system may answer the client's bootparam request
    with an RPC_FAILED response before the real server can answer with an
    RPC_SUCCESS reply.  The diskless_boot_hang patch for bug #1018791
    will usually fix this problem; in cases where it doesn't, it may be
    necessary for the customer to upgrade their boot PROM to revision 3.0
    or greater.  The same patch also addresses slow booting problems.
-----------------------------------------------------------
  - If the client's NFS server as listed in 'bootparams' is down or on
    another network, the client will show on the console :

    hostname: penguin
    domainname: gotham
    server name 'batman'
    / Requesting Ethernet address for 129.145.30.15 = 81911E0F
-----------------------------------------------------------

   C. What can go wrong during the Bootparam stage ?
      (Continued)
-----------------------------------------------------------
  - NFS error 13 :
    This is an NFS write error message, which usually indicates that the
    client does not have root access to the indicated file system.  This
    usually happens when the server is exporting /export as well as
    /export/root/client and /export/swap/client.  If /export is already
    exported, the server will not be able to export its subdirectories
    as well.
-----------------------------------------------------------
  - null hostname returned from bootparam server
                       - or -
  - null domain returned from bootparam server :
    These error messages indicate that there is probably a Silicon
    Graphics system on the same network as the client.  If this is the
    case, have the customer kill the rpc.bootparamd on the Silicon
    Graphics systems if they are not boot servers.  If they are boot
    servers, then they can anonymous ftp to sgi.sgi.com , cd to sgi/src ,
    change to binary mode and get rpc.bootparamd.Z, or contact the
    Silicon Graphics Support hotline at 1-800-345-0222 for the SGI
    bootparamd patch.
-----------------------------------------------------------
  - clntkudp_callit retries exhausted :
    This can come from an extremely busy network, or from name lookup
    problems created by using a libc.so shared library with name
    resolver routines built-in.
-----------------------------------------------------------
  - bp_getclntent failed , bp_getclntkey failed :
    These messages will pop up if the name of the client in 'bootparams'
    is not the first hostname listed after the IP address in 'hosts'.
    This usually happens when the client is in a nameserver domain
    (i.e. hostname=penguin, but the entry in 'hosts' is
    129.145.30.27   penguin.sun.com penguin # ).
    Also be aware of servers which are using the libc.so library with
    resolver routines built-in.  This library bypasses both NIS(YP) and
    /etc/hosts, and looks only at the nameserver hosts database.
-----------------------------------------------------------

   C.
      What can go wrong during the Bootparam stage ? (Continued)
-----------------------------------------------------------
  - "whoami RPC call failed with status # \
     panic: vfs_mountroot: cannot mount root" :
    There is a bug in a DEC product that causes these symptoms.  So, if
    you have a VMS VAX on your Ethernet AND you are running 4.0 AND you
    experience the booting problems described here, go find the VAX
    system manager and ask whether the VAX is running something called
    the "ULTRIX BRIDGE".  ULTRIX BRIDGE is DEC's own version of the
    Wollongong package that allows VMS machines to use REAL datacomm
    protocols.  At any rate, DEC has a patch for this software.
-----------------------------------------------------------

3. SparcStation1 (4/60) anomaly :

   The default client vmunix will not boot properly.  You must remake
   the client kernel.

   To remake the SS1 client kernel as root :
-----------------------------------------------------------
   # cd /usr/share/sys/sun4c/conf
   # cp GENERIC CLIENT
   # vi CLIENT
         change =>  config vmunix swap generic
         to     =>  config vmunix root on type nfs swap on type nfs
   # config CLIENT
   # cd ../CLIENT ; make ; cp vmunix /export/root/penguin/vmunix
-----------------------------------------------------------
   Reboot the Sparcstation1 client.

3.1 3/80 anomaly :

   The DL80 config file in 4.0.3 has the following :
-----------------------------------------------------------
   % more /usr/share/sys/sun3x/conf/DL80
   config vmunix root on nfs
-----------------------------------------------------------
   which should be :
-----------------------------------------------------------
   % more /usr/share/sys/sun3x/conf/DL80
   config vmunix root on type nfs swap on type nfs
-----------------------------------------------------------

4. Booting off of Server's 2nd Ethernet (ie1, le1, ..)
   NOT SUPPORTED !!

   Suppose the server has 2 interfaces :

      ie0 = batman
      ie1 = batman-gw

   Server must run 'rarpd ie1 batman-gw'.
   Server's 'bootparams' must look like :
-----------------------------------------------------------
   % more /etc/bootparams
   penguin root=batman-gw:/export/root/penguin \
           swap=batman-gw:/export/swap/penguin
-----------------------------------------------------------
   Client's fstab ( /export/root/penguin/etc/fstab ) needs to have:
-----------------------------------------------------------
   % more /etc/fstab
   batman-gw:/export/root/penguin     /              nfs rw 0 0
   batman-gw:/export/exec/sun4c       /usr           nfs ro 0 0
   batman-gw:/export/exec/kvm/sun4c   /usr/kvm       nfs ro 0 0
   batman-gw:/export/share            /usr/share     nfs ro 0 0
   batman-gw:/home/penguin            /home/penguin  nfs rw 0 0
-----------------------------------------------------------

5. Troubleshooting/Debugging the Client Boot Process

   A. During RARP stage

   On the server :
-----------------------------------------------------------
   % ps ax | grep rarpd
     { you should see 2 rarpd processes for each interface.
       make a note of the lowest # PID }
   % trace -p PID_of_lowest_rarpd
     { you should see some output when the client broadcasts its
       ethernet address including something like :
       open ("/etc/hosts", 0, 0666) = 5 }
-----------------------------------------------------------
                          AND/OR
-----------------------------------------------------------
   % kill -9 both_rarpd_pids
     restart rarpds with the debug [-d] option ie.:
   % rarpd -d if# hostname      (4.1 usage : 'rarpd -a -d')
-----------------------------------------------------------
   On a third system run etherfind :
-----------------------------------------------------------
   % etherfind -rarp -o -broadcast
     { This is what you should see from a normal rarp request }
   Using interface le0
                                              icmp type
    lnth proto        source     destination   src port   dst port
      60  rarp old-broadcast   old-broadcast
-----------------------------------------------------------

   B.
      During the TFTP stage

   On a third system run etherfind :
-----------------------------------------------------------
   % etherfind -dstport tftp
     { This is what you should see from a normal tftp request }
     { If you don't see this, suspect tftp problems on the server }
   Using interface le0
                                              icmp type
    lnth proto        source     destination   src port   dst port
      65   udp       penguin          batman       1604       tftp
-----------------------------------------------------------

   C. During the Bootparam stage

   On the server :
   (Find and kill the rpc.bootparamd process, and restart)
-----------------------------------------------------------
   % rpc.bootparamd -d
     { this turns on debug mode }
     { watch for messages as the client boots }
     { This is what you should see from a normal bootparam request }
   Whoami returning name = penguin, router address = 129.145.30.21
-----------------------------------------------------------
   On a third system, run etherfind :
-----------------------------------------------------------
   % etherfind -r -host penguin
-----------------------------------------------------------
     { This is what you should see from a normal bootparam request }
-----------------------------------------------------------
   getfile_1: file is "batman" 129.145.30.15 "/export/root/penguin"
   UDP from penguin.1023 to network.sunrpc  108 bytes
    RPC Call portmapper PMAPPROC_CALLIT V2
   UDP from penguin.1023 to mtnview.sunrpc  108 bytes
    RPC Call portmapper PMAPPROC_CALLIT V2
      60   arp       penguin          batman
   UDP from penguin.1022 to batman.641  100 bytes
    RPC Call prog 100026 proc 2 V1
      60   arp       penguin          batman
-----------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Sunflash is an electronic mail news service from Sun Microsystems, Ft.
Lauderdale, Florida, USA.  It is targeted at Sun Users and Customers.
For additional information about SunFlash send mail to
info-sunflash@sunvice.East.Sun.COM

SunFlash is distributed via a hierarchy of aliases.  Try to address change
requests to the owner of the alias that you belong to.  If you want to be
added to the SunFlash alias, please contact the systems engineers at your
local Sun office and/or send mail to sunflash-request@sunvice.East.Sun.COM.

All prices, availability, and other statements relating to Sun or third
party products are valid in the U.S. only.  Please contact your local
Sales Representative for details of pricing and product availability in
your region.  Descriptions of, or references to, products or publications
within SunFlash do not imply an endorsement of that product or publication
by Sun Microsystems.

Address comments to the SunFlash editor (John McLaughlin) at
sun!sunvice!flash or flash@sunvice.East.Sun.COM.  (305) 776-7770.
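
Supplementary sketch: two of the failure causes described in this article
are purely a matter of formatting.  The boot file link in /tftpboot must
be named with the uppercase hexadecimal form of the client's IP address,
and 'ethers' entries must not carry leading 0s between colons.  The short
Python script below shows one way both values could be computed before
they are typed into /tftpboot or 'ethers'.  It is only an illustration
under stated assumptions: Python is not part of SunOS 4.x, and the
function names tftpboot_name and ethers_form are hypothetical helpers,
not Sun utilities.

```python
def tftpboot_name(ip):
    """Return the uppercase hex form of a dotted-quad IP address.

    This is the name the boot file link in /tftpboot must carry,
    e.g. 129.145.30.27 -> 81911E1B.  (Hypothetical helper name.)
    """
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError("not a dotted-quad IPv4 address: %r" % ip)
    octets = [int(part) for part in parts]
    if any(octet < 0 or octet > 255 for octet in octets):
        raise ValueError("octet out of range in: %r" % ip)
    # Two uppercase hex digits per octet; lowercase digits cause
    # the 'tftp timeout' failure described in this article.
    return "".join("%02X" % octet for octet in octets)


def ethers_form(mac):
    """Return an Ethernet address with leading 0s stripped per byte.

    This is the form the article says 'ethers' requires,
    e.g. 8:00:20:07:06:b9 -> 8:0:20:7:6:b9.  (Hypothetical helper name.)
    """
    fields = mac.split(":")
    if len(fields) != 6:
        raise ValueError("not a 48 bit Ethernet address: %r" % mac)
    # int(..., 16) drops leading zeros; "%x" prints lowercase hex,
    # matching the example addresses used in this article.
    return ":".join("%x" % int(field, 16) for field in fields)


if __name__ == "__main__":
    print(tftpboot_name("129.145.30.27"))   # prints 81911E1B
    print(ethers_form("8:00:20:07:06:b9"))  # prints 8:0:20:7:6:b9
```

For the example client penguin, the first call gives the link name that
should point at boot.sun4c, and the second gives the address as it should
appear in 'ethers'.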