Patch-ID# 113900-04
Keywords: security qstat qhost scheduler execd pe jobs qacct i18n l10n ssl
Synopsis: Sun Grid Engine, Enterprise Edition 5.3 Linux: maint./security patch
Date: Apr/07/2004

Install Requirements: Additional instructions may be listed below

Solaris Release:

SunOS Release:

Unbundled Product: Sun Grid Engine Enterprise Edition

Unbundled Release: 5.3

Xref:

Topic:

Relevant Architectures: i386

BugId's fixed with this patch:
    4713013 4749151 4756556 4756557 4760981 4769608 4775325 4776016
    4776754 4776821 4778757 4778758 4778762 4780316 4787598 4787623
    4790540 4790547 4790592 4791238 4791908 4792036 4794242 4795475
    4802171 4802831 4805423 4807677 4811230 4813188 4813965 4815774
    4815795 4816529 4816541 4818737 4818741 4819479 4822742 4822746
    4822799 4824104 4833346 4835832 4838549 4838595 4838636 4838650
    4841414 4842844 4842878 4844838 4845505 4847814 4847819 4851939
    4860391 4866711 4869772 4869784 4876169 4881949 4883714 4885719
    4885906 4885930 4886017 4886025 4886026 4893432 4930786 4930789
    4930793 4949917 4952236 4952767 4957760 4969825 5018669 5018695
    5018726 5018733 5018884 5019595 5019601 5019624 5019635 5020131
    5020134 5020139 5020141 5020143 5020153 5020278 5020371 5021405

Changes incorporated in this version:
    4969825 5018669 5018695 5018726 5018733 5018884 5019595 5019601
    5019624 5019635 5020131 5020134 5020139 5020141 5020143 5020153
    5020278 5020371 5021405

Patches accumulated and obsoleted by this patch:

Patches which conflict with this patch:

Patches required with this patch:

Obsoleted by:

Files included with this patch:

/bin/glinux/qacct
/bin/glinux/qalter
/bin/glinux/qconf
/bin/glinux/qdel
/bin/glinux/qhost
/bin/glinux/qmake
/bin/glinux/qmod
/bin/glinux/qmon
/bin/glinux/qsh
/bin/glinux/qstat
/bin/glinux/qsub
/bin/glinux/qtcsh
/bin/glinux/sge_commd
/bin/glinux/sge_coshepherd
/bin/glinux/sge_execd
/bin/glinux/sge_qmaster
/bin/glinux/sge_schedd
/bin/glinux/sge_shadowd
/bin/glinux/sge_shepherd
/bin/glinux/sgecommdcntl
/lib/glinux/libXltree.so
/utilbin/glinux/adminrun
/utilbin/glinux/checkprog
/utilbin/glinux/checkuser
/utilbin/glinux/filestat
/utilbin/glinux/gethostbyaddr
/utilbin/glinux/gethostbyname
/utilbin/glinux/gethostname
/utilbin/glinux/getservbyname
/utilbin/glinux/infotext
/utilbin/glinux/loadcheck
/utilbin/glinux/now
/utilbin/glinux/openssl
/utilbin/glinux/qrsh_starter
/utilbin/glinux/rlogin
/utilbin/glinux/rsh
/utilbin/glinux/rshd
/utilbin/glinux/testsuidroot
/utilbin/glinux/uidgid

Problem Description:

5021405 CSP reconnect problem of scheduler and execd
5020371 sge_shepherd creates world writable files
5020278 a colon in a job name breaks qacct
5020153 mail bomb upon abort with tightly integrated par jobs
5020143 qdel XXX.YY- will delete the first array task of job XXX
5020141 qsh and qlogin accepted the options -h and -hold_jid and ignored them later
5020139 a stored job template in qmon sets -hold_jid to a wrong value
5020134 qhost output broken for global consumables
5020131 renaming a user deletes the user
5019635 schedd_job_info=true causes large delays with parallel job scheduling
5019624 qselect/qstat -l selection wrongly considers load and utilization
5019601 "vmem" in qstat -j keeps the max value
5019595 Dateformat YYMMDDhhmm was interpreted wrong (qacct, qsub, qalter,...)
5018884 SSL vulnerabilities stated in Sun Alert 57524
5018733 Empty parameters crashes qstat and qhost
5018726 qalter lacks -dl option!
5018695 loadsensor doing output to stderr can block
5018669 qrsh/qlogin: "Connection refused" due to race condition in shepherd
4969825 not supported array task dependencies are not rejected

(from 113900-03)

4930786 global load values are ignored
4930789 An overwritten string attribute was ignored in the scheduler
4930793 minor issues with the sgeee ticket update interval
4949917 qmon seg faults with a user hold job from qtcsh qtask file
4952236 Broken mail option with SGE 5.3p4 qrsh
4952767 qrsh -notify doesn't work
4957760 Fix needed for CERT CA-2003-26 Multiple Vulnerabilities in SSL/TLS

(from 113900-02)

4749151 Adding user to CSP secure system fails on S2.6, 7, 8.
4775325 wrong qstat -j diagnosis message indicates not enough PE slots
4813965 tightly integrated parallel array jobs do not work
4815795 qstat -alarm broken
4819479 qhost -q -l arch=xx crashes if a grid execution host is down
4822742 SGEEE: deleted sharetree will show up again after qmaster restart
4822746 SGEEE: spooling integrity improvement missing for sharetree, projects and users
4822799 SGE(EE) cannot be installed on Solaris 10
4824104 SGEEE: qalter -p for running jobs
4833346 qsh/qrsh/qlogin might core with segmentation violation
4835832 NOTIFY_SUSP signal only sent for first suspension of job
4838549 maxujobs scheduler config functionality is broken
4838595 maxujobs does not count jobs with certain state
4838636 sharetree can't be modified
4838650 Array job tasks may set queue in error state when started
4841414 Unable to delete task array job with negative increment
4842844 some jobs may stay long time in transferring state for hosts with many slots
4842878 qdel -u does not delete all jobs of the user
4844838 sge_shepherd does not exit on SIGTERM
4845505 cannot qalter/qhold/qrls several tasks of same job
4847814 jobs in rescheduling state are not scheduled due to wrong ticket calculation
4847819 util/sge_update.sh fails to upgrade sge to sgeee
4851939 qmon->job control->pending jobs->Why? fails if not enough free slots in pe
4860391 qmake dumps core when starting recursive make calls
4866711 SGE_O_* variables incorrectly set for tasks of tightly integrated jobs
4869772 Linux limits > 2GB are not set correctly
4869784 qmon "Qmon" resource file contains syntax errors
4876169 qrsh -l =1 -now no cause sge_qmaster to crash
4881949 Parallel jobs exceeding wall clock time are not killed
4883714 SGEEE: qmaster crashes on qdel of tightly integrated parallel job w. usage_scal.
4885719 during installation error message about unset SGE_ROOT is printed
4885906 NSLOTS and NHOSTS incorrectly set in environment of tightly integrated tasks
4885930 failure of master task of a tightly integrated parallel job does not delete job
4886017 qstat -r -s z command aborts
4886025 queuenames in qstat and in gui need more characters
4886026 max_u_jobs settings rejects submission though limit not reached
4893432 upgrade to openssl 0.9.7a

(from 113900-01)

4713013 qacct may display incorrect accounting information
4756556 .cshrc error causes [pro|epi]log,pe-[start|stop] failure
4756557 non-resolvable hosts in host_aliases file cause wrong hostname resolving
4760981 Empty sge_request file causes submission error
4769608 qalter shows wrong priority number when using negative priorities with -p option
4776016 execd does res. consuming process tracking even if no job is to be controlled
4776754 complex values for user defined complexes are rejected with global host
4776821 qtcsh can't be used as normal tcsh
4778757 stepsize 0 in array job specification results in qmaster exception
4778758 memcpy leak in execd
4778762 Array jobs which contain only one task (id=1) will be handled as single job
4780316 race condition if signals are to be delivered in job's startup phase
4787598 schedd_job_info messages shown by qstat -j even if it is set to false
4787623 failover to shadow master leaves sge_schedd on the original qmaster host
4790540 sge_schedd process consumes more memory than needed if schedd_job_info=true
4790547 Job notification signals won't be delivered if user redefines suspend_method ...
4790592 conflicting policies can cause job being started and immediately suspended
4791238 SGE may create duplicate accounting entries for parallel jobs
4791908 job logging file exists but is empty in certain configurations
4792036 job arguments larger than 10k crash qmaster
4794242 wrong usage reported by qstat -j
4795475 qstat -f output broken for pe jobs on same queue
4802171 qacct -l selection broken
4802831 cannot set -C to null string as described in man qsub
4805423 STRING complex attribute handling with RELOP "!=" is broken
4807677 qrsh crash when command line arguments are longer than 4K
4811230 qconf -Muser and qconf -Auser report no success messages
4813188 qstat -r shows wrong dependencies
4815774 Uninitialized pointer cause segmentation fault in qsh/qrsh on submit only hosts
4816529 qmon crash when pressing Why for a list of selected jobs
4816541 no newline character at end of sge_aliases file may crash qsub
4818737 SGEEE: huge scheduling times if maxujobs is set
4818741 startup failure of qrsh job is reported as regular job exit

Patch Installation Instructions:
--------------------------------

Special Install Instructions:
-----------------------------

Important note if Sun Grid Engine has been installed with openSSL support
-------------------------------------------------------------------------

If Sun Grid Engine has been installed with openSSL support ("CSP mode")
prior to SGEEE 5.3p3 (which was linked with openSSL 0.9.6.c), the
certificates which have been installed with these versions are
incompatible with certificates installed with SGEEE 5.3p4 or later.

All such certificates need to be recreated after installing this patch
and before restarting Sun Grid Engine. Please refer to the Sun Grid
Engine Administration and User Manual for how to create new certificates
with the utility script "sge_ca", which comes with the distribution.

The reason for the incompatibility is a field name in the certificates
that changed between openSSL versions 0.9.6 and 0.9.7: "uniqueIdentifier"
has been renamed to "userId".

Note for bug id 5020371 ("sge_shepherd creates world writable files")
---------------------------------------------------------------------

If the execution daemon spool directory is located on NFS and user root
on the execution host does not have read/write permissions there (which
is often the case for security reasons), the shepherd process will
continue to create some of the files in its job directory with world
writable permissions.

If the NFS client has write permissions, the fix is effective without
further changes after patch installation. Otherwise, to make the fix
effective, the execution daemon spool directory must be located on a
local file system. A local spool directory is also recommended for
performance reasons.

1. Changing the execution daemon spool directory for all hosts
   simultaneously

   - there may be no running jobs in the cluster
   - shut down qmaster
   - shut down all execution daemons
   - edit the global cluster configuration file

        //common/configuration

     and change the path in the configuration value

        execd_spool_dir

   - restart qmaster
   - restart your execution daemons

2. Changing the execution daemon spool directory for each execution host
   individually:

   - no jobs may be running on the execution host where the spool
     directory is going to be changed
   - edit the local configuration for this execution host:

        % qconf -mconf

     and add the local spool directory:

        execd_spool_dir

   - shut down and restart the execution daemons

In addition to these notes, please read the full "Special Install
Instructions" section later in this file for the requirements under
which the patch itself can be installed.

tar.gz Patch Installation:
--------------------------

This patch in 'tar.gz' format cannot be installed with 'patchadd' on
Solaris systems.

The patch is installed by unpacking the 'tar.gz' file(s) in this
directory in . is usually your directory.

Once installed, this patch is not visible with the "showrev -p" command
on Solaris, and it cannot be backed out. You may make a backup copy of
the files which will be overwritten when this patch is installed.

Please read "Install Instructions" later in this file and carry out all
steps before you unpack the 'tar.gz' file(s) included in this patch.

This patch in 'tar.gz' format must not be installed if the original
package has been installed with 'pkgadd' on Solaris. In this case please
install the available patches for Sun Grid Engine, Enterprise Edition
from http://sunsolve.sun.com in 'pkgadd' format.
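
Because the patch cannot be backed out, the backup copy mentioned above is
worth making before anything is unpacked. The following is only a minimal
sketch of one way to do it, not part of the official procedure. The function
name "backup_overwritten" and all three arguments are hypothetical, and the
"-T" (files-from) option is assumed to be available, as it is in GNU tar on
Linux. It lists the members of a patch archive and saves those that already
exist as regular files below the installation directory.

```shell
# backup_overwritten PATCH_TARBALL INSTALL_DIR BACKUP_TARBALL
# Hypothetical helper: saves every regular file below INSTALL_DIR that
# unpacking PATCH_TARBALL would overwrite into BACKUP_TARBALL.
backup_overwritten() {
    patch_tarball=$1
    install_dir=$2
    backup_tarball=$3
    list_file=$(mktemp) || return 1

    # List the archive members and keep only those that already exist
    # as regular files in the installation directory.
    gzip -dc "$patch_tarball" | tar tf - | while IFS= read -r member; do
        if [ -f "$install_dir/$member" ]; then
            printf '%s\n' "$member"
        fi
    done > "$list_file"

    # Create the backup only if at least one file would be overwritten.
    # Assumption: tar supports -T (read member names from a file).
    if [ -s "$list_file" ]; then
        (cd "$install_dir" && tar cf - -T "$list_file") | gzip > "$backup_tarball"
    fi
    rm -f "$list_file"
}

# Example call (all paths are placeholders chosen for illustration):
#   backup_overwritten /tmp/patch.tar.gz "$SGE_ROOT" /tmp/sge-backup.tar.gz
```

Files that the patch adds for the first time are not part of the backup, so
restoring the backup archive does not remove them.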
The patch is installed by user root by unpacking the file(s) in the
directory where the original package has been installed:

   # cd
   # gzip -dc / | tar xvpf -

After installing the patch you should correct the file permissions if
your Sun Grid Engine installation is installed as an "admin user" system:

   # cd
   # util/setfileperm.sh

where is the username of the "admin_user" of your global cluster
configuration and is the group which you set during your initial
installation for the files of your Sun Grid Engine distribution.

Install Instructions:
---------------------

These installation instructions assume that you are running a
homogeneous Sun Grid Engine cluster where all hosts share the same
directory for the binaries. If you are running Sun Grid Engine in a
heterogeneous environment (mix of 32-bit and 64-bit binaries for Solaris
and/or other operating systems), it is only necessary to shut down the
daemons for the architecture for which the patch is applied.

If you installed the binaries on a local partition, you only need to
stop the Sun Grid Engine daemons on the host on which you are installing
the patch.

By default there may be no running jobs when the patch is installed.
There may be pending batch jobs, but no pending interactive jobs (qrsh,
qmake, qsh, qtcsh).

It is possible to install the patch with running batch jobs. To avoid a
failure of the active "sge_shepherd" binary it is necessary to move the
old shepherd binary (and copy it back prior to the installation of the
patch).

In no case is it supported to install the patch with running interactive
jobs, 'qmake' jobs or with running parallel jobs which use the tight
integration support (control_slaves=true is set in the PE configuration).
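
The move-and-copy-back step for the shepherd binary can be sketched as
follows. This is an illustration under stated assumptions, not an official
script: the function name "keep_old_shepherd" and its directory argument are
hypothetical, while the suffix ".sge53" follows the example used in this
README. Renaming with "mv" inside one directory keeps the file's inode, so
the renamed file is still the one that active shepherd processes were
started from, while the copy is a separate file.

```shell
# keep_old_shepherd BIN_DIR
# Hypothetical helper: BIN_DIR is the architecture-specific binary
# directory that contains sge_shepherd.
keep_old_shepherd() {
    bin_dir=$1
    [ -f "$bin_dir/sge_shepherd" ] || return 1

    # Rename, do not copy: within the same directory the file keeps its
    # inode, so it remains the file that running shepherds execute.
    mv "$bin_dir/sge_shepherd" "$bin_dir/sge_shepherd.sge53" || return 1

    # Copy it back under the original name; this copy is a new file
    # with its own inode.
    cp -p "$bin_dir/sge_shepherd.sge53" "$bin_dir/sge_shepherd"
}

# Example call (the directory is a placeholder chosen for illustration):
#   keep_old_shepherd "$SGE_ROOT/bin/glinux"
```

Comparing the output of "ls -i" for both files afterwards shows two different
inode numbers, which confirms that the renamed binary and the copy are
separate files.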
Stopping the Sun Grid Engine cluster from starting jobs
-------------------------------------------------------

Disable all queues so that no new jobs are started:

   # qmod -d '*'

Optional (only needed if there are running jobs which should continue to
run when the patch is installed):

   # cd $SGE_ROOT/bin
   # mv /sge_shepherd /sge_shepherd.sge53

It is important that the binary is moved with the "mv" command. It must
not be copied, because this could cause a crash of an active shepherd
process of a running job when the patch is installed.

Shutting down Sun Grid Engine qmaster and scheduler
---------------------------------------------------

You need to shut down (and restart) the qmaster and scheduler daemons and
all execution daemons on all Sun Grid Engine hosts.

Shut down all your execution hosts. Log in to all your execution hosts
and stop 'sge_execd' and 'sge_commd':

   # /etc/init.d/rcsge stop

Then log in to your qmaster machine and stop 'sge_qmaster', 'sge_schedd',
'sge_commd' and, if the machine is also an execution host, 'sge_execd':

   # /etc/init.d/rcsge stop

Now verify with the 'ps' command that all Sun Grid Engine daemons on all
hosts are stopped. If you decided to rename the shepherd binary so that
running batch jobs continue to run during the patch installation, you
must not kill the 'sge_shepherd' processes.

Installing the patch and restarting Sun Grid Engine
---------------------------------------------------

Now please install the patch by unpacking the 'tar.gz' files included in
this patch as outlined above.

After installing the patch you need to restart your cluster. Please log
in to your qmaster machine and enter:

   # /etc/init.d/rcsge

Now repeat this step on all your execution hosts.

After restarting Sun Grid Engine you may enable your queues again:

   # qmod -e '*'

If you renamed the shepherd binary, you may safely delete the old binary
once all jobs which were running prior to the patch installation have
finished.

README -- Last modified date: Wednesday, April 7, 2004