What is PBS?
Association: PBSPro - Posted on: 2005-06-09
PBS stands for Portable Batch System. You submit your jobs to PBS and it finds suitable machines on the cluster to run them, so you do not have to. It also lets the administrators of a machine make sure that no user over-uses it. The version of PBS on Fusion is PBSPro version 5.4.2. It may seem like a pain to have to submit jobs to the queue (PBS is essentially also queueing software), but it really will make your life easier. PBS runs your jobs as smoothly as possible by giving you the proper resources while keeping all users within policy as far as it can. You also do not need to worry about which nodes on the cluster are free, or keep track of which nodes your jobs went to.
How do I create a PBS script for running jobs?
Association: PBSPro - Posted on: 2009-08-21
A PBS script is a script that is used to submit jobs to the queue. There are many possibilities for a PBS script. Here is a simple example of a PBS script for running a serial job on Fusion.
#PBS -N jobname
#PBS -l select=1:ncpus=1:mpiprocs=1:mem=100m,walltime=0:20:30
#PBS -S /bin/sh
#PBS -W group_list="short_bio"
#PBS -j oe
#PBS -q short_bio
#PBS -M johndoe@ittc.ku.edu
#PBS -m abe
#PBS -o out.log
#cd to your execution directory
cd ~/programs/code
./a.out parameter1 parameter2
Please note that the "#PBS" prefix is not a comment. These lines are PBS directives that request certain resources and modify the behavior for how the queue handles the job.
#PBS -N jobname - This line tells PBS what the name of the job is. You should change "jobname" to whatever will help you keep track of that job. PBS does not care what your jobname is as long as it only contains numerals and letters.
#PBS -l select=1:ncpus=1:mpiprocs=1:mem=100m,walltime=0:20:30 - This tells PBS how many nodes you are requesting. Because this is a serial (one-processor) job, you should ask for only one CPU. The "mem" argument specifies the amount of memory requested for the job, with units in KB, MB, or GB (just use the suffixes k, m, or g). The "walltime" is the amount of time your job needs in hours:minutes:seconds. The sample script is asking for 20 minutes and 30 seconds.
#PBS -S /bin/sh - This tells PBS which shell to use. If your program requires bash, you should use bash. If it requires tcsh, use tcsh. By default, sh is a safe one to use if you are not sure what you are doing. Note that when you use a particular shell, its rc files (i.e. tcshrc and bashrc) DO get sourced. This lets you add any setup your job needs to those files.
#PBS -W group_list="short_bio" - This tells PBS which group to use when running the job. All queues, except the interactive queue, have access control lists. These lists are based on the Unix/Linux group. The name of the queue is also the group name you will need to use.
#PBS -j oe - The -j option joins the output files. The "oe" means to join the output and error files together. This is good for people who would like to know where the errors happened with respect to the output. If you would like separate files, you may omit this line.
#PBS -q short_bio - This is the line where you tell PBS which queue you would like to use. If you would like to use the long queue, replace the word "short" with "long". If this line is not included, the job defaults to the "short" queue.
#PBS -M johndoe@ittc.ku.edu - You should use this line if you would like emails when your job does something. The next command allows you to control when you get emails. If you do not want emails sent to you to notify you of the status of your job, you may omit this and the next line.
#PBS -m abe - This tells PBS when it should email you. The "a" stands for abort: PBS emails you when your job gets killed, either by you or by the superuser. The "b" option emails you when your job begins. Sometimes Fusion is full when you submit your jobs, and this notifies you when your job actually starts. Finally, the "e" option has PBS email you when your job exits, whether it ends successfully or dies for any reason.
#PBS -o out.log - This line sets the name of your output file. Anything your program would print to the screen is written to this file.
Everything below the #PBS directives is up to you. You can put anything there you would like and it will run just as it would in a shell script. Write it in the language of the shell you chose with -S. Also, in this case, the "#cd ..." line is a comment.
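The walltime field is easy to get wrong if you think of a run in seconds or minutes. As a minimal sketch, the helper below (to_walltime is a made-up name, not part of PBS) converts a number of seconds into the hours:minutes:seconds form used in the -l line above.

```shell
#!/bin/sh
# to_walltime is a hypothetical helper, not a PBS command.
# It converts a whole number of seconds into hours:minutes:seconds.
to_walltime() {
    total=$1
    hours=$((total / 3600))
    minutes=$(((total % 3600) / 60))
    seconds=$((total % 60))
    printf '%d:%02d:%02d\n' "$hours" "$minutes" "$seconds"
}

# 20 minutes and 30 seconds, as requested by the sample script
to_walltime 1230   # prints 0:20:30
```

The printed value can be pasted after "walltime=" in your own script.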
Running an MPI or other parallel job requires a PBS script pretty much like the one for the serial job. Here is a sample PBS script for an MPI job.
#PBS -N mpitesting
#PBS -l select=4:ncpus=2:mpiprocs=2:mem=500m,walltime=0:20:30
#PBS -S /bin/sh
#PBS -q short_ittc
#PBS -W group_list="short_ittc"
#PBS -M johndoe@ittc.ku.edu
#PBS -m abe
#PBS -j oe
#PBS -o out.log
cd ~/mpitest
mpirun -np 8 -hf $PBS_NODEFILE /bio/users/
As mentioned before, the script for a parallel job looks very similar to that of a serial job. The main difference here is the following line:
#PBS -l select=4:ncpus=2:mpiprocs=2:mem=500m,walltime=0:20:30 - This line changes the number of nodes to 4. ncpus and mpiprocs should be set to the same value, the number of processors per node. This line also requests 500 MB of memory per node. The total number of processors requested is the number of nodes times the processors per node, i.e. 4 x 2 = 8 processors.
For a parallel job, a good thing to know is that you can get a list of the nodes you are running on by typing "cat $PBS_NODEFILE". You can redirect this into a file and go from there if necessary.
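As a rough sketch of what you can do with the node file, the lines below count the processes and the distinct nodes listed in it. This assumes the file holds one line per MPI process. The fallback file with node01 to node04 is made up so the sketch can also be tried outside a job; inside a job PBS sets PBS_NODEFILE for you.

```shell
#!/bin/sh
# Inside a running job PBS sets PBS_NODEFILE. The fallback below is a made-up
# sample (hypothetical node names) so the sketch can also be tried by hand.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    for node in node01 node02 node03 node04; do
        echo "$node"
        echo "$node"
    done > "$PBS_NODEFILE"
fi

# Assumption: one line per MPI process, so unique lines give the node count.
nprocs=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
nnodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
echo "processes: $nprocs, nodes: $nnodes"
```

With the sample MPI script above you would expect 8 processes on 4 nodes.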
To run a two processor SMP job, you should ask for a job in exclusive mode. Just add the following to your PBS script.
#PBS -l select=1:ncpus=2:mpiprocs=2:mem=500m,walltime=0:20:30
This will give you two processors on one node, with 500 MB of memory. If the two processors do not need to share a node, use the following instead.
#PBS -l select=2:ncpus=1:mpiprocs=1:mem=500m,walltime=0:20:30
This will assign 2 processors to your job, even if they are not on the same node. The total memory assigned is 1000 MB.
If you would like an 8 processor job on four nodes for any reason (such as running on the interactive queue), you can use the following.
#PBS -l select=4:ncpus=2:mpiprocs=2,walltime=0:20:30
Notice that the nodes are now four and the processors per node is still two.
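The arithmetic behind these select lines can be checked with a short sketch. The helper below (summarize_select is a made-up name, not a PBS command) multiplies the number of chunks by mpiprocs and by mem. It assumes mem is given in MB with the "m" suffix, as in the examples above.

```shell
#!/bin/sh
# summarize_select is a hypothetical helper, not a PBS command.
# It reads a request such as "select=4:ncpus=2:mpiprocs=2:mem=500m"
# and prints the totals over all chunks. Assumes mem uses the "m" suffix.
summarize_select() {
    spec=${1%%,*}            # drop ",walltime=..." if present
    chunks=1; mpiprocs=1; mem=0
    old_ifs=$IFS
    IFS=:
    for field in $spec; do
        case $field in
            select=*)   chunks=${field#select=} ;;
            mpiprocs=*) mpiprocs=${field#mpiprocs=} ;;
            mem=*)      mem=${field#mem=}; mem=${mem%m} ;;
        esac
    done
    IFS=$old_ifs
    echo "processes=$((chunks * mpiprocs)) mem_mb=$((chunks * mem))"
}

summarize_select "select=4:ncpus=2:mpiprocs=2:mem=500m,walltime=0:20:30"
summarize_select "select=2:ncpus=1:mpiprocs=1:mem=500m,walltime=0:20:30"
```

The first call prints processes=8 mem_mb=2000 and the second prints processes=2 mem_mb=1000, matching the totals described above.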
How do I submit a job?
Association: PBSPro - Posted on: 2005-06-09
To submit a job, create a PBS script and submit it to the queue by typing "qsub PBS_script" where PBS_script is the name of the PBS script you created.
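qsub prints the ID of the new job, so it can be handy to capture it for later commands. The wrapper below is only a sketch: submit_job and myjob.pbs are made-up names, and the function simply checks that the script exists before handing it to qsub.

```shell
#!/bin/sh
# submit_job is a hypothetical wrapper around qsub.
# It checks that the script exists, submits it, and passes through the
# job ID that qsub writes to standard output (for example 700.fusion).
submit_job() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "no such PBS script: $script" >&2
        return 1
    fi
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; run this on a machine with PBS installed" >&2
        return 127
    fi
    qsub "$script"
}

# Usage on the cluster (myjob.pbs is a placeholder file name):
#   jobid=$(submit_job myjob.pbs) && echo "submitted as $jobid"
```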
My job needs user intervention. Is there an interactive mode for PBS?
Association: PBSPro - Posted on: 2005-06-09
Yes. Submit the job with the "-I" flag, "qsub -I PBS_script", and you will be logged onto the first node that will run your job. From there you can run whatever your job needs and type in inputs as necessary.
What queues are available to me?
Association: PBSPro - Posted on: 2006-02-13
There are more queues than just short_bio. At present, the regular queues are organized by their lengths: short, medium, long, and max. If you have special requirements, contact the system administrator, who may be able to set up a custom queue for your purposes.
The full list of queues can be found here:
Queues
Currently all queues have ACLs (access control lists) except interactive, which anyone can use. Contact biohelp@ittc.ku.edu to be added to any queues other than those assigned to you.
For more information on a certain queue, type:
qmgr -c "print queue queuename"
The queuename in the above command is the name of the queue you want more information on.
Now that I've submitted my job, how do I check the status?
Association: PBSPro - Posted on: 2006-02-13
To check a job submitted to PBS, try using the command, "qstat -u username" where "username" is your username on Fusion. You will get information about your job in the following order.
| Job ID | Username | Queue | Jobname | SessID | NDS | TSK | Req'd Memory | Req'd Time | S | Elap Time |
| 700.fusion | johndoe | short_bio | mpitesting | 2365 | 2 | 4 | 512mb | 10:00 | R | 5:30 |
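If you only care about the state of your jobs, the listing can be filtered with awk. The sketch below is not a definitive recipe: it assumes the whitespace-separated columns appear in the order shown above (job ID first, state tenth), and it uses the sample row in place of real qstat output so it can be tried anywhere. sample_qstat and job_states are made-up names.

```shell
#!/bin/sh
# Sample line copied from the table above, in whitespace-separated form.
# On the cluster, pipe the real command instead:  qstat -u johndoe | job_states johndoe
sample_qstat() {
    echo "700.fusion johndoe short_bio mpitesting 2365 2 4 512mb 10:00 R 5:30"
}

# Column 1 is the job ID, column 10 is the state (R = running, Q = queued).
# Header lines are skipped because their second column is not the username.
job_states() {
    awk -v user="$1" '$2 == user { print $1, $10 }'
}

sample_qstat | job_states johndoe   # prints 700.fusion R
```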
I don't see any output from my job. What's going on?
Association: PBSPro - Posted on: 2005-06-09
Some jobs will not output information into your output files until your job is complete. However, PBS does keep this info in a file. To see this file, add the following into your PBS script.
OUT=$(echo $PBS_JOBID | cut -f 1 -d .)
tail -f /var/spool/PBS/spool/$OUT.prair.OU >> ~/output.log&
How do I kill a job?
Association: PBSPro - Posted on: 2005-06-09
Killing a job is simple. Just type "qdel JobID". The JobID is the PBS Job ID as mentioned in the sample above. If this does not work, you can use "qdel -Wforce JobID".
My job was killed. What should I do if I don't know why?
Association: PBSPro - Posted on: 2005-06-09
Take a look at the error log. If you joined the output and error files, it will be "jobname.JobID.o", or whatever name you chose with a line such as "#PBS -o output.txt". If you did not join the output and error files, it will have the suffix "*.e". Take a look at those files and see if you can figure anything out. If not, please email the errors to biohelp@ittc.ku.edu for assistance.
Why shouldn't I just ask for the maximum amount of resources possible at all times?
Association: PBSPro - Posted on: 2005-06-09
It almost sounds like a good idea to hog all the resources you can, but it is actually to your disadvantage to do so. PBS tries to load-balance the jobs, so if it sees that a smaller job can run before a larger job will be able to start, it will run the smaller job in order to use as many resources as possible. So, if your job only needs 10 hours of runtime on 6 CPUs and you ask for 72 hours, PBS will first run a job that needs 4 CPUs and 5 hours if 6 CPUs will not be available for 6 hours. If you see a job that was submitted after yours running before yours, this is probably the reason why. Also, using exclusive mode sounds like a good idea, but please note that a node with nothing else running on it becomes available less often than two nodes with one CPU open each. In general, ASKING ONLY FOR THE RESOURCES YOU NEED AND NOT MORE HELPS EVERYBODY, INCLUDING YOU.
Is there a user's manual available?
Association: PBSPro - Posted on: 2005-06-09
Associated File: PBS User Guide
Yes, the PBS user's manual is available; it is attached to this FAQ. You can also use the man pages. There are many more commands that you can use with PBS than are covered here. I would add them here, but there is a reason why the user's manual is very, very long. The user's manual will help you use them.
What do I need to run through PBS and what can I run on the head node?
Association: PBSPro - Posted on: 2005-06-09
In general, all jobs need to go through PBS. You can compile, compress/uncompress, tar/untar, edit, and look through directories on the head node. Of course, you can also monitor jobs and run other everyday commands on the head node as well. Any command not listed above that may take up resources should be run through PBS. If you have any questions, ask support for assistance.
Why do I have to put ".fusion" on the end of my qdel, qhold, ... commands, when in the past I did not have to?
Association: PBSPro - Posted on: 2006-06-21
PBS is set up so that you can execute those commands from pretty much anywhere in the world, as long as the machine executing them has access to the PBS server. This also means you can access multiple PBS servers from a single machine. For PBS to know which server you want to talk to, the extension ".fusion" (for the Bioinformatics cluster at ITTC) is needed to route the request to the correct server. PBS defaults to the local machine, which is why you did not need the extension while logged into Fusion.
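If you keep forgetting the suffix, a small helper can add it for you. This is only a sketch: qualify_jobid is a made-up name, not a PBS command, and "fusion" is used as the default server because that is the extension described above.

```shell
#!/bin/sh
# qualify_jobid is a hypothetical helper, not a PBS command.
# It appends the server name to a bare job number and leaves an
# already-qualified job ID unchanged.
qualify_jobid() {
    jobid=$1
    server=${2:-fusion}
    case $jobid in
        *.*) echo "$jobid" ;;
        *)   echo "$jobid.$server" ;;
    esac
}

qualify_jobid 700          # prints 700.fusion
qualify_jobid 700.fusion   # prints 700.fusion
# On a remote machine you could then run:  qdel "$(qualify_jobid 700)"
```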
How do I submit a job that uses InfiniBand?
Association: PBSPro - Posted on: 2009-08-26
A job using InfiniBand has to run on a set of nodes sharing the same InfiniBand switch. On the Bioinformatics cluster, this is done by setting a flag in your PBSPro submit script.
First, please read the PBSPro job submission FAQ to learn how to submit jobs to the cluster. The -l flag sets the node requirements for your job. Your script must use the following syntax:
-l select=(# of nodes):ncpus=(# cpus per node):ibc4=True
Here, ibc4 is a custom resource defined as True only on rack 4. This ensures that only nodes from rack 4 will be used to fill the job.
The custom resources, ibc5, ibc6, ibc7, and ibc8, are similarly defined to allow jobs to run only on racks 5-8. Users must choose on which rack to run each job and set the value of ibc4, ibc5, ibc6, ibc7, or ibc8 to True accordingly.
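To avoid typing the wrong resource name, the directive can be generated from the rack number. The helper below is a sketch under the rules stated above (racks 4 to 8, one ibcN resource per rack); ib_select is a made-up name, not a PBS command.

```shell
#!/bin/sh
# ib_select is a hypothetical helper, not a PBS command.
# Arguments: number of nodes, CPUs per node, rack number (4 to 8).
ib_select() {
    nodes=$1; ncpus=$2; rack=$3
    case $rack in
        4|5|6|7|8) ;;
        *) echo "rack must be between 4 and 8" >&2; return 1 ;;
    esac
    echo "#PBS -l select=${nodes}:ncpus=${ncpus}:ibc${rack}=True"
}

ib_select 4 2 5   # prints #PBS -l select=4:ncpus=2:ibc5=True
```

Paste the printed line into your submit script in place of the usual -l line, adding mem and walltime as needed.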
Copyright © 2008 by the University of Kansas
