Batch Processing

From TAMUQ Research Computing User Documentation Wiki
Jump to navigation Jump to search


What is Batch Processing?

“Batch processing” is the execution of a series of programs (also referred to as “jobs”) on a computer without manual intervention.

We could say that the normal way of running programs on any computer is by directly launching them using whatever mechanism the operating system provides. For instance, in windows, one would double click a program icon on the desktop, or select the relevant menu item from “Start --> All Programs”. Likewise, on a linux system, one could type the name of the program’s executable file at the command prompt and hit enter. In the world of high performance computing, we often refer to this paradigm as “interactive processing”. In other words, the user manually launches a program and then interacts with it directly for the duration of its run‐time (with mouse clicks, keystrokes, etc).

In contrast to this, the alternative batch processing paradigm automates an entire workflow by creating a set of instructions that tell the computer which program to run, where the program should find the required input data it needs, and where it should store its results. These marching orders can then be handed off to an intermediate software subsystem, essentially a middleman, which takes care of starting the program, feeding it input, collecting output, and so on.

Why Use It?

There are many benefits to employing the batch processing paradigm. The most important of these is that it allows the effective sharing of computing resources among many users and programs. On busy, shared systems, batch jobs can be queued to run as soon as resources become available. A good analogy for this is the use of “take‐a‐number” systems at banks where newly arriving customers pull a numbered ticket from a dispensing machine and use it to determine their turn to obtain service. Batch systems can implement very simple to very complex management‐defined “policies” based on which they assign priorities to incoming jobs and define limits on who can use how much of a shared system, at what times, for how long… and so on.

Another benefit of batch processing is that it promotes efficient use of computing resources. Consider how much time is wasted in manual human interaction with software, during which the CPU and other resources lie idle. Computers, if they can be told exactly what to do (with batch jobs, for instance), can work faster by several orders of magnitude, improving utilization of expensive computational resources and maximizing the system owner’s return on investment.

Batch Schedulers & Queues

Batch system software has been implemented by many vendors over the years. Among the popular implementations in the unix and linux space are products such as PBS, LSF, LoadLeveler, SGE, and SLURM. Our site uses the “PBS Professional” product, where PBS is short for Portable Batch System.

Batch systems such as PBS are also often referred to as “batch schedulers” or “batch queuing systems”, which brings us to the concept of a queue. In order to regulate the flow of jobs through the system, batch schedulers allow administrators to define logical constructs called queues, in which incoming jobs are queued, awaiting their turn to execute.

Each queue is configured with a set of attributes such as a name, queue resource limits, and job run limits. In this sense, queues may also be conceptualized as containers for different “classes” of jobs. For instance, certain jobs may require huge amounts of CPU time but not much in the way of memory. In contrast, other jobs could require large amounts of memory, but only run for short periods of time. Queues could therefore serve as a way of segregating different classes of workload to improve the overall throughput of jobs in the system. Again, by way of analogy, we can imagine the checkout queues at a supermarket. Shoppers with fewer than 10 items generally benefit by having separate express queues so they can avoid the longer wait times in the regular queues.

How the User Communicates with PBS

One of the fundamental roles of PBS is that it constantly monitors all compute nodes within the system and keeps a tab on how busy the system is. It knows which nodes are available for new jobs, how much memory and how many CPUs are currently being used on the busy nodes, and so on. Consequently, PBS is in the best position of allocating the required resources to new incoming user jobs. But how can the user tell it what s(he) needs and how can s(he) get PBS to launch the required program?

This may be accomplished by creating a PBS job file, a plain text file that can be created in any text editor on the system. Using a defined syntax, the user can ask for a certain amount of resources to run their workload. By resources, we mean primarily the following three things:

  1. amount of time the user needs access to the CPU(s) – this resource is known as “walltime”
  2. the number of CPUs needed – this is the “ncpus” resource (if not explicitly requested, the default value is 1)
  3. amount of memory the user’s program will require – this is the “mem” resource

In some cases, users would also specify in the job file the name of the PBS queue to which they wish to submit their job (e.g. “graphics”, “fat”, etc.). Normally, however, PBS automatically routes a user job to an appropriate queue based on amounts of various resources requested by that particular job. More will be mentioned about this issue below.

In addition to requests for resources, the job file typically also contains user commands, such as the command to launch the program needed by the user.

Creating a PBS Job File

Let’s take a look at a sample job file that runs a program called myProg.exe. Note that we may interchangeably refer to this as the job file or the job script in the discussion below.

#PBS –N sillyJob
#PBS -l walltime=7:30:00
#PBS -l mem=8gb
#PBS –l ncpus=1
#PBS –q fat
#PBS -j oe

# This line is a comment, ignored by PBS

cd $PBS_O_WORKDIR
echo "My job is about to start..."

./myProg.exe

echo "...my job has now finished."

Let us now dissect this file and understand its meaning.

PBS job files begin with several PBS directives. Directives are lines that begin with the string #PBS followed by a “switch” (i.e. –N, -l, -q, etc.) and usually followed by values. The directives in our sample file can be understood as follows:

#PBS –N sillyJob This directive assigns a label to your job (in this example, the string “sillyJob”). This is useful while your job is running and you need to monitor it with the qstat -a command. It can be indispensable if you have multiple jobs running at the same time and you need a way to distinguish among them in order to track their progress. Labels should be both short and meaningful. Only the first 10 characters of the label are displayed in the output of the qstat -a command that monitors currently running jobs. The label may not be longer than 15 characters as this will lead to the job being rejected by PBS.
#PBS -l walltime=7:30:00 This directive specifies the maximum amount of time which a job is requesting for its execution. In this example, if the job continues to run for longer than the 7 hours and 30 minutes it requested, PBS will forcibly terminate it for exceeding its walltime limit. As users become familiar with their typical workloads, they should only request walltime values that comfortably satisfy their needs, and abstain from using extravagantly large values (“extravagant” in relation to what they actually require).
#PBS -l mem=8gb This directive specifies the maximum amount of physical memory requested by the job (in this example, 8 gigabytes). Do not request more than 62gb of memory unless you are submitting jobs to the “fat” queue, where up to 254gb is possible. If you do not explicitly request a certain amount of memory, a default value will be assigned (for most queues that would be 2gb).
#PBS -l ncpus=1 If your job is confined to a single node, but can use multiple CPUs, you may use this directive to request the required numbers of CPUs. On raad, this value can be as large as 16 for any given compute node (or 32 if you are submitting the job to the fat queue). If you do not explicitly request cpus, the default value of 1 will be assigned. In our example file, we therefore did not really need to request ncpus=1 (we could have omitted this directive from the file altogether and received a default of 1 anyway).
#PBS -q fat This specifies the name of the PBS queue a job is requesting to be queued in. The directive must ONLY be used by jobs intending to run in the “fat” or “graphics” queues, which send jobs to special nodes rather than to the regular compute nodes. For all other jobs, the -q directive MUST NOT appear in the job file.
#PBS -j oe Normally, when a command runs it prints its output to the screen. This output often includes both normal (standard) output as well as error output. Within a batch job, because the screen will not be available to the user’s program, the output must be stored in a file instead. This directive tells PBS to place both types of output into the same output file, which the user may examine once the job has finished. The name of this file will be generated by PBS automatically (more on this below).

Now that we’re done looking at the directives let us examine the commands that follow the directives in our sample job.

Note that any text to the right of a single “#” character on its own (without an attached “PBS”) is interpreted as a comment, and ignored by PBS when the job script is executed. Such is the case with the first non‐empty line we encounter in our sample job script after the PBS directives.

The cd $PBS_O_WORKDIR command instructs the job script to change its current working directory to the one specified in the value of the PBS_O_WORKDIR variable. This is one of several special variables created dynamically by PBS (and separately for every submitted job) whose values are available to the script running the job once job execution starts. Specifically, the PBS_O_WORKDIR variable contains the path of the directory from which the user submitted her job. If we do not cd to this directory in our job script, the default working directory for the job is always the submitting user’s home directory, e.g. /panfs/vol/f/fachaud74.

The common practice is for users to create a separate sub‐directory for every simulation that needs to run, placing all required files (including the PBS job script) and data for that simulation in that directory, and launching the PBS job from that directory. When this practice is followed, it is necessary to cd to $PBS_O_WORKDIR within the job script.

The echo command (both before and after the line launching the myProg.exe executable) is simply a kind of print statement to print the specified strings to the screen as an indication that the program is about to start and then later that it has finished. These are obviously optional and have no bearing on the actual work performed by the job.

Submitting and Monitoring a PBS Job

Assume we saved the text in the preceding section into a file called “silly.job”. All we now need to do to hand this job over to PBS (i.e. to “submit” it) is to issue the following command:

qsub silly.job

Note that when a qsub command is issued, PBS prints a job ID to the terminal before the next command prompt appears. The job ID looks something like the string “30665.raad‐mgmt” on our HPC Cluster.

If we wanted to monitor the progress of this job on the system, we would issue the “qstat –a” command and in response see the following kind of output:

[fachaud74@raad ~]$ qstat -a 

raad-mgmt: 

                                                            Req'd  Req'd   Elap 
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time 
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
11901.raad-mgmt faelmel4 par_extr HSE_45_Tio  82272   1 16  47gb   336:0 R 320:2 
11904.raad-mgmt faelmel4 par_extr HSE_25_Tio  31634   1 16  47gb   336:0 R 320:1 
13653.raad-mgmt faelmel4 par_extr HSe_35_Tio  39266   1 16  47gb   336:0 R 152:2 
13655.raad-mgmt faelmel4 par_long 35_Tio      28973   1 16  47gb   168:0 R 151:5 
13656.raad-mgmt faelmel4 par_long 45_Tio       7879   1 16  47gb   168:0 R 151:5 
13796.raad-mgmt wacheng7 par_extr Suqoor-L-1 117959   1  8  30gb   250:0 R 123:2 
14729.raad-mgmt fahasan1 par_long Hasan-ip-C  91038   1 16  48gb   168:0 R 51:51
14767.raad-mgmt fahasan1 par_long Ph-tBu-c2    3899   1 16  48gb   168:0 R 30:22
14768.raad-mgmt fahasan1 par_long Ph-tBu-c2     630   1 16  48gb   168:0 R 30:22
14769.raad-mgmt fahasan1 par_long Ph-tBu-c2+ 105807   1 16  48gb   168:0 R 29:56
14770.raad-mgmt fahasan1 par_long Ph-tBu-c2+  97595   1 16  48gb   168:0 R 29:56
14774.raad-mgmt fahasan1 par_long Ph-tBu-c2+  19970   1 16  48gb   168:0 R 28:09
14775.raad-mgmt fahasan1 par_long Ph-tBu-c2+  39062   1 16  48gb   168:0 R 28:09
14776.raad-mgmt fahasan1 par_long Ph-tBu-c2+ 117429   1 16  48gb   168:0 R 28:08
14777.raad-mgmt fahasan1 par_long Ph-tBu-c2    7761   1 16  48gb   168:0 R 28:08
14778.raad-mgmt fahasan1 par_long Ph-tBu-c2+   7347   1 16  48gb   168:0 R 28:08
14790.raad-mgmt kokakos1 ser_long Post        26896   1  1   4gb   24:00 R 21:26
14797.raad-mgmt otmoult3 par_long submiscrip  95417   1 16   1gb   99:00 R 06:29
14798.raad-mgmt otmoult3 par_long submiscrip 100430   1 16   1gb   99:00 R 06:27
14799.raad-mgmt otmoult3 par_long submiscrip  18206   1 16   1gb   99:00 R 06:24
14800.raad-mgmt otmoult3 par_long submiscrip  34722   1 16   1gb   99:00 R 06:21
14802.raad-mgmt otmoult3 par_long submiscrip  77991   1 16   1gb   99:00 R 06:01
14803.raad-mgmt faelmel4 par_long hse_bare_S 108792   1 16  47gb   168:0 R 05:59
14804.raad-mgmt otmoult3 par_long submiscrip 107261   1 16   1gb   99:00 R 06:00
14805.raad-mgmt otmoult3 par_long submiscrip  18237   1 16   1gb   99:00 R 05:58
14806.raad-mgmt fachaud7 fat      sillyJob    49417   1  1   8gb    7:30 R 00:01

[fachaud74@raad ~]$

Notice that the output shows all jobs running on the system, with our sample job at the end of the listing. The “Jobname” column lists our job as “sillyJob” the name we gave it with the #PBS –N directive. The TSK column shows the number of CPUs we requested, which was also 1. We’re also shown as having requested 8 GB of memory and 7 hours 30 minutes of time, while at this point our jobs seems only to have used 1 minute of its allocated time (the “Elap Time” column). All jobs in the listing are in the “Running” state (R). Often when the system or a particular queue is heavily used, some jobs will be in a queued state (Q) instead. When jobs do get queued, they will wait until some other job in their queue finishes, giving them a chance to start execution. This gives you a taste of how PBS queuing and scheduling works.

If you wanted more detailed information on your job, you could type:

 qstat –f 30665.raad-mgmt

where 30665.raad‐mgmt is your job ID. If you wanted, you could also kill your job with:

qdel 30665.raad-mgmt

Viewing Job Output

By default PBS will write screen output from a job to files that have names with the following format:


Jobname.oJobID

This file would contain the standard (non‐error) output that would otherwise have been written to the screen. In our sample job, the strings “My job is about to start...” and “…my job has now finished.” would both be found in this file once the job completes execution. Furthermore, if our myProg.exe program normally produced output for the screen, that output too would be found in this file.

Jobname.eJobID

This file would contain the error output that would otherwise have been written to the screen. If something were to go wrong with the execution of myProg.exe in our sample program, we could expect to find relevant warning and or error messages for the problem in this file.

However, since in our case we used the PBS directive #PBS -j oe in our script, the non‐error and the error outputs are both written to the Jobname.oJobID file alone. In our example job, the actual name of this file would be “sillyJob.o30665”, and it would be created in the same directory from which we submitted our job.

Interactive Batch Jobs

Given that we drew a distinction between interactive and batch processing earlier in our explanation, the phrase “interactive batch job” would appear to be an oxymoron. Putting this apparent contradiction aside for now, the fact is that even on clusters like raad users do sometimes have a need to work with their applications interactively. Given the architecture of clusters, it would not be advisable to allow such users to run applications interactively on the head node itself. All user computational activity should take place on the compute nodes. Consequently, in order to obtain access to a compute node, even for interactive use, we still have to make the request via PBS. Remember, PBS is the intermediary that knows which compute nodes have resources available for use at any given time. When we make a request for interactive use of the system (via a PBS batch file) we call the resulting job an “interactive batch job” (or more conveniently, simply and "interactive job"). An interactive job is not an automated workflow as true batch jobs would be; instead, it simply gives us access to the command prompt on some compute node deputed by PBS to serve our needs for a specified amount of time. From there on, we use the allocated compute node in interactive fashion, and when finished with our work, we type “exit” at the compute node’s command prompt to end our interactive job.

Example Job File

The key to creating an interactive job with PBS is the use of the "–I" and "-V" directives. First, we would create a plain text file with the following contents:

#PBS –I
#PBS -V
#PBS –N myMatjob
#PBS –l walltime=01:35:00
#PBS –l ncpus=2,mem=4gb

Typically, a user will create several such job files, each with a varying amount of the requested resources such as walltime or memory, etc. and then launch any particular job instance using one of those files depending on job needs at the time. Note that interactive job files contain no executable commands, only PBS directives as seen above. Furthermore, an interactive job file is application agnostic, so that once you are granted an interactive session on some compute node, you could launch any application on that node.

Earlier, we have already covered the meaning of the PBS directives used above. Only the –I and –V directive switches are new here. The -I turns an otherwise regular batch job into an interactive job. What this implies is that while the job will be queued and scheduled like any other PBS batch job, when executed, the standard input, output, and error streams of the job are connected to the terminal session in which the job is submitted. The –V declares that all environment variables in the qsub command’s environment are to be exported to the batch job. Some of those variables are used later to redirect the display so that remote applications can display their graphical interfaces on the user’s local system. In reality though, the technical details of what -I and -V are doing are not relevant for most users; they must simply use the directives when they need to run interactive jobs.


Initial (one-time) Setup for Interactive Jobs

Before we launch our job, a little housekeeping is in order first, and this involves the following:

We probably want to cd into the directory from which we launched the job, and make that our current working directory. Note that by default batch jobs begin in the user’s home directory (e.g. /panfs/vol/f/fachaud74)

We need to configure certain display parameters so that any program with a graphical interface launched on our assigned compute node is rendered correctly to our desktop’s (or laptop’s) monitor. The tasks above can be automated with the addition of a few lines of code to your .bash_profile startup file. This can be accomplished with the following command:

 cat /panfs/vol/opt/local/share/bash_profile_addendum >> ~/.bash_profile

Be sure to type this exactly as shown above, and be certain you use the double greater‐than symbols (“>>”) in the command. This step only needs to be performed one time (and NOT every time you log in to the system).

Submitting the Job

Assuming we saved our example job file above as "myInteractive.job", here is what job submission should look like in our terminal session (on our screen):

[fachaud74@raad ~]$ qsub myInteractive.job
qsub: waiting for job 2570.raad-mgmt to start
qsub: job 2570.raad-mgmt ready

-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `module'
[fachaud74@n12 ~]$

[Ignore the two‐line error message regarding “module”; it is harmless.] We have now been connected to the terminal of some compute node assigned by PBS to our job request (node n12 in this case). At this point, we are ready to launch our software package (matlab in our example) as follows:

[fachaud74@n12:~]$ module load matlab
[fachaud74@n12:~]$ matlab

On raad, typically the user will need to “load” a module for any given package they wish to use. This is another mechanism to automate the settings required by a given application. More on the use of the module utility can be found in the section entitled “Using Software Modules”.

Remember, in the batch file we created earlier, we had requested an interactive job with 1 hour and 35 minutes of walltime. Consequently, once 95 minutes have passed, our matlab session will be killed by PBS regardless of whether we have finished our work or not. Therefore, we must be mindful to wrap up and save our results within the requested time limit. Once finished, we would exit from matlab, and then type “exit” at the compute node’s command prompt. This will end our job and place us back at raad’s command prompt.

Submitting Parallel Jobs

While we have seen a batch script for a simple serial (i.e. single-cpu) job, scripts for some parallel jobs need to be constructed differently. Parallel programs are commonly implemented using either the MPI or openMP programming paradigms. Since openMP jobs must be confined within a single node, their job scripts are almost identical to the serial job scripts, except that we increase the number of cpus requested for the job with the "#PBS -l ncpus=x" directive, where x can be up to 16 (or 32 when submitting to the fat queue).

With MPI jobs, however, we need to learn and to employ a new PBS directive to construct our job file. While most aspects of MPI use are beyond the scope of this document, we generally know that MPI jobs can span multiple nodes and utilize hardware resources from those nodes simultaneously to run a single program. How do we tell PBS that we need multiple nodes for our program? Furthermore, how do we specify the number of cpus and the amount of memory required from each node? Sometimes, there is even a need to distribute MPI processes across nodes a certain way so they are not packed together into the fewest possible number of nodes (but "spread out" in a certain way). How would one do that? The PBS select directive can help us with all these needs.

Select Statements and Chunks

Let us start with a simple example, one where our job will actually run within a single node, asking for all of its cpus and all available memory within that node.

 
#PBS -N mpiJob
#PBS -l walltime=120:00:00
#PBS -l select=1:ncpus=16:mpiprocs=16:mem=62gb

cd $PBS_O_WORKDIR
echo "My MPI job is about to start..."

module load intel/mpi/4.1/64bit
mpirun -n 16 ./myParProg.exe

echo "...my MPI job has now finished."

We know the meaning of the other directives; we will focus only on the select directive and how it operates. Note that we will refer to the text of the select directive as the "select statement". Select statements allow us to request a set of resources in so-called "chunks". In our example above, we are requesting a single chunk of resources that consists of 16 cpus, 16 MPI processes, and 62 GB of memory space. The mpiprocs here is not really a resource in the conventional sense. Rather, it is an instruction to PBS that we do not want PBS to spawn more than 16 MPI processes within this chunk of resources. The value of ncpus and mpiprocs should for all practical purposes always be the same. Each MPI process should have its own cpu to run on. The "mpirun -n 16" preceding the program name within the job script is how Intel's MPI tools launch MPI processes. The statement beginning with "module" simply sets the environment variable needed to use the MPI library. The module utility and how it is used is covered elsewhere.

An important concept to remember is that the set of resources specified within a single chunk must be allocated from within a single node. Consequently, PBS can satisfy resource requests for multiple chunks from a single node, but can never satisfy the needs specified within a single chunk from multiple nodes. The select statement itself is constructed as follows: the word "select" is followed by the equal sign (=), followed by an integer denoting the number of chunks being requested, followed by a colon (:) separated list of resources requested per chunk. No spaces are allowed throughout the statement. Each resource is specified by a resource keyword (e.g. ncpus, mem, etc.) followed by an = sign, followed by some value. This colon-separated resource list can be ordered in any way (of course, the integer denoting the number of chunks requested is not part of the resource list and must always be the first thing followed by the "select=").

Multiple Chunks

Now, if we wanted to make this MPI job run on 64 cpus, we would simply request 4 chunks like the one requested above:

 
#PBS -N mpiJob
#PBS -l walltime=120:00:00
#PBS -l select=4:ncpus=16:mpiprocs=16:mem=62gb

cd $PBS_O_WORKDIR
echo "My MPI job is about to start..."

module load intel/mpi/4.1/64bit
mpirun -n 64 ./myParProg.exe

echo "...my MPI job has now finished."

Notice that the number of MPI processes launched with mpirun using the -n switch should equal (and certainly not exceed) the total number of ncpus being requested in the select statement. In the example above, we are requesting 4 chunks, and within each of those chunks 16 cpus. So the total is clearly 4 x 16 = 64, and that is how many MPI processes we should tell mpirun to start. Note that in this example, PBS will allocate resources from 4 different nodes, with one chunk being allocated on each separate node.

Heterogeneous Chunks

What about the case where we need, let's say, 24 cpus? Must we ask for two chunks with 12 cpus each? Or must we ask for 2 chunks of 16 cpus and leave 8 of our allocated cpus unused? Not necessarily. There is another form to the select statement that can help us in this situation. Let's consider the following:

 select=1:ncpus=16:mpiprocs=16:mem=62gb+1:ncpus=8:mpiprocs=8:mem=31gb

In this statement, we join two chunk requests with the + sign in between. This statement will allocate for us all the resources from some single node, and half of the resources from some additional node. We still have two chunks, but both are of different sizes. We see that select can be quite flexible.

Job Placement

The final case we consider shows us the concept of placement when requesting resources from PBS in a select statement. Assume that we wish to run a 32-way MPI program in which, were we to place more than 8 processes on a single node, we might saturate the memory bandwidth of the node. For this reason, we'd like to spread out our processes so that every set of 8 cpus (and associated resources) are allocated from different nodes. We could attempt the following statement:

 select=4:ncpus=8:mpiprocs=8:mem=16gb

However, PBS will attempt to allocate two of the requested chunks from one node, and the two remaining chunks from some other node. That is the default resource allocation behavior, and it is the equivalent of adding the place directive "#PBS -l place=pack" in the job file. However, if we use instead the directive "#PBS -l place=scatter", we can achieve the intended result of scattering our allocations across different nodes, one chunk per node. The common values for the place keyword are "free", "pack" or "scatter". With "free", PBS chooses how to do the placement without being restricted by the rules imposed by the pack or scatter schemes. So here is what our job file could look like:

#PBS -N mpiJob
#PBS -l walltime=120:00:00
#PBS -l select=4:ncpus=8:mpiprocs=8:mem=16gb
#PBS -l place=scatter

cd $PBS_O_WORKDIR
echo "My MPI job is about to start..."

module load intel/mpi/4.1/64bit
mpirun -n 32 ./myParProg.exe

echo "...my MPI job has now finished."