Getting started with RAAD2



Introduction

The new cluster, named "raad2", is a Linux-based system from the vendor Cray with a total of 4,128 traditional CPU cores of the Intel Haswell architecture. Each of its 172 compute nodes contains 24 physical CPU cores (2 processor sockets with 12 cores per socket) and 128GB of RAM. The interconnect is the Cray Aries network, which carries both MPI and storage traffic. Raad2 runs SLURM as its workload manager and is paired with a storage system from the vendor DDN that uses the Lustre parallel filesystem, providing 800TB of usable disk capacity accessible from all nodes.

Accessing Raad2

Once your account is approved, you will receive an email notification containing temporary login credentials and some basic guidance on how to access the system.

Login to machine

You may log in to the system using any SSH client, but make sure to connect to the TAMUQ network over VPN if you are accessing the machine from outside the TAMUQ building.

ssh <user_name>@raad2.qatar.tamu.edu

Note that the system login nodes are actually named "raad2a" and "raad2b" and you will automatically be directed to one or the other of these; it does not matter which login node you land on, or if you land on different ones in different login sessions -- both are configured identically and you may treat them as one and the same, even though they are two distinct physical servers.

Change password

You will be prompted to change your initial password on your first login to raad2. Note that unlike on raad, your raad2 account password is NOT linked to your TAMUQ domain account, although you may choose to set it to the same string as that other password. As you type your current and new passwords, you will notice that the cursor does not move; this is normal, so simply continue typing the password string and press Enter when done.
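
For reference, the exchange typically looks something like the following; the exact prompt wording varies with the system, so treat this only as an illustration:

<user_name>@raad2.qatar.tamu.edu's password:
You are required to change your password immediately
Current Password:
New Password:
Retype new password: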

Locate home directory

Your home directory is located at /lustre/home/<user_name> and resides on the shared storage system. When you log in to the machine, you land in this location by default.

mustarif63@raad2b:~> pwd
/lustre/home/mustarif63
mustarif63@raad2b:~>

Applications available on System

You can make TAMUQ-specific applications available in your environment with the following command:

mustarif63@raad2b:~> module use /lustre/opt/modulefiles/tamuq

A number of applications and compilers come pre-installed on the system. You can list all of them, along with their versions, using the simple command below:

mustarif63@raad2b:~> module avail

Note: Not all compilers that appear in the list are necessarily relevant to our system; you will probably be most interested in the Cray/Intel/PGI compilers and open-source packages such as FFTW, HDF5, and Python. These packages are available and can be seen with the module avail command.

To set up any specific compiler/package in your environment, issue the following:

mustarif63@raad2b:~> module load <module_name>
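
For example, to bring the FFTW library into your environment and then confirm what is loaded (the module name cray-fftw is only an illustration; check module avail for the exact names and versions on the system):

mustarif63@raad2b:~> module load cray-fftw
mustarif63@raad2b:~> module list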

Submitting Jobs

Raad2 runs the SLURM workload manager for job scheduling. Functionally it is similar to the PBS workload manager that users may be familiar with from raad, but it uses a different syntax to formulate resource requests and to specify scheduling and placement needs. Note that where PBS referred to "queues", SLURM uses the term "partitions" for the same concept.
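
For quick reference, the everyday PBS commands from raad map onto the SLURM commands used later in this guide roughly as follows:

PBS (raad)                  SLURM (raad2)
qsub <job_file>             sbatch <job_file>
qstat                       squeue
qstat -f <job_id>           scontrol show job <job_id>
qdel <job_id>               scancel <job_id>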

Partition Configuration and Resource Limits

Partition   QOS   Per-Partition CPU Limit   Per-User CPU Limit   Per-Job CPU Limit   Max WallTime
s_short     ss    144                       24                   8                   08:00:00
s_long      sl    456                       48                   16                  168:00:00
s_debug     sd    48                        24                   --                  04:00:00
l_short     ls    144                       48                   --                  08:00:00
l_long      ll    3000                      360                  --                  168:00:00
express     ex    96                        48                   --                  01:00:00

The partitions are divided into two broad categories: the "small" partitions carry an "s_" prefix in their names, and the "large" partitions carry an "l_" prefix. "Small" and "large" here refer to the relative maximum size of jobs allowed to run within each partition, where size means the number of CPU cores requested by the job. The second portion of every partition name hints at the relative maximum walltime for jobs running within that partition: the "short" partitions run the relatively shorter jobs and the "long" partitions run the relatively longer jobs. The s_debug partition is meant primarily for testing the viability and correctness of job files, and for running short test calculations before submitting workloads to the s_* production partitions. For jobs destined for the l_* partitions, the l_short partition may be used for this purpose, in addition to running production jobs.

The small partitions are meant for users running small jobs. We have defined small as "requiring anywhere from 1 to 8 cores" in the s_short partition or "requiring anywhere from 1 to 16 cores" in the s_long partition. In all cases, jobs submitted to the small partitions must fit *within* a single node, and can never span multiple nodes. Furthermore, all small jobs run on a set of nodes that are "sharable"; in other words, each of these nodes is able to run multiple jobs from multiple users simultaneously -- just like on raad.
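
As a sketch of how such a request might be expressed (the job name and the program ./my_app are placeholders for illustration), a 16-core job confined to a single node in s_long could use a header like the following; the --gres=craynetwork:0 flag is explained in the Serial Jobs section further below:

#!/bin/bash
#SBATCH -J SmallDemo
#SBATCH -p s_long                # "small" partition: the job must fit within one node
#SBATCH --qos=sl                 # QOS associated with s_long
#SBATCH -N 1                     # a single node; small jobs can never span nodes
#SBATCH --ntasks=16              # up to the 16-core per-job limit of s_long
#SBATCH --gres=craynetwork:0     # see the "Serial Jobs" section below
#SBATCH --time=24:00:00

srun --ntasks=16 ./my_app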

The large partitions, on the other hand, only allocate whole nodes to users, and not individual cores. No two users running in one of the large partitions should be landing on the same node simultaneously; these nodes are for exclusive use by any given user and are NOT "sharable". However, a single user may submit multiple jobs to a single node if they so desire. Note that the set of nodes that service the large partitions is distinct from the set that services the small partitions, and there is no overlap between the two sets.

How to View Available Partitions

In SLURM, queues are known as "partitions". You can issue the 'sinfo' command to list available partitions and to query their current state. In the summary output below, the NODES(A/I/O/T) column shows the count of allocated/idle/other/total nodes in each partition. Refer to the sinfo Help Page for more details.


mustarif63@raad2b:~> sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
s_short      up    8:00:00        0/14/0/14  nid00[058-063,200-207]
s_long       up 7-00:00:00        0/14/0/14  nid00[058-063,200-207]
s_debug      up    4:00:00          0/2/0/2  nid00[208-209]
l_short      up    8:00:00       28/14/6/48  nid000[09-11,13-57]
l_long       up 7-00:00:00       28/14/6/48  nid000[09-11,13-57]

Create Job File

A sample SLURM job file looks like the following:

#!/bin/sh
#SBATCH -J DemoJob              # job name
#SBATCH -p l_long               # partition to submit to
#SBATCH --qos=ll                # QOS associated with the l_long partition
#SBATCH --time=00:30:00         # walltime limit (hh:mm:ss)
#SBATCH --ntasks=24             # number of tasks (cores) requested
#SBATCH --output=DemoJob.o%j    # stdout file; %j expands to the job ID
#SBATCH --error=DemoJob.e%j     # stderr file
srun --ntasks=24 ./a.out        # launch the executable with 24 tasks

Submitting job file

mustarif63@raad2b:~> sbatch MySlurm.job
Submitted batch job 973

More information on sbatch can be found on the sbatch Help Page

List running jobs

squeue
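
For illustration, the default output for the DemoJob submitted above might look roughly like this (column widths and truncation will vary); you can restrict the listing to your own jobs with squeue -u <user_name>:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    973    l_long  DemoJob mustarif  R       0:12      1 nid00042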

More information on squeue can be found on the squeue Help Page

List running job info

scontrol show job <job_id>

Delete running job

scancel <job_id>

Serial Jobs

Serial jobs, and more generally jobs that do not span multiple nodes, should be submitted with the parameter --gres=craynetwork:0 to avoid wasting resources. Each Cray node exposes 4 "craynetwork" resources, which exist to support parallel jobs that communicate across nodes. If you do not specify this parameter at all, the Cray job-submit plugin adds a request for one craynetwork resource by default, on the assumption that the job will need the network; as a result only 4 such jobs can be scheduled per node and the remaining cores on that node are wasted. To avoid this scenario, all users submitting to the small partitions (s_long, s_short, s_debug) should specify --gres=craynetwork:0, which tells the Cray submit plugin that the job does not require a network resource for node-to-node communication.
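
The following is a minimal sketch of a serial job file using this flag (the job name and the program ./my_serial_app are placeholders for illustration):

#!/bin/bash
#SBATCH -J SerialDemo
#SBATCH -p s_short               # small partition for short serial work
#SBATCH --qos=ss                 # QOS associated with s_short
#SBATCH -N 1                     # a serial job occupies a single node
#SBATCH --ntasks=1               # one task, i.e. one core
#SBATCH --gres=craynetwork:0     # no inter-node network resource needed
#SBATCH --time=02:00:00

./my_serial_app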

Interactive Jobs

To start an interactive session on a compute node, first connect to the raad2-login2 node and then request a shell via srun with the --pty option:

ssh raad2-login2
srun --pty --qos sd -p s_debug --gres=craynetwork:0 --time=00:30:00 /bin/bash
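
When the allocation is granted you will be placed in a shell on one of the compute nodes; type exit to end the session and release the allocated resources.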

MPI Jobs

Cray Compiler & Cray MPI

demo_mpi.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d"
           " out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

slurm.job

#!/bin/bash
#SBATCH -J mpi_tester
#SBATCH -p express
#SBATCH --qos ex
#SBATCH --time=00:05:00
#SBATCH -N 2                       # two whole nodes
#SBATCH --ntasks-per-node=24       # all 24 physical cores on each node
#SBATCH --hint=nomultithread       # place tasks on physical cores only

srun -n 48 ./demo_mpi.out          # 2 nodes x 24 tasks = 48 MPI ranks

Compilation and Submission Process

Step 1. Compile your C code

Make sure that the PrgEnv-cray and cray-mpich modules are loaded by checking the output of the `module list` command.
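
If either of them is missing, it can be loaded with the module command covered earlier. The line below is only a sketch: on Cray systems a different programming environment (e.g. PrgEnv-intel) may already be loaded, in which case you would swap it out with module swap rather than load PrgEnv-cray on top of it.

muarif092@raad2a:~> module load PrgEnv-cray cray-mpich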

muarif092@raad2a:~> cc -o demo_mpi.out demo_mpi.c

Step 2. Make a sample job file and submit to raad2

Copy the contents of the `slurm.job` example above into a file named `slurm.job`, then submit the job:

muarif092@raad2a:~> sbatch slurm.job
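
Once the job finishes, its standard output and error will appear in a file named slurm-<job_id>.out in the submission directory (SLURM's default naming), since this job file does not redirect them with --output/--error as the earlier DemoJob example does.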