RAAD-2 -- Beta



Introduction

The new cluster, named "Raad-2", is a Linux-based system built on Cray cutting-edge blade servers with a total of 4,128 traditional CPU cores of the Intel Haswell architecture. Each compute node contains two 12-core processors and 128GB of RAM. The interconnect fabric is the Cray Aries network with a Dragonfly topology, providing premier performance for massively parallel jobs that rely on the Message Passing Interface (MPI). The system runs SLURM as its workload manager to efficiently allocate and manage compute resources.

RAAD-2 is paired with an 800TB parallel storage system from DDN. The DDN EXAScaler (Lustre) system is connected to the cluster's Aries network, allowing research applications to access and process large amounts of data at high speed. The system's peak aggregate read bandwidth is 16GB/s, while a single client's peak concurrent bandwidth is 2.5GB/s read and 2.5GB/s write.

Accessing RAAD-2

As a beta user of our new machine, you should have received an introductory email with your login credentials and getting-started instructions.

Login to machine

You can log in to the system using any SSH client. If you are accessing the machine from outside campus, make sure you first connect to the TAMUQ network via VPN.

ssh <user_name>@raad2b.qatar.tamu.edu
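
Optionally, you can add a shortcut entry to the SSH configuration on your local machine so you do not have to type the full hostname each time. This is just a sketch; the alias "raad2b" is arbitrary, and you should replace <user_name> with your own account name. The VPN requirement still applies from off campus.

# Add to ~/.ssh/config on your local machine
Host raad2b
    HostName raad2b.qatar.tamu.edu
    User <user_name>

After saving this, "ssh raad2b" is enough to connect.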

Change password

Once you have successfully logged in, and before you submit any job, you should change your password. This password is NOT associated with your TAMUQ domain account or your RAAD-1 password. In the "Old Password" field, enter the current password that was emailed to you. As you type your current and new passwords, you will notice that the cursor does not move; this is absolutely normal, so simply continue typing your password and hit Enter.

mustarif63@raad2b:~> passwd
Changing password for mustarif63.
Old Password:
New Password:
Reenter New Password:
Changing NIS password for mustarif63 on raad2-smw.
Password changed.
mustarif63@raad2b:~>

Locate home directory

Your home directory is mounted from /lustre/home/<user_name> on the shared storage system. When you log in to the machine, you land in your home directory by default.

mustarif63@raad2b:~> pwd
/lustre/home/mustarif63
mustarif63@raad2b:~>

Applications available on the system

Numerous applications and compilers come pre-installed on our new system. You can list all available applications/compilers and their versions using the simple command below:

mustarif63@raad2b:~> module avail

Note: Not all compilers that appear in this list are necessarily relevant to our system; some are part of the default Cray configuration. As we move forward, we will make sure only relevant compilers appear in this list.

To set up a specific compiler in your environment, issue the command below:

mustarif63@raad2b:~> module load <module_name>
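
For example, you could check which modules are already loaded and then load an additional one. The module name below is only illustrative; use a name that actually appears in your 'module avail' output.

mustarif63@raad2b:~> module list
mustarif63@raad2b:~> module load cray-netcdf
mustarif63@raad2b:~> module list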

Submitting Jobs

RAAD-2 runs SLURM for workload management and job scheduling. Job submission commands that worked on RAAD-1 won't work here, and you will need to modify your job submission scripts to be compatible with SLURM. Here are a few SLURM commands to help you get started.

View available Queues

In SLURM, queues are known as partitions. You can issue 'sinfo' to list the partitions available on the system. In the output below you can see one partition named "workq", which has 32 nodes in the "idle" state (available for computation). Refer to the sinfo Help Page for more details.

mustarif63@raad2b:~> sinfo
PARTITION AVAIL JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES STATE      NODELIST
workq*    up    1-infini 1-00:00:00     48 2:12:2     140 maint      nid00[040-063,200-255,388-447]
workq*    up    1-infini 1-00:00:00     48 2:12:2      32 idle       nid000[08-39]

Create Job File

A sample SLURM job file looks like the one below:

#!/bin/sh
#SBATCH -J DemoJob
#SBATCH --time=00:30:00
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks=4
#SBATCH --exclusive
#SBATCH --output=DemoJob.o%j
#SBATCH --error=DemoJob.o%j
srun --ntasks=4 --tasks-per-node=4 hostname
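
The script above simply runs 'hostname' on 4 tasks. As a sketch of how the same structure extends to a real parallel application, the script below requests two full nodes (2 x 12 cores each on RAAD-2) and launches a hypothetical MPI executable called my_mpi_app; the executable name and resource numbers are placeholders you should adapt to your own workload.

#!/bin/sh
#SBATCH -J MpiDemo
#SBATCH --time=01:00:00
# Two nodes, one MPI rank per physical core (2 x 12 cores per node)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks=48
#SBATCH --exclusive
#SBATCH --output=MpiDemo.o%j
#SBATCH --error=MpiDemo.o%j
# my_mpi_app is a placeholder; substitute your own MPI binary
srun --ntasks=48 ./my_mpi_app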

Submit a job file

mustarif63@raad2b:~> sbatch MySlurm.job
Submitted batch job 973

More information on sbatch can be found on the sbatch Help Page.
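
Once the job has finished, its output is written to the file named by the --output directive. For the DemoJob script above and the example job ID 973, that file would be DemoJob.o973.

mustarif63@raad2b:~> cat DemoJob.o973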

List running jobs

squeue
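
By default, squeue lists all jobs on the system. To see only your own jobs, you can filter by username:

mustarif63@raad2b:~> squeue -u $USER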

More information on squeue can be found on the squeue Help Page.

List running job info

scontrol show job <job_id>

Delete running job

scancel <job_id>
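
scancel also accepts a username filter, which cancels all of your own running and pending jobs at once, so use it with care:

mustarif63@raad2b:~> scancel -u $USER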