RAAD-2 -- Beta
Introduction
The new cluster, named "RAAD-2", is a Linux-based system equipped with cutting-edge Cray blade servers, with a total of 4,128 traditional CPU cores of the Intel Haswell architecture. Each compute node contains 2 processors with 12 cores each and 128GB of RAM. The interconnect fabric is the Cray Aries network with Dragonfly topology, providing a premier level of performance for massively parallel jobs that rely on message passing (MPI). The system runs SLURM as its workload manager to efficiently allocate and manage compute resources.
RAAD-2 is paired with an 800TB parallel storage system from DDN. The DDN EXAScaler (Lustre) system is connected to the cluster's Aries network, allowing research applications to access and process large amounts of data at high speed. The system's peak aggregate read bandwidth is 16GB/s, while a single client peaks at 2.5GB/s read and 2.5GB/s write concurrently.
Accessing RAAD-2
As a beta user of our new machine, you should have received an introductory email with your login credentials and getting-started instructions.
Login to machine
You can log in to the system using any SSH client, but if you are accessing the machine from outside campus, make sure to first connect to the TAMUQ network via VPN.
ssh <user_name>@raad2b.qatar.tamu.edu
Change password
Once you have successfully logged in to the machine, and before you submit any job, you should change your password. This password is NOT associated with your TAMUQ domain account or your RAAD-1 password. In the "Old Password" field, enter the current password that was emailed to you. As you type your current password and new password, you will notice the cursor does not move; this is absolutely normal, so continue entering your password and press Enter.
mustarif63@raad2b:~> passwd
Changing password for mustarif63.
Old Password:
New Password:
Reenter New Password:
Changing NIS password for mustarif63 on raad2-smw.
Password changed.
mustarif63@raad2b:~>
Locate home directory
Your home directory is mounted at /lustre/home/<user_name> from the shared storage system. When you log in to the machine, you land in your home directory by default.
mustarif63@raad2b:~> pwd
/lustre/home/mustarif63
mustarif63@raad2b:~>
Applications available on the system
Numerous applications and compilers come pre-installed on our new system. You can list all apps/compilers and their versions using the simple command below:
mustarif63@raad2b:~> module avail
Note: Not all compilers that appear in the list are necessarily relevant to our system; some are part of the default Cray configuration. Moving forward, we will make sure only relevant compilers appear in this list.
To set up a specific compiler in your environment, issue the command below:
mustarif63@raad2b:~> module load <module_name>
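Beyond avail and load, the module tooling shipped on Cray systems typically supports a few other handy subcommands. A quick sketch, assuming the standard Environment Modules interface:

```shell
# List the modules currently loaded in your session
module list

# See what a module would change in your environment before loading it
module show <module_name>

# Swap one loaded module for another (e.g. switch compiler environments)
module swap <old_module> <new_module>

# Remove a module from your environment
module unload <module_name>
```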
Submitting Jobs
RAAD-2 runs SLURM for workload management and job scheduling. Job submission commands that worked on RAAD-1 won't work here, and you will need to modify your job submission scripts to be compatible with SLURM. Here are a few SLURM commands that can help you get started.
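For reference, the rough SLURM equivalents of common PBS-style commands are listed below. This is only a quick orientation sketch, assuming RAAD-1 used a PBS/Torque-style scheduler:

```shell
# PBS/Torque             SLURM equivalent
# qsub job.pbs    ->     sbatch MySlurm.job     # submit a batch job
# qstat           ->     squeue                 # list queued/running jobs
# qdel <job_id>   ->     scancel <job_id>       # cancel a job
# qstat -Q        ->     sinfo                  # view queues/partitions
```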
View available Queues
In SLURM, queues are known as partitions. You can issue 'sinfo' to list the available partitions on the system. Below you can see one partition named "workq", which has 32 nodes in the "idle" state (available for computation). Refer to the sinfo Help Page for more details.
mustarif63@raad2b:~> sinfo
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
workq* up 1-infini 1-00:00:00 48 2:12:2 140 maint nid00[040-063,200-255,388-447]
workq* up 1-infini 1-00:00:00 48 2:12:2 32 idle nid000[08-39]
Create Job File
A sample SLURM job file looks like the one below:
#!/bin/sh
#SBATCH -J DemoJob
#SBATCH --time=00:30:00
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks=4
#SBATCH --exclusive
#SBATCH --output=DemoJob.o%j
#SBATCH --error=DemoJob.o%j
srun --ntasks=4 --ntasks-per-node=4 hostname
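For a real MPI application (rather than hostname), a job file might look like the sketch below. The application name my_mpi_app and the resource counts are placeholders, and on Cray systems srun is what launches the MPI ranks:

```shell
#!/bin/sh
#SBATCH -J MpiDemo
#SBATCH --time=01:00:00
#SBATCH --ntasks=48              # total MPI ranks (placeholder value)
#SBATCH --ntasks-per-node=24     # ranks per node (placeholder value)
#SBATCH --output=MpiDemo.o%j
#SBATCH --error=MpiDemo.e%j

# Load the environment your application was built with (hypothetical module)
module load <module_name>

# srun launches the MPI ranks across the allocated nodes
srun ./my_mpi_app
```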
Submitting job file
mustarif63@raad2b:~> sbatch MySlurm.job
Submitted batch job 973
More information on sbatch can be found on the sbatch Help Page.
List running jobs
squeue
More information on squeue can be found on the squeue Help Page.
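On a shared system, squeue's filter and format flags are handy for narrowing the listing down. A small sketch using standard SLURM flags:

```shell
# Show only your own jobs
squeue -u <user_name>

# Show only jobs in a given partition
squeue -p workq

# Long format with more detail per job
squeue -l
```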
List running job info
scontrol show job <job_id>
Delete running job
scancel <job_id>
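scancel can also cancel jobs in bulk rather than one at a time. A quick sketch using standard SLURM flags:

```shell
# Cancel a single job by ID
scancel <job_id>

# Cancel all of your own jobs
scancel -u <user_name>

# Cancel only your pending (not yet running) jobs
scancel -u <user_name> --state=PENDING
```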