Submitting Jobs



Interactive Jobs

Submitting an interactive job to the GPU cluster gives you direct terminal access to a GPU node, where you can test and develop your code before submitting it as a batch job.

To launch an interactive job, issue the "sinteractive" command:

abtakid91@raad2-gfx:~$ sinteractive
abtakid91@gfx1:~$

You will notice that the prompt has changed from raad2-gfx to gfx1. This means that you are now on a GPU node.
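Before loading any software, you can confirm that a GPU is actually visible on the node, for example with nvidia-smi; the node's Tesla V100 should appear in the listing.

abtakid91@gfx1:~$ nvidia-smi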

Interactive Python Job

Load python

abtakid91@gfx1:~$ module load python36
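If you have not created the dlproject environment yet, a minimal sketch along these lines should work once the python36 module is loaded (the Python version here is only a placeholder; install whatever packages your project needs):

abtakid91@gfx1:~$ conda create -n dlproject python=3.6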

Let us activate the sample dlproject virtual environment we created and start testing:

abtakid91@gfx1:~$ conda activate dlproject
(dlproject) abtakid91@gfx1:~$ python dl.py
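If your environment contains a GPU-enabled framework such as PyTorch (an assumption here; your dl.py may use something else), a quick one-liner can confirm that the GPU is visible from Python before you run the full script:

(dlproject) abtakid91@gfx1:~$ python -c "import torch; print(torch.cuda.is_available())"
True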

Once you have confirmed that the code runs correctly, deactivate your virtual environment and exit to the login node; you are then ready to make a batch submission.

To deactivate your virtual environment:

(dlproject) abtakid91@gfx1:~$ conda deactivate

To exit to the login node:

abtakid91@gfx1:~$ exit

Interactive CUDA Job

A sample CUDA source file is available at "/lustre/share/examples/gpu/add.cu". A very good introductory C++/CUDA tutorial covering this example is available online.

1. Copy the sample CUDA code to your home directory

muarif092@raad2-gfx:~$ cp /lustre/share/examples/gpu/add.cu .

2. Submit an interactive job to compile the CUDA code

muarif092@raad2-gfx:~$ sinteractive
muarif092@gfx1:~$

3. Load CUDA modules

muarif092@gfx1:~$ module load cuda
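If you are unsure which CUDA versions are installed, the module system can list them before you load one (module names may differ slightly from what is shown elsewhere on this page):

muarif092@gfx1:~$ module avail cuda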

4. Compile the sample code using the NVIDIA CUDA Compiler (nvcc)

muarif092@gfx1:~$ which nvcc
/cm/shared/apps/cuda90/toolkit/9.0.176/bin/nvcc
muarif092@gfx1:~$ nvcc add.cu -o add_cuda
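The plain invocation above is sufficient for this example. Optionally, you can also tell nvcc to target the V100's compute capability (sm_70) explicitly, for example:

muarif092@gfx1:~$ nvcc -arch=sm_70 add.cu -o add_cuda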

5. Run the executable

muarif092@gfx1:~$ ./add_cuda
Max error: 0

6. Profile your code

muarif092@gfx1:~$ nvprof ./add_cuda
==369808== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  4.9192ms         1  4.9192ms  4.9192ms  4.9192ms  add(int, float*, float*)
...
...
==369808== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  796.7360us  Host To Device
....
...

7. Exit from Interactive Job

muarif092@gfx1:~$ exit

Batch Jobs

Sample Slurm Job file for Python

To run the sample job, you have to:

  1. Copy the sample job file to your working directory.
    abtakid91@raad2-gfx:~$ cp /lustre/share/examples/gpu/gpu.job .
    
  2. In the job file, on line 11, change <env_name> to the name of your virtual environment (e.g. dlproject).
  3. In the job file, on line 15, change myapp.py to the name of your Python file (e.g. dl.py).

Your sample Python job file will then look like this:

#!/bin/bash
#SBATCH -J batch                  # job name
#SBATCH --time=24:00:00           # wall-clock time limit
#SBATCH --ntasks=18               # number of tasks
#SBATCH --gres=gpu:v100:1         # request one V100 GPU

module load cuda90/toolkit
source /cm/shared/apps/anaconda3/etc/profile.d/conda.sh   # make conda available in the batch shell
conda activate dlproject

export OMP_NUM_THREADS=18         # match the number of tasks requested above

srun --ntasks=1 python dl.py

You can then submit it with sbatch:

abtakid91@raad2-gfx:~$ sbatch gpu.job
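After submission you can monitor the job with the usual Slurm commands; squeue shows its state and scancel cancels it if needed (replace <jobid> with the ID printed by sbatch):

abtakid91@raad2-gfx:~$ squeue -u $USER
abtakid91@raad2-gfx:~$ scancel <jobid>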

Sample Slurm Job file for CUDA

Below is a sample Slurm job file for a CUDA program. The source file add.cu can be found at "/lustre/share/examples/gpu/add.cu".

#!/bin/bash
#SBATCH -J batch
#SBATCH --time=24:00:00
#SBATCH --ntasks=18
#SBATCH --gres=gpu:v100:1

module load cuda90/toolkit

export OMP_NUM_THREADS=18
srun --ntasks=1  nvcc add.cu -o add_cuda
srun --ntasks=1 ./add_cuda

Now submit the batch job:
abtakid91@raad2-gfx:~$ sbatch gpu.job

The output of this job will be placed in the same directory.
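By default, Slurm writes the job's standard output to a file named slurm-<jobid>.out in the directory from which the job was submitted; you can inspect it once the job finishes, or choose your own file name by adding an #SBATCH -o directive to the job file. For example:

abtakid91@raad2-gfx:~$ cat slurm-<jobid>.out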