
Using the HKU CS GPU Farm (Advanced)

Sections on this page:

  * Allocating More Time and GPUs in an Interactive Session
  * Running Batch Jobs
  * Using RTX3090 GPUs
  * Further Information

Allocating More Time and GPUs in an Interactive Session

If you examine the gpu-interactive command script in /usr/local/bin on a gateway node, you will find that it actually calls the srun command of the SLURM system to allocate the session:

srun --gres=gpu:1 --pty --mail-type=ALL bash

The default time limit of a session is 6 hours, i.e., all processes started in a session will be terminated after 6 hours even if the user has not logged out. To set a longer time limit, e.g., 12 hours, use the --time option:

srun --gres=gpu:1 --time=12:00:00 --pty --mail-type=ALL bash

The maximum time limit for an interactive session is 18 hours. If more time is needed, use sbatch (see the section below).

The GPU farm is configured to allocate 4 CPU cores and 40GB of system (not GPU) RAM per GPU. To have more GPUs, CPU cores, and RAM allocated, use the srun command with the appropriate --gres=gpu and --cpus-per-task parameters, e.g.:

srun --nodes=1 --gres=gpu:2 --pty --mail-type=ALL bash

The above command requests a session with 2 GPUs, and implicitly 8 CPU cores and 80GB of system RAM, on one node (server). GPU and CPU time quotas will be deducted accordingly.
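Once the session starts, you can check what was actually allocated. A minimal check, assuming nvidia-smi is available on the compute node (as is normal for GPU nodes) and using SLURM's standard environment variables:

echo "Job $SLURM_JOB_ID: $SLURM_CPUS_ON_NODE CPU cores"
nvidia-smi -L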

Our RTX4080 servers are currently set up to support a maximum of 4 GPUs, 16 CPU cores, and 160GB RAM in a single session.

To prevent users from occupying too many resources and exhausting their quotas unintentionally, each user is limited to 2 GPUs and 8 CPU cores concurrently. This limit should be sufficient for most AI assignments. Users who need more concurrent resources may contact support@cs.hku.hk with supporting reasons.

Please make sure that the software that you use supports multiple GPUs before requesting more than one GPU in a session. Otherwise your time quota will be wasted.
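For example, a quick way to confirm that your framework sees all allocated GPUs before starting a long run (a sketch assuming a conda environment with TensorFlow installed, as in the batch example below):

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If the printed list contains fewer devices than you requested, check your framework's multi-GPU configuration before consuming more quota.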

Running Batch Jobs

If your program does not require user interaction during execution, you can submit it to the system in batch mode. The system will schedule your job in the background, so you do not need to keep a terminal session open on a gateway node to wait for the output. A SLURM partition/queue named batch is set up with a maximum time limit of 7 days. To submit a batch job:

  1. Create a batch file, e.g., my-gpu-batch, with the following contents:
    #!/bin/bash

    # Tell the system the resources you need. Adjust the numbers according to your need, e.g.
    # -p batch - use partition/queue named batch
    # --time=24:00:00 - set a time limit of 24 hours
    #SBATCH -p batch --nodes=1 --gres=gpu:1 --cpus-per-task=4 --time=24:00:00 --mail-type=ALL

    # If you use Anaconda, initialize it
    . $HOME/anaconda3/etc/profile.d/conda.sh
    conda activate tensorflow


    # cd to your desired directory and execute your program, e.g.
    cd _to_your_directory_you_need
    _run_your_program_

  2. On a gateway node (gpu2gate1 or gpu2gate2), submit your batch job to the system with the following command:
    sbatch my-gpu-batch

    Note the job id displayed. The output of your program will be saved in slurm-<job id>.out.
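While a batch job is running, you can watch its output file grow from a gateway node with the standard tail command (replace <job id> with the id reported by sbatch):

tail -f slurm-<job id>.out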

An email will be sent to you when your job starts and ends.
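By default SLURM sends these notifications to the user who submitted the job. If you prefer another address, SLURM's standard --mail-user option accepts one (whether an external address is deliverable depends on the farm's mail setup, so treat this as optional):

#SBATCH --mail-type=ALL --mail-user=your_address@example.com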

Use "squeue -u $USER" to see the status of your jobs in the system queue.

To cancel a job, note its job id from the output of "squeue -u $USER" and use 'scancel <job id>'.
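For more detail than squeue shows (assigned node, time limit, reason a job is pending, etc.), SLURM's scontrol command prints the full job record:

scontrol show job <job id>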

The concurrent CPU and GPU limits (2 GPUs/8 CPU cores) also apply to batch jobs. If you need to run multiple batch jobs concurrently, please contact support@cs.hku.hk for a temporary increase of your concurrent limit.

Using RTX3090 GPUs

For users who need more GPU and system memory, a small number of RTX3090 GPUs with 24GB of GPU memory each, connected in pairs with NVLink bridges, are available.

To start a session with an RTX3090 GPU, use the command line option '-p q-3090' with the srun and sbatch commands. For example, to request an interactive session with one RTX3090 GPU:

srun -p q-3090 --gres=gpu:1 --pty --mail-type=ALL bash

The session will have one RTX3090 GPU, 8 CPU cores, and 112GB of system RAM.

To request a session with 2 RTX3090 GPUs connected with an NVLink bridge (and 16 CPU cores and 224GB RAM implicitly):

srun -p q-3090 --gres=gpu:2 --pty --mail-type=ALL bash

The default and maximum time limits of an RTX3090 session are also 6 hours and 18 hours, respectively. To have a longer session, use '-p q-3090-batch' with sbatch. For example, the following #SBATCH directive instructs the sbatch command to run a job in a 24-hour session, and a reminder email will be sent when 80% of the time limit is reached:

#!/bin/bash
#SBATCH -p q-3090-batch --gres=gpu:2 -t 24:00:00 --mail-type=ALL,TIME_LIMIT_80
your_script_starts_here 
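To confirm inside the session that the two allocated RTX3090 GPUs are indeed paired over NVLink, nvidia-smi can print the interconnect topology (assuming nvidia-smi is available on the node, as is normal for GPU nodes):

nvidia-smi topo -m

GPUs connected by NVLink are marked NV# in the matrix instead of a PCIe-only link.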

Further Information

Please visit the official site of the SLURM Workload Manager (https://slurm.schedmd.com/) for further documentation on using SLURM.

Division of AI & Data Science, School of Computing and Data Science
Rm 207 Chow Yei Ching Building
The University of Hong Kong
Pokfulam Road, Hong Kong

Email: aienq@hku.hk
Telephone: 3917 3146
