HKUCDS GPU Farm for Research (Phase 3)
(The reader of this page is assumed to have read the Quick Start and Advanced Use pages.)
Introduction
HKUCDS GPU Farm for Research (Phase 3) is a SLURM cluster of servers with NVIDIA RTX4090 (24GB) and H800 SXM (80GB) GPUs. It is available to staff, PhD and MPhil students of the School of Computing and Data Science.
A user account can be applied for at https://intranet.cs.hku.hk/gpufarm3_acct/. After your account is created, you may log in to the gateway node gpu3gate1.cs.hku.hk with SSH:
ssh <your_username>@gpu3gate1.cs.hku.hk
Note: the user accounts and home directories are independent of Phases 1 and 2 and are not shared.
Running a Session with one GPU
The following SLURM partitions are defined in the HKUCDS GPU Farm for Research (Phase 3):
| Partition | GPU type | No. of GPUs per server | Default CPU cores per GPU | Default server RAM per GPU | Default time limit | Maximum time limit | Remarks |
|---|---|---|---|---|---|---|---|
| debug (default) | RTX4090 (24GB) | 2, 8, 10 | 4 | 96GB | 6 hours | 7 days | |
| q-h800 | H800 (80GB) | 8 | 4 | 240GB | 6 hours | 2 days | at most one job at a time |
| q-hgpu-batch | H100/H800 (80GB) | 8 | 4 | 240GB | 2 days | 7 days | sbatch jobs only |
After logging in to the gateway node, a GPU session can be started with srun, e.g.,
srun --gres=gpu:1 --mail-type=ALL --pty bash
The default SLURM partition (debug) allocates RTX4090 GPUs. 4 CPU cores and 96GB of system RAM are allocated with each GPU. To have a session with 2 GPUs:
srun --nodes=1 --gres=gpu:2 --mail-type=ALL --pty bash
By default, each user account can request up to 4 GPUs concurrently. The limit can be raised on request.
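Once a session starts, you can verify what has been allocated; a quick check using the standard NVIDIA and SLURM tooling on the GPU node:
# list the GPUs visible to this session
nvidia-smi
# SLURM restricts the session to the allocated GPUs via this variable
echo $CUDA_VISIBLE_DEVICES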
Specifying a longer time limit
A job will be terminated when its time limit is reached. Use '-t' to specify a time limit longer than the default. For example, to have a time limit of 12 hours:
srun --nodes=1 --gres=gpu:2 -t 12:00:00 --mail-type=ALL --pty bash
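While a job is running, squeue can show its elapsed and remaining time. A sketch using standard squeue format fields (%M is the time used, %L is the time left before the limit):
squeue -u $USER -o "%.12i %.9P %.8T %.10M %.10L"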
Running a Session with one H800 GPU
To get a session with an H800 GPU, use the q-h800 partition by adding '-p q-h800' to srun or sbatch, e.g.,
srun -p q-h800 --gres=gpu:1 --mail-type=ALL --pty bash
4 CPU cores and 240GB of system RAM are allocated with each H800 GPU.
Running a Session with 2 H800 GPUs
srun -p q-h800 --nodes=1 --gres=gpu:2 --mail-type=ALL --pty bash
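Once the session starts, nvidia-smi can confirm that both GPUs are visible, and its topology matrix shows how they are interconnected (on H800 SXM nodes, typically via NVLink):
nvidia-smi topo -m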
Submitting Batch Jobs
If your program runs for days and does not require user interaction during execution, you can submit it to the system in batch mode. The system will schedule your job to run when the requested GPUs become available.
To submit a batch job from the gateway node gpu3gate1,
- Create a batch file, e.g., my-gpu-batch, with the following contents:
#!/bin/bash
# Tell the system the resources you need. Adjust the numbers according to your need.
# Specify the partition to use and the GPUs needed with the -p and --gres options, e.g.,
# '--gres=gpu:4' for four RTX4090 GPUs
# '-p q-hgpu-batch --gres=gpu:2' for two H100 or H800 GPUs
# '-p q-hgpu-batch --gres=gpu:h100:2' for two H100 GPUs
# '-p q-hgpu-batch --gres=gpu:h800:4' for four H800 GPUs
#SBATCH --nodes=1 --gres=gpu:4 --mail-type=ALL
# Specify a time limit if needed, e.g., 4 days
#SBATCH -t 4-00:00:00

# If you use Anaconda, initialize it
. $HOME/anaconda3/etc/profile.d/conda.sh
conda activate my_env

# cd to your desired directory and execute your program, e.g.
cd _to_your_directory_you_need
_run_your_program_
- Submit your batch job to the system with the following command on the gateway node:
sbatch my-gpu-batch
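sbatch prints the job ID on submission, and by default the job's console output is written to slurm-<jobid>.out in the directory where sbatch was run. Some standard SLURM commands for following up on a submitted job (12345 below is a placeholder job ID):
# check whether the job is pending or running
squeue -u $USER
# follow the job's output
tail -f slurm-12345.out
# cancel the job if needed
scancel 12345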
Usage Examples with DeepSeek R1 models
The following examples show how to install and run an OpenAI API server with DeepSeek R1 locally in the GPU farm. Distilled models of DeepSeek R1 have been downloaded to /share/deepseek-ai for convenience.
Installing SGLang
SGLang is a serving framework for large language models and vision language models. An OpenAI-compatible API server is included. The following steps assume that Anaconda is installed.
- On gpu3gate1.cs.hku.hk, request a GPU session
srun --gres=gpu:1 --mail-type=ALL --pty bash
- On the GPU node, create a new conda environment:
conda create -n deepseek python=3.10
- Activate the environment:
conda activate deepseek
- Install SGLang (ref: https://docs.sglang.ai/start/install.html)
pip install "sglang[all]>=0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
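- Optionally, verify the installation before logging out. A quick sanity check, assuming the package exposes __version__ as recent SGLang releases do:
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import torch; print(torch.cuda.is_available())"
The second command should print True on a GPU node.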
- Log out of the session. The conda environment will be used for running the API server (see below).
Running DeepSeek R1 models with one RTX4090 GPU
Smaller distilled models of DeepSeek R1, e.g., DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, can run with one RTX4090 GPU.
- On gpu3gate1.cs.hku.hk, request a GPU session
srun --gres=gpu:1 --mail-type=ALL --pty bash
Note the hostname of the GPU server assigned, e.g., gpu-4090-201, either from the command prompt or by using the 'hostname' command.
- Activate the conda environment that has SGLang installed (created in the previous section):
conda activate deepseek
- Start the server, using DeepSeek-R1-Distill-Qwen-7B for example:
python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--model-path /share/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --trust-remote-code
After the SGLang server has started up, a message will be shown that it is running on http://127.0.0.1:30000.
- Open a new terminal on your local computer. Log in to the same GPU server, using the hostname you noted in step 1:
ssh <your_username>@gpu-4090-201.cs.hku.hk
- On this new SSH session, query the model name:
curl http://127.0.0.1:30000/v1/models
The id of the model should be the same as the --served-model-name parameter in the previous step.
- Ask the server a question, e.g.,
curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
-d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "prompt": "Who are you?", "max_tokens": 1024, "temperature": 0.6 }'
Running DeepSeek R1 models with one H800 GPU
DeepSeek-R1-Distill-Qwen-32B cannot fit in an RTX4090, but can run on a single H800.
- On gpu3gate1.cs.hku.hk, request an H800 GPU session
srun -p q-h800 --gres=gpu:h800:1 --mail-type=ALL --pty bash
Note the hostname of the GPU server assigned, e.g., gpucluster-g1, either from the command prompt or by using the 'hostname' command.
- Activate the conda environment that has SGLang installed:
conda activate deepseek
- Start the server:
python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--model-path /share/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code
After the SGLang server has started up, a message will be shown that it is running on http://127.0.0.1:30000.
- Open a new terminal on your local computer. Log in to the same GPU server, using the hostname you noted in step 1:
ssh <your_username>@gpucluster-g1.cs.hku.hk
- On this new SSH session, query the model name:
curl http://127.0.0.1:30000/v1/models
The id of the model should be the same as the --served-model-name parameter in the previous step.
- Ask the server a question, e.g.,
curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
-d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "prompt": "List some interesting facts in Mathematics about the number 2025", "max_tokens": 1024, "temperature": 0.6 }'
Running Large DeepSeek R1 models with multiple H800 GPUs
DeepSeek-R1-Distill-Llama-70B needs two H800 GPUs to run.
- On gpu3gate1.cs.hku.hk, request a session with two H800 GPUs
srun -p q-h800 --gres=gpu:h800:2 --mail-type=ALL --pty bash
Note the hostname of the GPU server assigned, e.g., gpucluster-g1, either from the command prompt or by using the 'hostname' command.
- Activate the conda environment that has SGLang installed:
conda activate deepseek
- Start the server with 2 GPUs (--tp 2):
python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
--model-path /share/deepseek-ai/DeepSeek-R1-Distill-Llama-70B --trust-remote-code --tp 2
After the SGLang server has started up, a message will be shown that it is running on http://127.0.0.1:30000.
- Open a new terminal on your local computer. Log in to the same GPU server, using the hostname you noted in step 1:
ssh <your_username>@gpucluster-g1.cs.hku.hk
- On this new SSH session, query the model name:
curl http://127.0.0.1:30000/v1/models
The id of the model should be the same as the --served-model-name parameter in the previous step.
- Ask the server a question, e.g.,
curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
-d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "prompt": "Write a python program to display Hello World.", "max_tokens": 1024, "temperature": 0.6 }'