
HKUCDS GPU Farm for Research (Phase 3)

(The reader of this page is assumed to have read the Quick Start and Advanced Use pages.)


Introduction

HKUCDS GPU Farm for Research (Phase 3) is a SLURM cluster of servers with NVIDIA RTX4090 (24GB) and H800 SXM (80GB) GPUs. It is available to staff, PhD and MPhil students of the School of Computing and Data Science.

A user account can be applied for at https://intranet.cs.hku.hk/gpufarm3_acct/. After your account is created, you may log in to the gateway node gpu3gate1.cs.hku.hk with SSH:

ssh <your_username>@gpu3gate1.cs.hku.hk 

Note: the user accounts and home directories are independent of Phase 1 and 2 and are not shared.

Running a Session with one GPU

The following SLURM partitions are defined in HKUCDS GPU Farm for Research:

Partition        GPU type          No. of GPUs  Default CPU    Default server  Default     Maximum     Remarks
                                   per server   cores per GPU  RAM per GPU     time limit  time limit
debug (default)  RTX4090 (24GB)    2, 8 or 10   4              96GB            6 hours     7 days
q-h800           H800 (80GB)       8            4              240GB           6 hours     2 days      at most one job at a time
q-hgpu-batch     H100/H800 (80GB)  8            4              240GB           2 days      7 days      sbatch jobs only

 

After logging in to the gateway node, a GPU session can be started with srun, e.g.,

srun --gres=gpu:1 --mail-type=ALL --pty bash 

The default SLURM partition (debug) allocates RTX4090 GPUs. 4 CPU cores and 96GB of system RAM are allocated with each GPU. To have a session with 2 GPUs:

srun --nodes=1 --gres=gpu:2 --mail-type=ALL --pty bash

By default, each user account can request up to 4 GPUs concurrently. The limit can be raised on request.
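
Each session is a SLURM job. To list your running and pending jobs, or to end one early, the standard SLURM commands apply (the job ID is shown in the squeue output):

squeue -u $USER
scancel <job_id>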

Specifying a Longer Time Limit

A job will be terminated when its time limit is reached. Use '-t' to specify a longer time limit than the default. For example, to have a time limit of 12 hours:

srun --nodes=1 --gres=gpu:2 -t 12:00:00 --mail-type=ALL --pty bash 
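
For limits measured in days, SLURM also accepts the days-hours format D-HH:MM:SS, e.g., for a 2-day limit:

srun --gres=gpu:1 -t 2-00:00:00 --mail-type=ALL --pty bash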

Running a Session with one H800 GPU

To get a session with an H800 GPU, use the q-h800 partition by adding '-p q-h800' to srun or sbatch, e.g.,

srun -p q-h800 --gres=gpu:1 --mail-type=ALL --pty bash

4 CPU cores and 240GB of system RAM are allocated with each H800 GPU.

Running a Session with 2 H800 GPUs

srun -p q-h800 --nodes=1 --gres=gpu:2 --mail-type=ALL --pty bash
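
Inside any GPU session, you can verify which GPUs have been allocated to you with nvidia-smi, e.g.,

nvidia-smi --query-gpu=index,name,memory.total --format=csv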

Submitting Batch Jobs

If your program runs for days and does not require user interaction during execution, you can submit it to the system in batch mode. The system will schedule your job to run when the requested GPUs are available.

To submit a batch job from the gateway node gpu3gate1:

  1. Create a batch file, e.g., my-gpu-batch, with the following contents:
    #!/bin/bash

    # Tell the system the resources you need. Adjust the numbers according to your needs

    # specify the partition to use and GPUs needed with the -p and --gres options, e.g.
    # '--gres=gpu:4' for four RTX4090 GPUs

    # '-p q-hgpu-batch --gres=gpu:2' for two H100 or H800 GPUs
    # '-p q-hgpu-batch --gres=gpu:h100:2' for two H100 GPUs

    # '-p q-hgpu-batch --gres=gpu:h800:4' for four H800 GPUs
    #SBATCH --nodes=1 --gres=gpu:4 --mail-type=ALL

    # Specify a time limit if needed, e.g., 4 days
    #SBATCH -t 4-00:00:00

    # If you use Anaconda, initialize it
    . $HOME/anaconda3/etc/profile.d/conda.sh
    conda activate my_env


    # cd to the directory you need and execute your program, e.g.
    cd _to_your_directory_you_need
    _run_your_program_

  2. Submit your batch job to the system with the following command on the gateway node:
    sbatch my-gpu-batch
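
sbatch prints the job ID on submission. By default, SLURM writes the job's output to a file named slurm-<job_id>.out in the directory where sbatch was run; you can follow it while the job runs, e.g.,

tail -f slurm-<job_id>.out

To write the output to a different file instead, add an '#SBATCH -o <filename>' line to the batch file.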

Usage Examples with DeepSeek R1 models

The following examples show how to install SGLang and run an OpenAI-compatible API server with DeepSeek R1 locally in the GPU farm. Distilled models of DeepSeek R1 have been downloaded to /share/deepseek-ai for convenience.

Installing SGLang 

SGLang is a serving framework for large language models and vision language models. An OpenAI-compatible API server is included. The following steps assume that Anaconda is installed.

  1. On gpu3gate1.cs.hku.hk, request a GPU session
    srun --gres=gpu:1 --mail-type=ALL --pty bash
  2. On the GPU node, create a new conda environment:
    conda create -n deepseek python=3.10
  3. Activate the environment:
    conda activate deepseek
  4. Install SGLang (ref: https://docs.sglang.ai/start/install.html)
    pip install "sglang[all]>=0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
  5. Log out of the session. The conda environment will be reused for running the API server (see below).
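
Optionally, before logging out, confirm that the installation succeeded:

pip show sglang
python3 -c "import sglang"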

Running DeepSeek R1 models with one RTX4090 GPU

Smaller distilled models of DeepSeek R1, e.g., DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, can run with one RTX4090 GPU.

  1. On gpu3gate1.cs.hku.hk, request a GPU session
    srun --gres=gpu:1 --mail-type=ALL --pty bash

    Note the hostname of the GPU server assigned, e.g., gpu-4090-201, either from the command prompt or using the 'hostname' command.

  2. Activate the conda environment with SGLang installed in the previous section:
    conda activate deepseek
  3. Start the server, using DeepSeek-R1-Distill-Qwen-7B as an example:
    python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --model-path /share/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --trust-remote-code
    After the SGLang server has started up, a message will show that it is running on http://127.0.0.1:30000.

  4. Open a new terminal on your local computer and log in to the same GPU server, using the hostname you noted in step 1:
    ssh <your_username>@gpu-4090-201.cs.hku.hk
  5. On this new SSH session, query the model name:
    curl http://127.0.0.1:30000/v1/models
    The id of the model should be the same as the --served-model-name parameter used in step 3.

  6. Ask the server a question, e.g.,
    curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "prompt": "Who are you?", "max_tokens": 1024, "temperature": 0.6 }'

 

Running DeepSeek R1 models with one H800 GPU

DeepSeek-R1-Distill-Qwen-32B cannot fit in an RTX4090, but can run on a single H800.

  1. On gpu3gate1.cs.hku.hk, request an H800 GPU session
    srun -p q-h800 --gres=gpu:h800:1 --mail-type=ALL --pty bash

    Note the hostname of the GPU server assigned, e.g., gpucluster-g1, either from the command prompt or using the 'hostname' command.

  2. Activate the conda environment with SGLang installed in the previous section:
    conda activate deepseek
  3. Start the server:
    python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --model-path /share/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code
    After the SGLang server has started up, a message will show that it is running on http://127.0.0.1:30000.

  4. Open a new terminal on your local computer and log in to the same GPU server, using the hostname you noted in step 1:
    ssh <your_username>@gpucluster-g1.cs.hku.hk
  5. On this new SSH session, query the model name:
    curl http://127.0.0.1:30000/v1/models
    The id of the model should be the same as the --served-model-name parameter used in step 3.

  6. Ask the server a question, e.g.,
    curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "prompt": "List some interesting facts in Mathematics about the number 2025", "max_tokens": 1024, "temperature": 0.6 }'

Running Large DeepSeek R1 models with multiple H800 GPUs

DeepSeek-R1-Distill-Llama-70B needs two H800 GPUs to run.

  1. On gpu3gate1.cs.hku.hk, request a session with two H800 GPUs
    srun -p q-h800 --gres=gpu:h800:2 --mail-type=ALL --pty bash

    Note the hostname of the GPU server assigned, e.g., gpucluster-g1, either from the command prompt or using the 'hostname' command.

  2. Activate the conda environment with SGLang installed in the previous section:
    conda activate deepseek
  3. Start the server with tensor parallelism over the 2 GPUs (--tp 2):
    python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --model-path /share/deepseek-ai/DeepSeek-R1-Distill-Llama-70B --trust-remote-code --tp 2
    After the SGLang server has started up, a message will show that it is running on http://127.0.0.1:30000.

  4. Open a new terminal on your local computer and log in to the same GPU server, using the hostname you noted in step 1:
    ssh <your_username>@gpucluster-g1.cs.hku.hk
  5. On this new SSH session, query the model name:
    curl http://127.0.0.1:30000/v1/models
    The id of the model should be the same as the --served-model-name parameter used in step 3.

  6. Ask the server a question, e.g.,
    curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "prompt": "Write a python program to display Hello World.", "max_tokens": 1024, "temperature": 0.6 }'
