
HKUCDS GPU Farm for Research (Phase 3)

(The reader of this page is assumed to have read the Quick Start and Advanced Use pages.)


Introduction

HKUCDS GPU Farm for Research (Phase 3) is a SLURM cluster of servers with NVIDIA RTX4090 (24GB) and H800 SXM (80GB) GPUs. It is available to staff, PhD and MPhil students of the School of Computing and Data Science.

A user account can be applied for at https://intranet.cs.hku.hk/gpufarm3_acct/. After your account is created, you may log in to the gateway node gpu3gate1.cs.hku.hk with SSH:

ssh <your_username>@gpu3gate1.cs.hku.hk 

Note: the user accounts and home directories are independent of Phase 1 and 2 and are not shared.

Running a Session with one GPU

The following SLURM partitions are defined in HKUCDS GPU Farm for Research:

Partition        GPU type          No. of GPUs  Default CPU    Default server  Default     Maximum     Remarks
                                   per server   cores per GPU  RAM per GPU     time limit  time limit
debug (default)  RTX4090 (24GB)    2, 8 or 10   4              96GB            6 hours     7 days
q-h800           H800 (80GB)       8            4              240GB           6 hours     2 days      at most one job at a time
q-hgpu-batch     H100/H800 (80GB)  8            4              240GB           2 days      7 days      sbatch jobs only

 

After logging in to the gateway node, a GPU session can be started with srun, e.g.,

srun --gres=gpu:1 --mail-type=ALL --pty bash 

The default SLURM partition (debug) allocates RTX4090 GPUs. 4 CPU cores and 96GB of system RAM are allocated with each GPU. To have a session with 2 GPUs:

srun --nodes=1 --gres=gpu:2 --mail-type=ALL --pty bash

By default, each user account can request up to 4 GPUs concurrently. The limit can be raised on request.
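
Each session is a SLURM job. To list your running and pending jobs, or to end one early, the standard SLURM commands apply (the job ID is shown in the squeue output):

squeue -u $USER
scancel <job_id>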

Specifying a Longer Time Limit

A job will be terminated when its time limit is reached. Use '-t' to specify a longer time limit than the default. For example, to have a time limit of 12 hours:

srun --nodes=1 --gres=gpu:2 -t 12:00:00 --mail-type=ALL --pty bash 
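
For limits measured in days, SLURM also accepts the days-hours format D-HH:MM:SS, e.g., for a 2-day limit:

srun --gres=gpu:1 -t 2-00:00:00 --mail-type=ALL --pty bash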

Running a Session with one H800 GPU

To get a session with an H800 GPU, use the q-h800 partition by adding '-p q-h800' to srun or sbatch, e.g.,

srun -p q-h800 --gres=gpu:1 --mail-type=ALL --pty bash

4 CPU cores and 240GB of system RAM are allocated with each H800 GPU.

Running a Session with 2 H800 GPUs

srun -p q-h800 --nodes=1 --gres=gpu:2 --mail-type=ALL --pty bash
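
Inside any GPU session, you can verify which GPUs have been allocated to you with nvidia-smi, e.g.,

nvidia-smi --query-gpu=index,name,memory.total --format=csv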

Submitting Batch Jobs

If your program runs for days and does not require user interaction during execution, you can submit it to the system in batch mode. The system will schedule your job to run when the requested GPUs are available.

To submit a batch job from the gateway node gpu3gate1:

  1. Create a batch file, e.g., my-gpu-batch, with the following contents:
    #!/bin/bash

    # Tell the system the resources you need. Adjust the numbers according to your needs

    # specify the partition to use and GPUs needed with the -p and --gres options, e.g.
    # '--gres=gpu:4' for four RTX4090 GPUs

    # '-p q-hgpu-batch --gres=gpu:2' for two H100 or H800 GPUs
    # '-p q-hgpu-batch --gres=gpu:h100:2' for two H100 GPUs

    # '-p q-hgpu-batch --gres=gpu:h800:4' for four H800 GPUs
    #SBATCH --nodes=1 --gres=gpu:4 --mail-type=ALL

    # Specify a time limit if needed, e.g., 4 days
    #SBATCH -t 4-00:00:00

    # If you use Anaconda, initialize it
    . $HOME/anaconda3/etc/profile.d/conda.sh
    conda activate my_env


    # cd to the directory you need and execute your program, e.g.
    cd _to_your_directory_you_need
    _run_your_program_

  2. Submit your batch job to the system with the following command on the gateway node:
    sbatch my-gpu-batch
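
sbatch prints the job ID on submission. By default, SLURM writes the job's output to a file named slurm-<job_id>.out in the directory where sbatch was run; you can follow it while the job runs, e.g.,

tail -f slurm-<job_id>.out

To write the output to a different file instead, add an '#SBATCH -o <filename>' line to the batch file.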

Usage Examples with DeepSeek R1 models

The following examples show how to install SGLang and run an OpenAI-compatible API server with DeepSeek R1 locally in the GPU farm. Distilled models of DeepSeek R1 have been downloaded to /share/deepseek-ai for convenience.

Installing SGLang 

SGLang is a serving framework for large language models and vision language models. An OpenAI-compatible API server is included. The following steps assume that Anaconda is installed.

  1. On gpu3gate1.cs.hku.hk, request a GPU session
    srun --gres=gpu:1 --mail-type=ALL --pty bash
  2. On the GPU node, create a new conda environment:
    conda create -n deepseek python=3.10
  3. Activate the environment:
    conda activate deepseek
  4. Install SGLang (ref: https://docs.sglang.ai/start/install.html)
    pip install "sglang[all]>=0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
  5. Log out of the session. The conda environment will be reused for running the API server (see below).
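
Optionally, before logging out, confirm that the installation succeeded:

pip show sglang
python3 -c "import sglang"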

Running DeepSeek R1 models with one RTX4090 GPU

Smaller distilled models of DeepSeek R1, e.g., DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, can run with one RTX4090 GPU.

  1. On gpu3gate1.cs.hku.hk, request a GPU session
    srun --gres=gpu:1 --mail-type=ALL --pty bash

    Note the hostname of the GPU server assigned, e.g., gpu-4090-201, either from the command prompt or using the 'hostname' command.

  2. Activate the conda environment with SGLang installed in the previous section:
    conda activate deepseek
  3. Start the server, using DeepSeek-R1-Distill-Qwen-7B as an example:
    python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --model-path /share/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --trust-remote-code
    After the SGLang server has started up, a message will show that it is running on http://127.0.0.1:30000.

  4. Open a new terminal on your local computer and log in to the same GPU server, using the hostname you noted in step 1:
    ssh <your_username>@gpu-4090-201.cs.hku.hk
  5. On this new SSH session, query the model name:
    curl http://127.0.0.1:30000/v1/models
    The id of the model should be the same as the --served-model-name parameter used in step 3.

  6. Ask the server a question, e.g.,
    curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "prompt": "Who are you?", "max_tokens": 1024, "temperature": 0.6 }'

 

Running DeepSeek R1 models with one H800 GPU

DeepSeek-R1-Distill-Qwen-32B cannot fit in an RTX4090, but can run on a single H800.

  1. On gpu3gate1.cs.hku.hk, request an H800 GPU session
    srun -p q-h800 --gres=gpu:h800:1 --mail-type=ALL --pty bash

    Note the hostname of the GPU server assigned, e.g., gpucluster-g1, either from the command prompt or using the 'hostname' command.

  2. Activate the conda environment with SGLang installed in the previous section:
    conda activate deepseek
  3. Start the server:
    python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --model-path /share/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code
    After the SGLang server has started up, a message will show that it is running on http://127.0.0.1:30000.

  4. Open a new terminal on your local computer and log in to the same GPU server, using the hostname you noted in step 1:
    ssh <your_username>@gpucluster-g1.cs.hku.hk
  5. On this new SSH session, query the model name:
    curl http://127.0.0.1:30000/v1/models
    The id of the model should be the same as the --served-model-name parameter used in step 3.

  6. Ask the server a question, e.g.,
    curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "prompt": "List some interesting facts in Mathematics about the number 2025", "max_tokens": 1024, "temperature": 0.6 }'

Running Large DeepSeek R1 models with multiple H800 GPUs

DeepSeek-R1-Distill-Llama-70B needs two H800 GPUs to run.

  1. On gpu3gate1.cs.hku.hk, request a session with two H800 GPUs
    srun -p q-h800 --gres=gpu:h800:2 --mail-type=ALL --pty bash

    Note the hostname of the GPU server assigned, e.g., gpucluster-g1, either from the command prompt or using the 'hostname' command.

  2. Activate the conda environment with SGLang installed in the previous section:
    conda activate deepseek
  3. Start the server with tensor parallelism over the 2 GPUs (--tp 2):
    python3 -m sglang.launch_server --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --model-path /share/deepseek-ai/DeepSeek-R1-Distill-Llama-70B --trust-remote-code --tp 2
    After the SGLang server has started up, a message will show that it is running on http://127.0.0.1:30000.

  4. Open a new terminal on your local computer and log in to the same GPU server, using the hostname you noted in step 1:
    ssh <your_username>@gpucluster-g1.cs.hku.hk
  5. On this new SSH session, query the model name:
    curl http://127.0.0.1:30000/v1/models
    The id of the model should be the same as the --served-model-name parameter used in step 3.

  6. Ask the server a question, e.g.,
    curl http://localhost:30000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "prompt": "Write a python program to display Hello World.", "max_tokens": 1024, "temperature": 0.6 }'
