Applying for an Account
GPU Farm for Teaching (Phase 2) Accounts
Members of the School of Computing and Data Science and students taking designated courses offered by the School are eligible to use the GPU Farm for Teaching. Please visit https://intranet.cs.hku.hk/gpufarm_acct_cas/ for application.
The username of the account will be the same as your HKU Portal ID. A new password will be set for the account. An email will be sent to you after your account is created.
GPU Farm for Research (Phase 3) Accounts
The following users are also eligible to use GPU Farm Phase 3:
- staff of the School of Computing and Data Science;
- PhD and MPhil students of the School of Computing and Data Science.
Please visit https://intranet.cs.hku.hk/gpufarm3_acct/ for application and https://www.cs.hku.hk/gpu-farm/gpu-farm-for-research for usage information.
Accessing the GPU Farms
To access the GPU farm, you need to be connected to HKUVPN (from either the Internet or HKU Wifi), or to the HKU CDS wired network (CDS laboratories and offices). Use SSH to connect to one of the gateway nodes:
Gateway nodes for GPU Farm Phase 2: gpu2gate1.cs.hku.hk or gpu2gate2.cs.hku.hk
Gateway nodes for GPU Farm Phase 3: gpu3gate1.cs.hku.hk
Note: The accounts and home directories of the two phases of the GPU farm are separate from each other and are not shared.
Login with your username and password (or your SSH key if you have uploaded your public key during account application), e.g.:
For phase 2:
ssh <your_portal_id>@gpu2gate1.cs.hku.hk
For phase 3:
ssh <your_portal_id>@gpu3gate1.cs.hku.hk
Note: Users of Linux, including WSL2, may add "-X" as a command-line option to enable X11 forwarding.
These gateway nodes provide access to the actual GPU compute nodes of the farm. You can also transfer data to your home directory on the GPU farm by using SFTP to the gateway nodes.
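For example, assuming you are transferring from your local computer to a Phase 2 account (the file name data.zip is just a placeholder):
# run these on your local computer, not on the farm
sftp <your_portal_id>@gpu2gate1.cs.hku.hk
# or copy a single file non-interactively with scp
scp data.zip <your_portal_id>@gpu2gate1.cs.hku.hk:~/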
To facilitate X11 forwarding in interactive mode on the GPU compute nodes, an SSH key pair (id_rsa and id_rsa.pub) and the authorized_keys file are generated in the ~/.ssh directory when your GPU farm account is created. You are free to replace the key pair with your own, and to add your own public keys to the ~/.ssh/authorized_keys file.
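For example, assuming you use OpenSSH on your local computer, one way to add your own public key (my_key.pub below is just a placeholder) is:
# run on your local computer: append your local public key to ~/.ssh/authorized_keys on the farm
ssh-copy-id <your_portal_id>@gpu2gate1.cs.hku.hk
# or, after logging on to a gateway node, append a specific key manually
cat my_key.pub >> ~/.ssh/authorized_keys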
Using GPUs in Interactive Mode
After logging on to a gateway node, you can log in to a compute node with actual GPUs attached. For an interactive session, use the gpu-interactive command to run a bash shell on a GPU node. An available GPU compute node will be selected and allocated to you, and you will be logged on to the node automatically. The GPU compute nodes in Phase 2 are named gpu-xxxx-yy. Note the change of host name in the command prompt when you actually log on to a GPU node, e.g.,
tmchan@gpu2gate1:~$ gpu-interactive
tmchan@gpu-4080-103:~$
You can verify that a GPU is allocated to you with the nvidia-smi command, e.g.:
tmchan@gpu-4080-103:~$ nvidia-smi
Fri Jun 20 16:45:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 550.163.01 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4080 ... Off | 00000000:06:00.0 Off | N/A |
| 38% 32C P8 6W / 320W | 2MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
With the gpu-interactive command, 1 GPU, 4 CPU cores and 40 GB of RAM are allocated to you.
You can now install and run software as on a normal Linux server.
Note that you do not have sudo privileges. Do not use commands such as 'sudo pip' or 'sudo apt-get' to install software.
The time limits (quotas) for GPU and CPU time start counting once you have logged on to a GPU compute node, and stop counting when you log out of the GPU compute node:
tmchan@gpu-4080-103:~$ exit
tmchan@gpu2gate1:~$
All your processes running on the GPU node will be terminated when you exit from the gpu-interactive command.
Accessing Your Session with Another Terminal
After you are allocated a GPU compute node with gpu-interactive, you may access the same node with another SSH session. All you need is the actual IP address of the GPU compute node you are on. Run 'hostname -I' on the GPU compute node to find out its IP address. The output will be an IP address 10.XXX.XXX.XXX, e.g.,
tmchan@gpu-4080-103:~$ hostname -I
10.21.5.225
Then using another terminal on your local desktop/notebook, SSH to this IP address:
ssh -X <your_cs_username>@10.XXX.XXX.XXX
These additional SSH sessions will terminate when you exit the gpu-interactive command.
Note: Do not run more than one gpu-interactive (or srun) at the same time if you just want to access your current GPU session from a second terminal: those commands start a new, independent session and allocate an additional GPU to you, so your GPU time quota will be deducted twice. Also, you cannot access the GPUs of your previous sessions.
Software Installation
After logging on to a GPU compute node using the gpu-interactive command (or SLURM's native srun command), you can install software that uses GPUs, such as Anaconda, into your home directory.
GPU driver software is pre-installed on all GPU compute nodes. (Note: If your software reports that no GPU is detected, it is probably because you are still on a gateway node and have not logged on to a GPU node yet.)
Many software packages for AI and machine learning can be installed as an ordinary user without sudo privileges. You may install software on your account with package managers such as Anaconda or pip, or by compiling from the source code. Running 'sudo' and 'apt' is not supported.
Examples
Below are some example steps of software installation:
Note: make sure you are on a GPU compute node (the host name in the prompt shows a GPU node such as gpu-4080-103, not a gateway node) before installing and running your software.
Anaconda (including Jupyter)
# if you are on a gateway node, login a GPU node first
gpu-interactive
# download installer, check for latest version from www.anaconda.com
wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
# run the installer,
# and allow the installer to update your shell profile (.bashrc) to automatically initialize conda
bash Anaconda3-2024.10-1-Linux-x86_64.sh
# logout and login the GPU node again to activate the change in .bashrc
exit
# run gpu-interactive again when you are back to the gateway node
gpu-interactive
Note: if you have chosen not to allow the Anaconda installer to update your .bashrc, the 'conda' command will not be available. It can be fixed by running
~/anaconda3/bin/conda init
then log out and log in again.
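You can then verify that conda is available, e.g.:
conda --version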
Install PyTorch in a dedicated Conda environment
# If you are on a gateway node, login a GPU node first
gpu-interactive
# Create a new environment; you may change Python version 3.11 to another version if needed
conda create -n my_env python=3.11
# Activate the new environment
conda activate my_env
#Then use a web browser to visit https://pytorch.org/. Scroll down to the INSTALL PYTORCH section, select
#Your OS: Linux
#Package: Pip
#Language: Python
#Compute Platform: CUDA 11.x or CUDA 12.x
#Then run the command displayed in Run this Command, e.g.,
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
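After the installation, a quick sanity check (run inside your gpu-interactive session with my_env activated) can confirm that PyTorch detects the allocated GPU:
# should print True and the GPU model, e.g. an RTX 4080 on a Phase 2 node
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"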
Install Jupyter Kernel for an environment
# Suppose your conda environment is named my_env. To use Jupyter Lab within the environment, install the IPython kernel
conda activate my_env
conda install ipykernel
ipython kernel install --user --name=kernel_for_my_env
# restart Jupyter and then choose kernel_for_my_env in Jupyter Lab
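To confirm that the new kernel has been registered, you can list the installed kernels:
jupyter kernelspec list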
Using CUDA
When you install common tools such as PyTorch or TensorFlow, the installation instructions include steps that install the supporting CUDA runtime libraries. Usually there is no need to install or compile the CUDA toolkit separately.
In case a separate CUDA toolkit is needed, it is available in /usr/local/cuda on all GPU nodes. To avoid conflicts, CUDA is not added to the PATH variable of user accounts by default. If you need to develop with CUDA (e.g., using nvcc), you can add the following line at the end of your ~/.bashrc:
PATH=/usr/local/cuda/bin:$PATH
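Depending on your toolchain, you may also need the CUDA libraries on LD_LIBRARY_PATH; after re-login (or after running 'source ~/.bashrc') you can check that nvcc is found:
# optional: expose the CUDA libraries as well, if your build needs them
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# verify that the CUDA compiler is now on your PATH
nvcc --version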
Running Jupyter Lab without Starting a Web Browser
Running jupyter-lab starts a web browser by default. While this is convenient when the software is run on a local computer, running a web browser on a compute node of the GPU farm not only consumes the memory and CPU power of your session, but the responsiveness of the browser will also degrade, especially if you are connecting remotely from outside HKU. We recommend running jupyter-lab on a GPU compute node without starting a web browser, and accessing it with the web browser of your local computer. The steps below show how to do it:
1. Log in to a GPU compute node from a gateway node with gpu-interactive:
gpu-interactive
2. Find out the IP address of the GPU compute node:
hostname -I
(The output will be an IP address 10.XXX.XXX.XXX)
3. Start Jupyter Lab with the --no-browser option and note the URL displayed at the end of the output:
jupyter-lab --no-browser --FileContentsManager.delete_to_trash=False
The output will look something like the following:
...
Or copy and paste one of these URLs:
http://localhost:8888/?token=b92a856c2142a8c52efb0d7b8423786d2cca3993359982f1
Note the actual port no. in the URL. It may sometimes be 8889, 8890, or 8891, etc.
4. On your local desktop/notebook computer, start another terminal and run SSH with port forwarding to the IP address you obtained in step 2:
ssh -L 8888:localhost:8888 <your_gpu_acct_username>@10.XXX.XXX.XXX
(Change 8888 to the actual port no. you saw in step 3.)
Notes:
1. The ssh command in this step should be run on your local computer. Do not log in to the gateway node.
2. If you see an error like the following:
bind [127.0.0.1]:7860: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 7860
Could not request local forwarding.
you may have a Jupyter Lab instance running on your local computer. Close your local Jupyter Lab instance and rerun the ssh command in this step. If in doubt, restart your local computer.
5. On your local desktop/notebook computer, start a web browser and copy the URL from step 3 into it.
Remember to shut down your Jupyter Lab instance and quit your gpu-interactive session after use. Leaving a Jupyter Lab instance idle on a GPU node will exhaust your GPU time quota.
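The simplest way to shut down Jupyter Lab is to press Ctrl+C in the terminal where jupyter-lab is running and confirm the shutdown, then exit the gpu-interactive session. Depending on your Jupyter version, you may also be able to list any servers still running with:
jupyter server list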
Note on file deletion: If you start Jupyter Lab without --FileContentsManager.delete_to_trash=False, files deleted with Jupyter Lab will be moved to the Trash Bin (~/.local/share/Trash) instead of being actually deleted. Your disk quota may eventually be used up by the Trash Bin. To empty the trash and release the disk space used, use the following command:
rm -rf ~/.local/share/Trash/*
Using tmux for Unstable Network Connections
To avoid disconnection due to unstable Wi-Fi or VPN, you may use the tmux command on gpu2gate1 or gpu2gate2; it keeps a terminal session running even when you are disconnected.
Note that tmux should be run on gpu2gate1 or gpu2gate2. Do not run tmux on a GPU node after running gpu-interactive or srun; all tmux sessions on a GPU node will still be terminated when your gpu-interactive/srun session ends.
There are many on-line tutorials on the web showing how to use this command, e.g.,
https://medium.com/actualize-network/a-minimalist-guide-to-tmux-13675fb160fa
Please see these tutorials for details, especially on the use of the detach and attach functions.
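As a minimal example (the session name mywork is just an illustration):
# on gpu2gate1 or gpu2gate2: start a named tmux session
tmux new -s mywork
# inside the session, run gpu-interactive and your programs as usual;
# press Ctrl-b then d to detach; the session keeps running if your connection drops
# later, after reconnecting to the same gateway node, list and re-attach:
tmux ls
tmux attach -t mywork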
Cleaning up files to Free up Disk Space
The following guidelines help to free up disk space when you are running out of disk quota:
- Empty the Trash Bin. When you run Jupyter Lab without the --FileContentsManager.delete_to_trash=False option, or use other GUIs to manipulate files, files you try to delete will be moved to the Trash Bin (~/.local/share/Trash) instead of being actually deleted. To empty the trash and release the disk space used, use the following command:
rm -rf ~/.local/share/Trash/*
- Remove installation files after software installation. For example, the Anaconda installation file Anaconda3-20*.*-Linux-x86_64.sh, which has a size of over 500MB, can be deleted after installation. Also check whether you have downloaded the files multiple times and delete redundant copies.
- Clean up conda installation packages. Run the following command to remove conda installation files cached in your home directory:
conda clean --tarballs
- Clean up cached pip installation packages:
pip cache purge
- Clean up intermediate files generated by the software packages you are using. Study the documentation of individual packages for details.
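To see where your disk quota is going and decide what to clean up, you can check which items in your home directory take up the most space, e.g.:
# list top-level items in your home directory by size, largest last
du -h --max-depth=1 ~ | sort -h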
Further Information
See the Advanced Use page for information on using multiple GPUs in a single session and running batch jobs, and the official site of the SLURM Workload Manager for further documentation on using SLURM.
You may also contact support@cs.hku.hk for other questions.