Running on HPC Systems
Running tasks on a high performance computing (HPC) systems is different from running tasks on a personal computer. You can either run tasks interactively on a compute node, or submit tasks as batch jobs to a queue. The former is useful for development (e.g. exploratory data analysis, testing models, troubleshooting), while the latter is useful for production runs (e.g. training models at scale). By the end of this section, you should be able to:
Start an interactive session on a compute node
Describe the components of a SLURM batch job submission file
Use a Linux text editor such as VIM
Submit a batch job to a queue
Manage software modules
Run Interactively with idev
The idev utility initiates an interactive session on one or more compute nodes so that you can
issue commands and run code as if you were doing so on your personal machine. An interactive session
is useful for development, as you can quickly compile, run, and validate code. This is accomplished
by simply executing idev on any of the TACC systems.
To learn about the command line options available for idev, use idev -help.
[frontera]$ idev --help
...
OPTION ARGUMENTS DESCRIPTION
-A account_name sets account name (default: in .idevrc)
-m minutes sets time in minutes (default: 30)
-p queue_name sets queue to named queue (default: -p development)
-r resource_name sets hardware
-t hh:mm:ss sets time to hh:mm:ss (default: 00:30:00)
-help [--help ] displays (this) help message
-v [--version ] output version information and exit
Let’s go over some of the most useful idev command line options that can customize your
interactive session:
To change the time limit from the default 30 minutes, users can apply the
-mcommand line option. For example, a user requesting an interactive session for an hour would use the command line option-m 60.To change the allocation_name associated with the interactive session, users can use the
-Acommand line option. This option is useful for when a user has multiple allocations they belong to. For example, if I have allocations on accountsTACCandTraining, I can use-Ato set the allocation I want to be used like so:-A TACCor-A Training.To change the queue to be different than the default
developmentqueue, users can use the-pcommand line option. For example, if a user wants to launch an interactive session on one of Frontera GPU nodes, they would use the command line option-p rtxor-p rtx-dev. You can learn more about the different queues of Frontera here.
Note
For the scope of this section, we will be using the default development queue.
To start a thirty-minute interactive session on a compute node in the development queue with our
EXAMPLE allocation:
[frontera]$ idev -A EXAMPLE
Note
For this workshop, using allocation name Frontera-Training instead of “EXAMPLE”
If launch is successful, you will see output that includes the following excerpts:
...
-----------------------------------------------------------------
Welcome to the Frontera Supercomputer
-----------------------------------------------------------------
...
-> After your idev job begins to run, a command prompt will appear,
-> and you can begin your interactive development session.
-> We will report the job status every 4 seconds: (PD=pending, R=running).
-> job status: PD
-> job status: R
-> Job is now running on masternode= c205-004...OK
...
c205-004[clx](633)$
EXERCISE
Note
In case you didn’t donwload the code in the previous section, you can download by doing:
[frontera]$ cdw
[frontera]$ wget https://github.com/TACC/life_sciences_ml_at_tacc/raw/refs/heads/main/docs/section1/files/firststeps.zip
[frontera]$ unzip firststeps.zip
Let’s execute some Python code that determines the larger of two numbers. The code also includes a
3-second delay before finishing. Note, the [clx] syntax in the prompt refers to an interactive
session that has been started on a Frontera Cascade Lake node.
[clx]$ cdw
[clx]$ unzip firststeps.zip
[clx]$ cd Lab01
[clx]$ pwd
/work2/03302/lconcia/frontera/Lab01
[clx]$ ls
example.slurm example_template.slurm my_code.py
Load the appropriate modules, and run my_code.py.
[clx]$ module load python3
[clx]$ python3 my_code.py
The larger number of 51 and 20 is 51
You can check the files that were generated using ls, and see the contents of the file with
cat.
[clx]$ ls
duration.txt example.slurm example_template.slurm my_code.py
[clx]$ cat duration.txt
Done in 3.009739637374878 seconds.
To exit an interactive session, you can use the command logout.
Run Non-Interactively with sbatch
As we discussed before, on Frontera there are login nodes and compute nodes.
We cannot run the applications we need for our research on the login nodes because they are designed as a prep area, where you may edit and manage files, compile code, perform file management, issue transfers, submit new and track existing batch jobs etc. The login nodes provide an interface to the “back-end” compute nodes, where actual computations occur and where research is done.
To run a job, instead, we must write a short text file containing a list of the resources we need, and containing the job command(s). Then, we submit that text file to a queue to run on compute nodes. This process is called batch job submission.
There are several queues available on Frontera. It is important to understand the queue limitations
and pick a queue that is appropriate for your job. Documentation can be found
here.
Today, we will be using the development queue which has a max runtime of 2 hours, and users can
only submit one job at a time.
First, navigate to the Lab01 directory where we have an example job script prepared, called
example_template.slurm:
[frontera]$ cdw
[frontera]$ cd Lab01
[frontera]$ cat example_template.slurm
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run applications on
# TACCs Frontera system.
#
# Example of job submission
# To submit a batch job, execute: sbatch example.slurm
# To show all queued jobs from user, execute: showq -u
# To kill a queued job, execute: scancel <jobId>
#----------------------------------------------------
#SBATCH -J # Job name
#SBATCH -o # Name of stdout output file (%j expands to jobId)
#SBATCH -e # Name of stderr error file (%j expands to jobId)
#SBATCH -p # Queue (partition) name
#SBATCH -N # Total number of nodes (must be 1 for serial)
#SBATCH -n # Total number of threas tasks requested (should be 1 for serial)
#SBATCH -t # Run time (hh:mm:ss), development queue max 2:00:00
#SBATCH --mail-user=your_email@domaim.com # Address email notifications
#SBATCH --mail-type=all # Email at begin and end of job
#SBATCH -A # Project/Allocation name (req'd if you have more than 1)
# Everything below here should be Linux commands
Frontera Production Queues
Here, we are comparing the differences between two queues: development and normal.
For information about other queues, please refer to the
Frontera Production Queues.
Queue Name |
|
|
|---|---|---|
Min-Max Nodes per Job (assoc’d cores) |
1-40 nodes (2,240 cores) |
3-512 nodes (28,672 cores) |
Max Job Duration |
2 hrs |
48 hrs |
Max Nodes per User |
40 nodes |
1836 nodes |
Max Jobs per User |
1 job |
100 jobs |
Charge Rate per node-hour |
1 SU |
1 SU |
Note
If you submit a job requesting 48 hrs in the normal queue, and it takes a total of 10 hrs to run, you will be charged as follows:
SUs charged = (Number of nodes) X (job wall-clock time) X (charge rate per node-hour).
SUs charged = (Number of nodes) X 10 X 1.
GPUs Available at TACC
Users frequently need to access GPUs to accelerate their machine learning workloads. Here a summary of GPUs available at TACC.
System |
GPU Nodes |
# |
GPUs per node |
Queues |
|---|---|---|---|---|
Lonestar6 |
A100 |
84 |
3x NVIDIA A100 |
gpu-a100 |
gpu-a100-dev |
||||
gpu-a100-small |
||||
H100 |
4 |
2x NVIDIA H100 |
gpu-h100 |
|
Stampede3 |
Ponte Vecchio |
20 |
4x Intel Data Center Max 1550s |
pvc |
Frontera |
90 |
4x NVIDIA Quadro RTX 5000 |
rtx |
|
rtx-dev |
||||
Vista |
Grace Hopper |
600 |
1x NVIDIA H200 GPU |
gh |
gh-dev |
Executing Basic Job Management Tasks
First, we must know an application we want to run, and a research question we want to ask. In this example, we aim to execute a Python code that determines the larger of two numbers. The code also includes a 3-second delay before finishing.
[frontera]$ cdw
[frontera]$ cd Lab01
[frontera]$ pwd
/work2/03302/lconcia/frontera/Lab01
[frontera]$ ls
example.slurm example_template.slurm my_code.py
Next, we need to fill out example_template.slurm to request the necessary resources. I know that
this code will take a little more than 3 seconds, so I can reasonably predict how much we will need.
When running your first jobs with your applications, it will take some trial and error, and reading
online documentation, to get a feel for how many resources you should use. Open example_template.slurm
with VIM and fill out the following information:
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run applications on
# TACCs Frontera system.
#
# Example of job submission
# To submit a batch job, execute: sbatch example.slurm
# To show all queued jobs from user, execute: showq -u
# To kill a queued job, execute: scancel <jobId>
#----------------------------------------------------
#SBATCH -J first_job # Job name
#SBATCH -o output.%j # Name of stdout output file (%j expands to jobId)
#SBATCH -e error.%j # Name of stderr error file (%j expands to jobId)
#SBATCH -p development # Queue (partition) name
#SBATCH -N 1 # Total number of nodes (must be 1 for serial)
#SBATCH -n 1 # Total number of threas tasks requested (should be 1 for serial)
#SBATCH -t 0:30:00 # Run time (hh:mm:ss), development queue max 2:00:00
#SBATCH -A Frontera-Training # Project/Allocation name (req'd if you have more than 1)
# Everything below here should be Linux commands
module load python3
python3 my_code.py
The way this job is configured, it will load the appropriate modules, and run my_code.py.
Text Editing with VIM
VIM is a text editor used on Linux file systems.
Open the file example_template.slurm:
[frontera]$ vim example_template.slurm
There are two “modes” in VIM that we will talk about today. They are called “insert mode” and
“normal mode”. In insert mode, the user is typing text into a file as seen through the terminal
(think about typing text into TextEdit or Notepad). In normal mode, the user can perform other
functions like save, quit, cut and paste, find and replace, etc. (think about clicking the menu
options in TextEdit or Notepad). The two main keys to remember to toggle between the modes are i
and Esc.
Entering VIM insert mode:
> i
Entering VIM normal mode:
> Esc
A summary of the most important keys to know for normal mode are (more on your cheat sheet):
# Navigating the file:
arrow keys move up, down, left, right
Ctrl+u page up
Ctrl+d page down
0 move to beginning of line
$ move to end of line
gg move to beginning of file
G move to end of file
:N move to line N
# Saving and quitting:
:q quit editing the file
:q! quit editing the file without saving
:w save the file, continue editing
:wq save and quit
- For more information, see:
Or type on the command line:
vimtutor
Submit a Batch Job to the Queue
Once you have filled in the job description, save and quit the file. Submit the job to the queue
using the sbatch command`:
[frontera]$ sbatch example_template.slurm
To view the jobs you have currently in the queue, use the showq or squeue commands:
[frontera]$ showq -u # shows my jobs
[frontera]$ showq # shows all jobs by all users
[frontera]$ squeue -u $USERNAME # shows my jobs
[frontera]$ squeue # shows all jobs by all users
If for any reason you need to cancel a job, use the scancel command with the 6- or 7-digit
jobid:
[frontera]$ scancel jobid
For more example scripts, see this directory on Frontera:
[frontera]$ ls /share/doc/slurm/
If everything went well, you should have a file named duration.txt,
an output file named something similar to output.o6146935,
and an error file named something similar to error.o6146935 in the same directory as the
example_template.slurm script.
[frontera]$ ls
duration.txt error.6146935 example.slurm example_template.slurm my_code.py output.6146935
Congratulations! You ran a batch job on Frontera!
Managing Modules
Modules enable users to run specific applications or access libraries without the need to log out and back in. Modules for applications adjust the user’s path for easy access, while those for library packages set environment variables indicating the location of library and header files. Switching between package versions or removing a package is straightforward.
Tour of the Module Command
The module command sets the appropriate environment variable independent of the user’s shell. Typically the system will load a default set of modules. A user can list the modules loaded by:
[frontera]$ module list
Currently Loaded Modules:
1) intel/19.1.1 3) git/2.24.1 5) python3/3.7.0 7) hwloc/1.11.12 9) TACC
2) impi/19.0.9 4) autotools/1.2 6) cmake/3.24.2 8) xalt/2.10.34
To find out available modules for loading, a user can use:
[frontera]$ module avail
- or -
[frontera]$ module spider
Press the <Enter> key to scroll through line-by-line, or the <Space> key to scroll through
page-by-page. Press q to quit the view.
If there are many modules on a system, it can be difficult to see what modules are available to load. To display a concise listing:
[frontera]$ module overview
----------------- /opt/apps/intel19/impi19_0/python3_7/modulefiles -----------------
boost-mpi (2)
--------------------- /opt/apps/intel19/python3_7/modulefiles ----------------------
boost (2)
---------------------- /opt/apps/intel19/impi19_0/modulefiles ----------------------
Rstats (2) gromacs (4) openfoam (4) remora (2)
adcirc (1) hpctoolkit (2) opensees (4) rosetta (1)
adios2 (1) hypre (16) p3dfft++ (1) slepc (39)
amask (1) ipm (1) p4est (2) suitesparse (10)
arpack (1) kokkos (4) parallel-netcdf (4) sundials (9)
aspect (1) lammps (4) parmetis (1) superlu (12)
This shows the short name of the module (i.e. git, or Rstats) and the number in the parentheses is the number of versions for each. This list above shows that there is one version of git and two versions of Rstats.
To check all the versions of a package (e.g., Rstats):
[frontera]$ module avail Rstats
--------------- /opt/apps/intel19/impi19_0/modulefiles ---------------
Rstats/3.6.3 Rstats/4.0.3 (D)
In Rstats/4.0.3 (D), the (D) denotes the default module. When loading packages, if you don’t
specify the version, the default module will be loaded. To load packages a user simply does:
[frontera]$ module load package1 package2 ...
To unload packages a user does:
[frontera]$ module unload package1 package2 ...
Modules can contain help messages. To access a module’s help do:
[frontera]$ module help packageName
To get a list of all the commands that module knows about do:
[frontera]$ module help
Review of Topics Covered
VIM
Command |
Effect |
|---|---|
|
open “file.txt” and edit with |
|
toggle to insert mode |
|
toggle to normal mode |
|
navigate the file |
|
quit ending the file |
|
quit editing the file without saving |
|
save the file, continue editing |
|
save and quit |
SLURM
Command |
Effect |
|---|---|
|
submit batch job to the queue |
|
show all jobs by all users |
|
show the current user’s jobs |
|
show all jobs by all users |
|
show user USERNAME’s jobs |
|
cancel a job given a job ID |
Modules
Command |
Effect |
|---|---|
|
List currently loaded modules |
|
See what modules are available |
|
See what modules are available (concise) |
|
Search for module “name” |
|
Load module “name” |
|
Unload module “name” |
|
Show module “name” help |
|
Show module command help |