Psychology Research Computing Cluster, UTSC
User Policy

 

Background
The Psychology Computing Cluster at UTSC was established in 2015 using an NSERC Research Tools and Infrastructure grant and funds from the UTSC Office of the Vice-Principal Academic and Dean. An upgrade of the cluster was made possible by a Departmental Research Fund grant from the UTSC Office of the Vice-Principal Research and Innovation in 2020.

The cluster is housed at Information and Instructional Technology Services (IITS), UTSC, and is currently composed of 8 x Dell PowerEdge R430 CPU nodes, 1 x R740 CPU node, 2 x R740 GPU nodes and 1 x R740 GPU debug node. The cluster has two head nodes: a master head node, neurocomp0, and a backup head node, neurocomp00. Both head nodes are virtual machines provided by IITS. Each R430 compute node has 2 x Intel Xeon E5-2640 v3 2.6GHz processors (8 cores / 16 threads each) and 64GB RAM. The R740 CPU node has 2 x Intel Xeon Platinum 8268 2.9GHz processors (24 cores / 48 threads each) and 384GB RAM. Each R740 GPU node has 2 x Intel Xeon Gold 6238R 2.2GHz processors (28 cores / 56 threads each), 384GB RAM and 1 x NVIDIA Quadro RTX 8000 GPU card with 48GB GPU memory and 4608 CUDA cores. The third R740 GPU server was originally purchased for debugging purposes and now has 96GB CPU memory, 32 CPU cores, and 2 x NVIDIA Ampere A40 GPU cards, each with 48GB GPU memory and 10752 CUDA cores.

The cluster currently runs Ubuntu Server 20.04 and Slurm 22.05.6. Each compute node has Matlab (latest version 2024a) installed, as well as Anaconda 3 and Anaconda 2 with Tensorflow, Pytorch and more. Neuroimaging and computational modelling packages include FSL, SPM, HDDM and more. New storage hardware is currently being set up to provide a total of over 300TB for computing, backup and archive space.

Governance and decision making
Oversight of the cluster is currently provided by the Chair of the Department of Psychology, the Associate Chair Research of the Department of Psychology, and the Computational Research Support Specialist. All Principal Investigator users are consulted on major issues and decisions pertaining to system upgrades, maintenance, ongoing operations, and policies.

Support
Support is provided by the Computational Research Support Specialist (Weijun Gao), whose responsibilities include:

  • Overseeing hardware repairs and upgrades.
  • Installing and maintaining software.
  • Monitoring system performance and usage.
  • Providing training (e.g. tutorials and demonstrations on a group or one-to-one basis).
  • User account management.
  • Troubleshooting and code debugging.
As per IITS policies, a backup cluster administrator will provide support in the event of a time-sensitive emergency when the Computational Research Support Specialist is unavailable (contact details of the backup administrator will be released when necessary). A technical support website, which provides system status information and useful tips, can be accessed at https://psycomp.utsc.utoronto.ca/support/.

User accounts
To open a user account, individuals are required to submit a signed cluster user registration form (available from the technical support website). By default, an individual's username is their UTORid, and authentication is provided by a UTSC authentication server so that users can log in to the cluster with their UTORid credentials. A local Linux account will be created if a user does not have UTORid access at UTSC. By default, each user account comes with a 500GB home storage quota. The home quota can be increased on request, and alternative computing storage is available when needed. When an individual no longer requires access to the cluster, their data will be archived (see 'Data storage and backup').
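For reference, once an account has been created, logging in from a terminal is typically a single SSH command. The placeholders below should be replaced with the user's own UTORid and the head node address given on the technical support website; no specific hostname is assumed here.

    # Replace the placeholders with your UTORid and the cluster address from the support website
    ssh <your_utorid>@<cluster_head_node_address>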

User fees
User fees are necessary to cover the cost of software licenses (i.e. Matlab) and system maintenance/upgrades. Current fees are $300/year for faculty/post-doc/graduate student accounts and $75/year for undergraduate student accounts. Fees are collected at the start of the academic year in September, following approval by each Principal Investigator; fees for accounts opened at other times during the academic year will be prorated. Fees may also be reduced or waived on a case-by-case basis should research funds be a serious concern for a Principal Investigator (e.g. a Principal Investigator who is between grants).

Data storage and backup
Computing storage, backup storage and archive storage all use RAID arrays. Incremental backups of computing data are conducted on a weekly basis, with the three most recent copies kept on a rolling basis. Once the new storage hardware is installed, the 12 most recent weekly backups will be kept. Archived data are kept indefinitely and will be mirrored in the new storage system. All users are encouraged to clean up their data regularly, delete superfluous files, and archive data that are no longer needed. Requests for additional computing storage space will not be granted until users can demonstrate that they have outgrown their existing quota and have done their best to clean up their data. All users are requested to document their data and scripts (e.g. for archiving purposes) before leaving the department.
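Before requesting additional space, users can check how much of their quota they are using and identify large files or directories to clean up or archive. The commands below are standard Linux utilities and are given only as an illustration.

    # Total size of your home directory (human-readable)
    du -sh ~
    # Largest items in your home directory, to find candidates for cleanup or archiving
    du -sh ~/* | sort -h | tail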

Usage policy
Optimal running of the cluster relies on all users being ‘good citizens’. To facilitate this, a queue system, SLURM, has been implemented, and all jobs must be submitted through it in order to be executed. Users should consult the technical support website for details on how to use SLURM and which partitions (i.e. cpu, gpu, gpudebug) are most appropriate for their jobs; an example batch script is given after the list below.

  • CPU – during times of high usage, each user may use no more than 32 CPU cores. When system usage is low, each user may use up to 96 CPU cores, but must be prepared to immediately reduce their usage back to 32 cores should system demand increase. In the latter scenario, checkpointing is highly recommended (see ‘Checkpointing’). Users are strongly encouraged to monitor cluster usage regularly on the technical support website and, if in doubt, to contact the Computational Research Support Specialist.
  • GPU – each user may use at most 1 GPU and should not submit more than one job at a time. The “gpu” and “gpudebug” partitions have shorter wall-time limits. Users may submit CPU-only jobs to the “gpu” partition; however, a shell/Perl script will be developed to monitor “gpu” partition usage and adjust job priorities to maintain fairness and resource-usage efficiency.
  • Resource allocation limits may be adjusted in the future as available resources and computing needs change.
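For illustration only, a minimal SLURM batch script for a CPU job might look like the following. The partition names (cpu, gpu, gpudebug) are those listed above; the job name, resource requests, time limit and command are placeholders that should be adapted to the actual job and checked against the current limits on the technical support website.

    #!/bin/bash
    #SBATCH --job-name=example_cpu       # placeholder job name
    #SBATCH --partition=cpu              # cpu, gpu or gpudebug (see above)
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16           # stay within the per-user CPU core limits above
    #SBATCH --mem=32G                    # placeholder memory request
    #SBATCH --time=24:00:00              # placeholder wall-time; check partition limits
    srun ./my_analysis                   # replace with your own program or script

A GPU job would additionally request a GPU within the per-user limit, e.g. by changing the partition line to “#SBATCH --partition=gpu” and adding “#SBATCH --gres=gpu:1”. Scripts are submitted with “sbatch <script>”, and the queue can be inspected with “squeue -u $USER”.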

Lab-owned servers
Principal Investigators are welcome to integrate their own servers with the cluster so that they can be used by other researchers and managed by the Computational Research Support Specialist. A number of conditions must, however, be agreed on in advance:

  • The server will be integrated into the cluster SLURM system and made available to all cluster users equally (i.e. no users will have priority access).
  • The software environment of the server will be identical to that of the other cluster compute nodes. Specific software packages required by the joining server will be installed across the whole cluster.
  • For data security and cluster consistency purposes, only the Computational Research Support Specialist and the backup administrator will have admin/root access to the server.

Checkpointing
Application-level checkpointing is recommended if it is available with the software package(s) that a job is using. System-level checkpointing is possible but requires manual restarting and root access – this should be used only when absolutely necessary and should be implemented with the help of the Computational Research Support Specialist.
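As a generic illustration of application-level checkpointing (not tied to any particular package), the sketch below saves progress to a file after each unit of work and resumes from that file when the job is restarted. The filename, loop and workload are placeholders only.

    #!/bin/bash
    STATE=checkpoint.txt                   # placeholder checkpoint file
    start=1
    if [ -f "$STATE" ]; then
        start=$(( $(cat "$STATE") + 1 ))   # resume from the step after the last one completed
    fi
    for (( i=start; i<=100000; i++ )); do
        # ... one unit of work goes here ...
        echo "$i" > "$STATE"               # record progress after each completed unit
    done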