The Psychology Computing Cluster at UTSC was established in 2015 using an NSERC Research Tools and Instruments grant and funds from the UTSC Office of the Vice-Principal Academic and Dean. An upgrade of the cluster was made possible by a Departmental Research Fund grant from the UTSC Office of the Vice-Principal Research and Innovation in 2020. The cluster is housed at Information and Instructional Technology Services (IITS), UTSC, and is currently composed of 8 x Dell PowerEdge R430 CPU nodes, 1 x R740 CPU node and 2 x R740 GPU nodes. The cluster has two head nodes: a master head node (neurocomp0) and a backup head node (neurocomp00), both virtual machines provided by IITS. Each R430 compute node has 2 x Intel Xeon E5-2640 v3 2.6GHz processors (8 cores / 16 threads each) and 64GB RAM. The R740 CPU node has 2 x Intel Xeon Platinum 8268 2.9GHz processors (24 cores / 48 threads each) and 384GB RAM. Each R740 GPU node has 2 x Intel Xeon Gold 6238R 2.2GHz processors (28 cores / 56 threads each), 384GB RAM and 1 x NVIDIA Quadro RTX 8000 GPU card with 48GB memory and 4608 CUDA cores. A new R740 GPU server for debugging purposes will be installed in the near future, with 2 x NVIDIA Tesla T4 GPU cards, each with 16GB memory and 2560 CUDA cores. The cluster currently runs Ubuntu Server 18.04 and 20.04, and Slurm 20.02.5. Each compute node has Matlab (latest version 2020a) installed, plus Anaconda 3 and Anaconda 2 with TensorFlow, PyTorch and more. Neuroimaging and computational modelling packages include FSL, SPM, HDDM and more. New storage hardware is currently being set up to provide a total of over 120TB of computing, backup and archive space.
Governance and decision making
Oversight of the cluster is currently provided by the Chair of the Department of Psychology, the Associate Chair Research of the Department of Psychology, and the Computational Research Support Specialist. All Principal Investigator users are consulted on major issues and decisions pertaining to system upgrades, maintenance, ongoing operations, and policies.
Support is provided by the Computational Research Support Specialist (Weijun Gao), whose responsibilities include:
- Overseeing hardware repairs and upgrades.
- Installing and maintaining software.
- Monitoring system performance and usage.
- Providing training (e.g. tutorials and demonstrations on a group or one-to-one basis).
- User account management.
- Troubleshooting and code debugging.
To open a user account, individuals must submit a signed cluster user registration form (available from the technical support website). By default, a user’s username is their UTORid, and authentication is handled by a UTSC authentication server, so users can access the cluster with their UTORid login details. If a user does not have UTORid access at UTSC, a local Linux account will be created instead. Each user account comes with a default storage quota of 500GB. The home quota can be increased on request, and alternative computing storage is available when needed. When an individual no longer requires access to the cluster, their data will be archived (see ‘Data storage and backup’).
User fees are necessary to cover the cost of software licenses (i.e. Matlab) and system maintenance/upgrades. Current fees are $300/year for faculty/post-doc/graduate student accounts and $75/year for undergraduate student accounts. Fees are collected at the start of the academic year in September, following approval by each Principal Investigator; fees for accounts opened at other times during the academic year are prorated. Fees may also be reduced or waived on a case-by-case basis when research funds are a serious concern for a Principal Investigator (e.g. a Principal Investigator is between grants).
Data storage and backup
Computing storage, backup storage and archive storage all use RAID arrays. Incremental backups of computing data are performed weekly, with the three most recent copies kept on a rolling basis; once the new storage hardware is installed, the 12 most recent weekly backups will be kept. Archived data are kept indefinitely and will be mirrored on the new storage system. All users are encouraged to clean up their data regularly, delete superfluous files, and archive data that are no longer needed. Requests for additional computing storage will not be granted until a user can demonstrate that they have outgrown their existing quota and have done their best to clean up their data. All users are asked to document their data and scripts (e.g. for archiving purposes) before leaving the department.
Optimal running of the cluster relies on all users being ‘good citizens’. To facilitate this, a queue system, SLURM, has been implemented, and ALL jobs must be submitted through it to be executed. Users should consult the technical support website for details on how to use SLURM and which partitions (e.g. cpu, interactive) are most appropriate for their jobs.
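As a sketch, a typical CPU job submission might look like the following batch script. The partition name, resource values and script name here are placeholders, not the cluster’s actual settings; consult the technical support website for the correct values.

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis   # a name for the job (placeholder)
#SBATCH --partition=cpu          # partition name is an example
#SBATCH --cpus-per-task=8        # stay well within the per-user core limits
#SBATCH --mem=16G                # memory request (placeholder)
#SBATCH --time=24:00:00          # wall-time limit (placeholder)
#SBATCH --output=%x_%j.out       # log file named after job name and job ID

python my_analysis.py            # hypothetical analysis script
```

The script would be submitted with `sbatch my_job.sh`, and its status checked with `squeue -u $USER`.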
- CPU – during times of high usage, each user may use no more than 32 CPU cores. When system usage is low, a user may use up to 96 CPU cores, but must be prepared to reduce their usage back to 32 cores immediately should system demand increase. In that scenario, checkpointing is highly recommended (see ‘Checkpointing’). Users are strongly encouraged to monitor cluster usage regularly on the technical support website and, if in doubt, to contact the Computational Research Support Specialist.
- GPU – each user may use at most 1 GPU and should not submit more than one GPU job at a time. The GPU partition has a shorter wall-time. Once the new GPU server for debugging is available, GPU users should use it only for debugging purposes. Users can submit CPU-only jobs to the GPU partition; however, a shell/Perl script will be developed to monitor GPU partition usage and adjust job priorities for fairness and resource-usage efficiency. For example, a job owner will receive warning emails if their job holds a GPU card that has been idle for over two hours.
- Resource allocation limits will be adjusted according to available resource changes and computing need changes in the future.
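Under the limits above, a GPU job script might look like the following sketch. The partition name, wall-time and script name are assumptions for illustration only; the actual partition names and limits are listed on the technical support website.

```shell
#!/bin/bash
#SBATCH --job-name=train_model   # a name for the job (placeholder)
#SBATCH --partition=gpu          # GPU partition name is an assumption
#SBATCH --gres=gpu:1             # request 1 GPU -- the per-user maximum
#SBATCH --cpus-per-task=8        # CPU cores for data loading (placeholder)
#SBATCH --mem=64G                # memory request (placeholder)
#SBATCH --time=12:00:00          # GPU partition has a shorter wall-time

python train.py                  # hypothetical training script
```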
Principal Investigators are welcome to integrate their own servers into the cluster so that they can be used by other researchers and managed by the Computational Research Support Specialist. A number of conditions must, however, be agreed on in advance:
- The server will be integrated into the cluster SLURM system and made available to all cluster users equally (i.e. no users will have priority access).
- The software environment of the server will be identical to that of the other cluster compute nodes. Specific software packages required by the joining server will be installed across the whole cluster.
- For data security and cluster consistency purposes, only the Computational Research Support Specialist and the backup administrator will have admin/root access to the server.
Checkpointing
Application-level checkpointing is recommended if it is available in the software package(s) that a job is using. System-level checkpointing is possible but requires manual restarting and root access; it should be used only when absolutely necessary and should be implemented with the help of the Computational Research Support Specialist.
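As a minimal illustration of application-level checkpointing, a long-running job can record its progress to a file after each completed step, so that a requeued or interrupted job resumes where it left off rather than restarting. The file name and step count below are illustrative only.

```shell
#!/bin/bash
# Minimal checkpointing sketch: the last completed step is written to a
# checkpoint file after each unit of work.
CKPT="checkpoint.txt"   # hypothetical checkpoint file name

# Resume from the step after the last recorded one, if a checkpoint exists.
start=1
if [ -f "$CKPT" ]; then
    start=$(( $(cat "$CKPT") + 1 ))
fi

for i in $(seq "$start" 10); do
    # ... one unit of real work would go here ...
    echo "$i" > "$CKPT"   # record progress after each completed step
done

echo "finished at step $(cat "$CKPT")"
```

If the job is killed mid-run, resubmitting the same script picks up at the step after the one last recorded in the checkpoint file.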