Here, I have compiled posts that are useful for students and researchers at the Montreal Neurological Institute-Hospital (The Neuro) who often use DRAC (Digital Research Alliance of Canada) resources.

The most recent posts are:

Split‑pipe

split-pipe is a bioinformatics pipeline developed by Parse Biosciences for processing single-cell RNA sequencing data generated with their SPLiT-seq technology. The pipeline takes raw FASTQ files and produces a cell-by-gene count matrix, which can be used as input for popular open-source analysis tools such as Scanpy and Seurat. It enables streamlined, efficient processing of sequencing results from start to finish.
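As a rough sketch, a split-pipe run can be wrapped in a SLURM job script along these lines. The flag names (--mode, --fq1, --fq2, --genome_dir, --output_dir), resource numbers, and all paths below are assumptions based on common usage; check `split-pipe --help` and the Parse Biosciences documentation for your installed version:

```shell
# Write a hypothetical split-pipe job script (flags and paths are placeholders;
# verify against your installed split-pipe version before submitting).
cat > splitpipe_job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=def-sponsor   # placeholder: your sponsor's allocation
#SBATCH --time=12:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
split-pipe --mode all \
    --fq1 sample_R1.fastq.gz \
    --fq2 sample_R2.fastq.gz \
    --genome_dir ref/genome_dir \
    --output_dir results/sample1
EOF
echo "wrote splitpipe_job.sh"
```

On the cluster you would then submit it with `sbatch splitpipe_job.sh`.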

Submitting jobs

Please do not run any computations on the login node. Login node resources (RAM/CPU) are intended ONLY for tasks such as editing scripts and downloading/transferring data. To learn how to properly submit jobs, please refer to the following guide: Digital Research Alliance of Canada. Using the login node for computations can result in a warning from DRAC, and repeated violations can lead to account suspension.
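As a minimal sketch, a batch job looks like the following; the account name, module version and analysis script are placeholders that you must replace with your own:

```shell
# Create a minimal SLURM job script (all names below are placeholders).
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=def-sponsor      # placeholder: your sponsor's allocation
#SBATCH --time=02:00:00            # walltime
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
module load python/3.11            # available versions vary by cluster
python my_analysis.py              # placeholder for your actual workload
EOF
echo "wrote my_job.sh"
```

On the cluster, submit it with `sbatch my_job.sh` and monitor it with `squeue -u $USER`; the script itself runs on a compute node, so the login node does no heavy work.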

Optimal CPU/RAM

Since we are using shared resources, please avoid submitting many jobs that request excessive RAM/CPU (if your job requested 30 GB of RAM and only used 10 GB, DRAC adds the unused 20 GB to your account as wasted resources 😞). DRAC may hold your jobs in the pending state if your wasted-resource total is too high. If you notice your jobs pending far longer than normal on Beluga (our main resource), consider canceling them and resubmitting on another DRAC cluster such as Narval or Cedar.

If you're writing code in Python or R, run it in multithreaded mode to improve performance and make full use of the CPUs and RAM you requested. Many pipelines also support multithreading, so be sure to enable it. I often use the command "seff JOBID" to check how much RAM and CPU a finished job actually used; this helps me request appropriate resources for future runs. When I'm unsure about a pipeline's resource usage, I run a small test job first to estimate its RAM needs, then adjust the requests accordingly.

Requesting 4 CPUs and 16 GB of RAM is typically a good starting point, and Beluga usually allocates these resources quickly. However, if you request 16 GB of RAM for 4 days and your job uses only 20% of it, you're effectively wasting 80% of the allocation. Don't worry if your job runs for only a short time; that's perfectly fine. The real concern is submitting many jobs at once, each reserving resources for several days: if those jobs waste a large amount of CPU or RAM, DRAC may deprioritize your future jobs for inefficient resource usage.
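As a small sketch of matching a program's thread count to the CPUs SLURM actually granted: the OMP_NUM_THREADS convention shown here is honored by many, but not all, multithreaded tools, so also check whether your pipeline has its own thread flag.

```shell
# Inside a job, SLURM sets $SLURM_CPUS_PER_TASK to the CPUs you requested;
# outside SLURM the variable is unset, so fall back to a single thread.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "using $OMP_NUM_THREADS thread(s)"
```

After the job finishes, "seff JOBID" reports how efficiently those CPUs and the requested memory were actually used, which tells you what to request next time.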

Managing Account in DRAC

How to create an account

  1. Register for an account with the Digital Research Alliance of Canada (DRAC).
  2. For sponsored roles, you will need your supervisor’s CCI. The sponsor will need to confirm the request before you can access the resources.

  3. The information to fill in the form should look as follows:

    • Institution: Calcul Québec: McGill University
    • Department: Human Genetics (see NOTE below)
    • Position: Master’s Student (choose the appropriate option)
    • Sponsor: supervisor’s CCI
    • Make this role primary? Yes
    • Disable old roles? No

NOTE: your department affiliation can also be Neurology, Medicine, etc. If you’re unsure, put “Neurology”.

  1. You can add new roles to an existing account by logging in to your account, then going to My Account > Apply for a New Role. Enter the required information as listed above, and enter the new CCI.

  2. Wait until you receive email confirmation that your account is active. If you are unsure, send a message to neurobioinfo@mcgill.ca.

  3. Log in to your DRAC account to find your username

  4. Your username appears at the top of the page: “Account for First_Name Last_Name (CCI: abc-123, Username: user)”

  5. The address for a particular resource, along with its specs, can be found on the DRAC Wiki (left menu, under ‘Resources’).

  6. Beluga: beluga.computecanada.ca

  7. To access the resources, you will need an SSH client.

  8. Windows: Download PuTTY. Insert the relevant info into the GUI.

    1. Hostname: beluga.computecanada.ca
    2. Port: 22
  9. Linux / Mac:

    1. Open a terminal (Ctrl + Alt + T, or ⌘ + T).

    2. ssh username@beluga.computecanada.ca

  10. As of April 2024, all DRAC (i.e. Compute Canada) servers require multifactor authentication. See the official documentation for how to set that up.

AFTER YOUR ACCOUNT HAS BEEN ACTIVATED

  1. Default user profiles, software environment and common links:

  2. Common links:

    1. We have created links to several key shared directories, containing such things as common software, data and analysis directories.
    2. You can copy them to your own home directory for ease of access.
  3. Default user profiles:

    1. We have created several default configuration files. These are used to set environment variables, create common aliases and functions, and set other parameters which will help you work more safely and easily in our shared software environment.
    2. You can copy them to your own home directory and they will be loaded automatically at each subsequent login.
    3. Pro tip: you can create your own environment variables, command aliases and custom functions by adding them to your Linux shell config files, e.g. ~/.bash_profile or ~/.bashrc
  4. Default software environment:

    1. We have installed a large number of libraries and standalone programs, as well as modules and packages for standard languages such as R, Python and Perl.
    2. If you copy the configuration files and links from the previous steps, you should be able to access all the common software, starting from the ~/soft (/home/$USER/soft) directory
  5. Simple command to copy all default config files and links:

    ```{verbatim}
    cp -a /lustre03/project/6001220/COMMON/{.[a-z]*,*} ~/
    ```

    Run this command in your terminal, after you have connected to Beluga.
  6. Common DOs and DON’Ts

  7. DO

    1. Create your own subfolder in the shared ~/runs folder

      1. Use this folder to store your commands, logs and analysis final results
      2. It will make it easier later for others to access your analyses, while ensuring that they are safe from accidental deletion
    2. Store large temporary input files for your analyses in your $SCRATCH folder (/scratch/$USER)

      1. Our group has limited /project space; it should be used for semi-permanent storage only, i.e. the duration of a project
    3. Copy your raw data from its source on the /nearline filesystem, to your destination on the /project or /scratch (preferred) filesystems

  8. DON’T

    1. Use the /nearline filesystem for any file operations, except as a source from which to copy raw data
      1. Do not use /nearline files directly as inputs for any program. This filesystem is intended only for long-term data archiving
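A minimal sketch of that layout, using illustrative stand-in paths (on the cluster, $SCRATCH is set for you, the shared runs folder lives under /project, and the /nearline source path depends on your group):

```shell
# Illustrative stand-in paths; on the cluster use ~/runs/$USER and $SCRATCH.
SCRATCH="${SCRATCH:-/tmp/scratch_demo}"    # set automatically on DRAC clusters
mkdir -p "$SCRATCH/my_project"             # large inputs/intermediates go here
mkdir -p /tmp/runs_demo/my_project         # stand-in for your ~/runs subfolder
# Stage raw data from /nearline onto /scratch before analyzing it:
# cp -r /nearline/<your-group>/raw/sample1 "$SCRATCH/my_project/"
echo "workdir: $SCRATCH/my_project"
```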

Some guidelines on the use of shared space

1- /project space is for medium-term storage of final results, or key files for long-term ongoing projects. Your raw inputs and intermediary files should all be on the /scratch filesystem, where they will be automatically deleted after 60 days of disuse. All your analyses should be run on /scratch.
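To spot files approaching that 60-day threshold, something like the following works (shown on a throwaway stand-in directory; the purge criterion is based on file age/last access, so treat -atime as an approximation):

```shell
demo="/tmp/scratch_standin"                      # stand-in for $SCRATCH
mkdir -p "$demo"
touch -a -d '90 days ago' "$demo/old_file.txt"   # simulate a stale file
find "$demo" -type f -atime +60                  # files not accessed in 60+ days
```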

2- For the files you will store on /project, please create your own subfolder in the shared ~/runs folder (absolute path: /project/rrg-grouleau-ac/COMMON/runs). Use this folder to store your commands, logs and analysis final results. It will make it easier later for others to access your analyses, while ensuring that they are safe from accidental deletion.

3- The command diskusage_explorer is a great way to search through your files to find what’s using the most space. If your /project data is stored in ~/runs as suggested in point #2, try the following command: diskusage_explorer /project/rrg-grouleau-ac/COMMON/runs/[your folder]. This will sort your folders by size with the largest on top; you can interactively navigate to quickly see what is using so much space.
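diskusage_explorer is DRAC-specific; if you just want a quick largest-first listing, standard tools give a similar (non-interactive) view. Demonstrated here on a throwaway directory; on the cluster, point du at your ~/runs subfolder:

```shell
# Build a small demo tree, then list subfolders by size, biggest first.
mkdir -p /tmp/du_demo/small /tmp/du_demo/big
head -c 1024   /dev/zero > /tmp/du_demo/small/file.bin
head -c 524288 /dev/zero > /tmp/du_demo/big/file.bin
du -sk /tmp/du_demo/* | sort -rn | head   # sizes in KB, largest on top
```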

4- For any projects that have already been completed/published, there should be virtually nothing left in /project space, and certainly no large files. If your publication mandates data sharing, make sure that the files to share (a) have already been deposited in whatever online repository is required, and/or (b) are all archived on /nearline. If a project is complete, consider compressing all long-term files into an indexed archive file. Not only will this save /project space, but it will also make the project easy to archive to /nearline and restore should someone want it later.
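A sketch of that archiving step, using a throwaway demo directory (on the cluster you would point tar at your finished project folder instead):

```shell
# Bundle a "finished project" into one compressed archive.
mkdir -p /tmp/proj_done
echo "final results" > /tmp/proj_done/results.txt
tar -czf /tmp/proj_done.tar.gz -C /tmp proj_done   # one file: easy to move to /nearline
tar -tzf /tmp/proj_done.tar.gz                     # list contents to verify
```

A single large archive is also friendlier to /nearline (which prefers few large files over many small ones) and trivial to restore later with `tar -xzf`.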

5- When cleaning up your /project space, don't just blindly copy all your files en masse to /nearline; we're very limited in space there too!

6- Often the biggest space-eaters are alignments (bam/cram files) and raw read data (fastq/bam). There is almost never a good reason to keep these files on /project - if you’re analyzing them, they should be on /scratch. If you’re done with them, they should either be archived on /nearline, or just plain deleted!

7- If you have any large files you want to keep, make sure they are in a compressed format; e.g., any large VCF files you are using should absolutely be compressed (gzip/bzip2, etc.).
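A quick demo on a tiny synthetic VCF. Note that for real VCFs, bgzip plus tabix (from htslib) is preferable to plain gzip, because downstream tools can then query regions directly from the compressed file:

```shell
# Write a minimal two-line VCF header, then compress it.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > /tmp/demo.vcf
gzip -f /tmp/demo.vcf                      # produces /tmp/demo.vcf.gz
# bgzip /tmp/demo.vcf && tabix -p vcf /tmp/demo.vcf.gz   # preferred on the cluster
gzip -cd /tmp/demo.vcf.gz | head -1        # readable without decompressing to disk
```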

8- Any large intermediate files should be deleted. For example, if you're running plink/SKAT and you have tons of genotype or VCF files from intermediate cleanup and imputation steps, delete them! In fact, intermediate files should only exist on /scratch, because that’s where you should be running your analyses!

Hail

Hail is an open-source, scalable Python library for exploring and analyzing genomic data, built on top of Apache Spark. For more detail, refer to its tutorial.

Login Issues

If you're unable to log in to DRAC, or the session gets stuck after you enter your credentials, first check the status page for any reported incidents. If none are reported, the issue may be due to system slowness, which DRAC may be able to resolve. If you are still in the terminal, press Ctrl+C to interrupt, then try logging in again and performing basic operations; the system may simply be slow.