Here, I have compiled posts that are useful for students and researchers at the Montreal Neurological Institute-Hospital (Neuro) who often use DRAC.
Most recent posts are:
Conda is not recommended by DRAC (https://docs.alliancecan.ca/wiki/Anaconda/en).
split-pipe is a bioinformatics pipeline developed by Parse Biosciences for processing single-cell RNA sequencing data generated using their Split-seq technology. The pipeline takes raw FASTQ files and produces processed data in the form of a cell-by-gene count matrix, which can be used as input for popular open-source analysis tools such as Scanpy and Seurat. It enables streamlined and efficient processing of sequencing results from start to finish.
Please do not run any computations on the login node. The login node's resources (RAM/CPU) are intended ONLY for tasks such as editing scripts and downloading/transferring data. To learn how to properly submit jobs, please refer to the following guide: Digital Research Alliance of Canada. Using the login node for computations can result in a warning from DRAC, and repeated violations lead to account suspension.
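To illustrate what "properly submitting a job" looks like, here is a minimal Slurm submission-script sketch. The account name, resource amounts, and script name are placeholders, not our group's actual settings; adjust them to your own allocation.

```shell
#!/bin/bash
#SBATCH --account=def-yourpi       # placeholder: your allocation account
#SBATCH --time=01:00:00            # walltime (HH:MM:SS)
#SBATCH --cpus-per-task=4          # CPUs for this job
#SBATCH --mem=16G                  # total RAM for this job
#SBATCH --job-name=example_job

# Your actual computation goes here, e.g.:
python my_analysis.py
```

Save this as (for example) job.sh, submit it with `sbatch job.sh`, and monitor it with `squeue -u $USER`; the computation then runs on a compute node instead of the login node.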
Since we are using shared resources, please avoid submitting too many jobs with excessive RAM/CPU requests (if your job requested 30 GB of RAM and only used 10 GB, DRAC counts the unused 20 GB as wasted resources against your account). DRAC may hold your jobs in the pending state if your wasted-resource usage is too high. If you notice your jobs pending much longer than normal on Beluga (our main resource), consider canceling them and resubmitting on other DRAC resources such as Narval or Cedar.
If you're writing code in Python or R, run it in multithreaded mode to improve performance and make use of all the CPU and RAM you requested. I often use the command "seff JOBID" to check how much RAM and CPU a job actually used; this helps me request appropriate resources for future runs. Many pipelines support multithreading, so make sure to enable it to fully utilize the available resources. When I'm unsure about a pipeline's resource usage, I run a small test job first to estimate the RAM usage, then adjust my resource requests accordingly.

Requesting 4 CPUs and 16 GB of RAM is typically a good starting point, and Beluga usually allocates these resources quickly. However, if you request 16 GB of RAM for 4 days and your job only uses 20% of it, you're effectively wasting 80% of the allocated resources. Don't worry if your job runs for only a short time; that's perfectly fine. The real concern is submitting many jobs at once, each reserving resources for several days: if those jobs end up wasting a large amount of CPU or RAM, DRAC may deprioritize your future jobs due to inefficient resource usage.
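As a concrete illustration of using all allocated CPUs from Python, the sketch below spreads a CPU-bound function across the cores Slurm assigned to the job (falling back to all local CPUs when SLURM_CPUS_PER_TASK is unset). The work function is a made-up stand-in, not a real pipeline step.

```python
import os
from multiprocessing import Pool

def heavy_task(n):
    """Stand-in for one CPU-bound unit of work."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Use only the CPUs Slurm allocated to this job, not the whole node.
    n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count()))
    with Pool(processes=n_cpus) as pool:
        results = pool.map(heavy_task, [100_000] * 8)
    print(f"finished {len(results)} work units on {n_cpus} CPUs")
```

After a run like this finishes, `seff JOBID` will show how close the actual CPU and memory usage came to what you requested.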
To use the Ensembl Variant Effect Predictor (VEP) on Beluga, follow the steps outlined below:
For sponsored roles, you will need your supervisor’s CCI. The sponsor will need to confirm the request before you can access the resources.
The information to fill in the form should look as follows:
NOTE: your department affiliation can also be Neurology, Medicine, etc. If you're unsure, put "Neurology".
You can add new roles to an existing account by logging in to your account, then going to My Account > Apply for a New Role. Enter the required information as listed above, and enter the new CCI.
Wait until you receive email confirmation that your account is active. If you are unsure, send a message to neurobioinfo@mcgill.ca
Log in to your DRAC account to find your username
Top of page: “Account for First_Name Last_Name (CCI: abc-123, Username: user)”
The address for a particular resource, along with its specs, can be found on the DRAC Wiki (left menu, under ‘Resources’).
Beluga: beluga.computecanada.ca
To access the resources, you will need an SSH client.
Windows: Download PuTTY. Insert the relevant info into the GUI.
Linux / Mac:
Open a terminal (Ctrl + Alt + T, or ⌘ + T).
ssh <username>@beluga.computecanada.ca
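To avoid retyping the full hostname every time, you can add an entry to ~/.ssh/config (the username below is a placeholder; replace it with your own DRAC username):

```
# ~/.ssh/config
Host beluga
    HostName beluga.computecanada.ca
    User your_username
```

With this in place, `ssh beluga` is enough to connect, and tools like scp and rsync will also recognize the short name.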
As of April 2024, all DRAC (i.e. Compute Canada) servers require multifactor authentication. See the official documentation for how to set that up.
Default user profiles, software environment and common links:
Common links:
Default user profiles:
Default software environment:
Simple command to copy all default config files and links
Common DOs and DON’Ts
DO
Create your own subfolder in the shared ~/runs folder
Store large temporary input files for your analyses in your $SCRATCH folder (/scratch/$USER)
Copy your raw data from its source on the /nearline filesystem, to your destination on the /project or /scratch (preferred) filesystems
DON’T
Use the /nearline filesystem for any file operations, except as a source from which to copy raw data
Use /nearline files directly as inputs for any program. This filesystem is intended only for long-term data archiving.

1- /project space is for medium-term storage of final results, or key files for long-term ongoing projects. Your raw inputs and intermediary files should all be on the /scratch filesystem, where they will be automatically deleted after 60 days of disuse. All your analyses should be run on /scratch.
2- For the files you will store on /project, please create your own subfolder in the shared ~/runs folder (absolute path: /project/rrg-grouleau-ac/COMMON/runs). Use this folder to store your commands, logs and analysis final results. It will make it easier later for others to access your analyses, while ensuring that they are safe from accidental deletion.
3- The command diskusage_explorer is a great way to search through your files to find what's using the most space. If your /project data is stored in ~/runs as suggested in point #2, try the following command: diskusage_explorer /project/rrg-grouleau-ac/COMMON/runs/[your folder]. This will sort your folders by size with the largest on top; you can navigate interactively to quickly see what is using so much space.
4- For any projects which have already been completed/published, there should be virtually nothing left in /project space, and certainly no large files. If your publication mandates data sharing, make sure that the files to share a) have already been deposited in whatever online repository is required, and/or b) are all archived on /nearline. If a project is complete, consider compressing all long-term files into an indexed archive file. Not only will this save /project space, but it will also make it easy to archive to /nearline and restore should someone want it later.
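As a small sketch of the "compress and keep a listing" idea, the commands below bundle a directory into a compressed archive and save a plain-text contents listing next to it. The file and directory names are made up for illustration.

```shell
# Create a stand-in project directory (illustrative only).
mkdir -p demo_project
echo "final result" > demo_project/results.txt

# Bundle the finished project into one compressed archive.
tar -czf demo_project.tar.gz demo_project

# Keep a plain-text listing ("index") next to the archive, so others
# can see what is inside without extracting it.
tar -tzf demo_project.tar.gz > demo_project.contents.txt
```

One compressed archive plus its listing is far friendlier to /nearline than thousands of small files, and the listing makes later restores targeted rather than all-or-nothing.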
5- When cleaning up your /project space, don't just blindly copy all your files en masse to /nearline; we're very limited in space there too!
6- Often the biggest space-eaters are alignments (bam/cram files) and raw read data (fastq/bam). There is almost never a good reason to keep these files on /project - if you’re analyzing them, they should be on /scratch. If you’re done with them, they should either be archived on /nearline, or just plain deleted!
7- If you have any large files you want to keep, make sure they are in a compressed format, i.e. any large VCF files you are using should absolutely be compressed (gzip/bzip2, etc.).
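For example, plain gzip is shown below on a tiny stand-in VCF (note that many genomics tools prefer bgzip, from htslib, so the compressed VCF can then be indexed with tabix; which one you need depends on your downstream tools):

```shell
# Create a tiny stand-in VCF header (illustrative only) and compress it.
# gzip replaces example.vcf with example.vcf.gz in place.
echo "##fileformat=VCFv4.2" > example.vcf
gzip example.vcf

# Many tools read gzipped VCFs directly; to peek inside without
# decompressing to disk:
gzip -dc example.vcf.gz | head
```

For real VCFs the space savings are substantial, since the format is highly repetitive text.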
8- Any large intermediate files should be deleted. For example, if you're running plink/SKAT and you have tons of genotype or VCF files from intermediate cleanup and imputation steps, delete them! In fact, intermediate files should only exist on /scratch, because that's where you should be running your analyses!
Hail is an open-source, scalable framework for exploring and analyzing genomic data, available as a Python module built on top of Apache Spark. For more detail, refer to its tutorial.
I want to share my experience of using HPC as my day-to-day server. For all my needs, you need 1- an editor and 2- a cloud storage browser; I use VS Code and Cyberduck.
If you're unable to log in to DRAC, or if it gets stuck after you enter your credentials, first check the status page for any reported incidents. If none are reported, the issue may be due to system slowness, which DRAC may be able to resolve. If you are still in the terminal, press Ctrl+C to interrupt, then try logging in again and performing basic operations, as the system may simply be slow.