Convert Illumina IDAT files to VCF/PED




Installation


See how to install the genemapgwas workflow here.

After installing the workflow, navigate to the illumina folder.

cd genemapgwas-0.1/illumina/

Edit the Configuration file


In the illumina folder is a master configuration file ‘nextflow.config’ where all your input options will be specified.

Only edit the lines displayed below.

params {

  // data-related parameters
  idat_dir          = ""
  manifest_bpm      = ""
  manifest_csv      = ""
  cluster_file      = ""
  fasta_ref         = ""
  bam_alignment     = ""                          // if processing in build 38 (hg38), provide a bam alignment file
  output_prefix     = ""
  output_dir        = ""


  // computing resource-related parameters
  account           = ""                          // if using SLURM or PBSPRO, provide your cluster project or account name
  partition         = ""                          // if using SLURM or PBSPRO, specify the queue or partition name
  njobs             = 5                           // number of jobs to submit at a time
  containers_dir    = ""                          // if using singularity containers, provide a path where the images will be stored.
                                                  // NB: The path should have sufficient storage capacity


  // IDAT to GTC: requires fewer resources
  idat_threads      = 4                           // number of cpus/threads per job
  idat_max_time     = 1.h                         // maximum time each job is allowed to run (h - hours; m - minutes)
  idat_max_memory   = 2.GB                        // maximum memory each job is allowed to use (GB - gigabyte; MB - megabyte)


  // GTC to VCF: requires more resources
  gtc_threads       = 10                          // number of cpus/threads per job
  gtc_max_time      = 3.h                         // maximum time each job is allowed to run (h - hours; m - minutes)
  gtc_max_memory    = 3.GB                        // maximum memory each job is allowed to use (GB - gigabyte; MB - megabyte)
}

Config parameters

Let’s go through the parameters one by one.


idat_dir:

  • The Illumina gencall algorithm needs you to provide a folder that contains pairs of green and red IDAT files (see the example folder 034928383 below).
  • You may have multiple folders, each containing a different number of IDAT pairs (the example below shows one IDAT folder containing two pairs of IDAT files).
  • The folders are contained within a folder called intensities. This is the path we provide to the workflow (the absolute path).
  • The workflow will grab all the sub-folders containing the IDAT files and pass them to gencall, enabling batch processing. A quick way to check that each sub-folder holds matching pairs is sketched after the tree below.
.
├── intensities
│   ├── 034928383
│   │   ├── file1.grn.idat
│   │   ├── file1.red.idat
│   │   ├── file2.grn.idat
│   │   └── file2.red.idat
│   ├── 236572634
│   ├── 562525763
│   └── 748974817
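
As a quick sanity check before launching the workflow, you can count the green and red IDAT files in each sub-folder. This is a minimal sketch assuming the naming shown above; adjust the intensities path and the file patterns to match your own data.

# count green and red IDAT files per sub-folder (hypothetical path)
for d in /path/to/intensities/*/; do
  grn=$(find "$d" -iname "*grn*.idat" | wc -l)
  red=$(find "$d" -iname "*red*.idat" | wc -l)
  echo "$d: $grn green, $red red"
done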

manifest_bpm, manifest_csv and cluster_file:

  • These files are specific to the chip that was used to genotype your samples and should be provided by the genotyping company.
  • Request them if you don’t have them.
  • For the H3Africa chip, you can get these files from here.
  • NB: The cluster file can be custom-made. See this document for more information.

fasta_ref and bam_alignment:

  • For an hg19 dataset, only the reference fasta file is required.
  • The gtc2vcf tool also allows calls generated in hg19 to be processed in hg38. This requires realigning the flanking sequences from the manifest files. This is described here.
  • NB: Provide these files with their absolute paths (see also the indexing note below).
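
The conversion step generally expects the reference fasta to be indexed. If you hit errors about a missing .fai file, you can create the index yourself, assuming samtools is available on your system (the path below is only an illustration):

# build a .fai index next to the reference fasta
samtools faidx /path/to/reference.fa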

output_prefix: Output file name

output_dir: Path to save output


If you are working on a SLURM or PBSPRO cluster, you might be associated with a specific group.

account:

  • If you are required to specify an account name (slurm) or project name (pbspro) in your sbatch or qsub job submission scripts, provide it here.

partition:

  • If you are required to select a specific partition (queue) when requesting resources, provide it here.

If you are not required to specify any of these in your job submission scripts, leave these parameters empty.
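
If you are unsure which account or partition to use, your cluster usually exposes this information. The commands below are standard SLURM utilities and are only a sketch; PBSPRO sites have their own equivalents, and the exact output depends on how your site is configured.

# list the accounts you are associated with (SLURM)
sacctmgr show assoc user=$USER format=Account%20

# list the partitions available to you (SLURM)
sinfo -s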


njobs:

  • Number of jobs (processes; e.g. each IDAT file pair) to submit to the cluster at a time.
  • Some accounts have limits on the number of jobs they can submit. It is best to set a value that does not exceed this limit (a quick check is sketched below).
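
On a SLURM cluster, something like this shows how many jobs you currently have queued or running (ask your administrators for the actual limit):

# count your currently queued and running jobs (SLURM)
squeue -u "$USER" -h | wc -l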

containers_dir:

  • Path where the docker images (e.g. the gencall image) pulled from our dockerhub will be stored.
  • This path should have sufficient storage capacity (a quick check is sketched below).
    • The gencall image is about 530 MB in size.
    • The plink2 test image is about 80 MB.
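
Before pointing containers_dir at a path, it is worth confirming that the filesystem has room to spare, for example:

# check free space on the filesystem holding your containers directory (hypothetical path)
df -h /path/to/containers_dir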

idat_threads, idat_max_time and idat_max_memory:

  • These resources will be used when converting from IDAT to GTC with gencall.
  • From experience, this step works pretty well with minimal resources.
  • Try it with the default ones, and only edit if you are unsatisfied with the outcome.
  • Requesting minimal resources allows your jobs to fit into the gaps of the scheduler and get submitted well ahead of other jobs that request massive resources.

gtc_threads, gtc_max_time and gtc_max_memory:

  • The GTC to VCF step is only one job, processed using bcftools.
  • From experience, it requires a little more resources.
  • Again, try the defaults first (see the sketch below for checking what your partition offers).
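
On SLURM, sinfo can print the CPUs, memory and time limit of each partition, which helps when deciding whether to raise these values (PBSPRO users should consult their site documentation instead). This is only a sketch:

# show partition name, CPUs per node, memory per node (MB) and time limit (SLURM)
sinfo -o "%P %c %m %l"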

Running the workflow


The workflow is built around the concept of nextflow profiles.

There are three profile categories:

  • executors: there are three executors based on where you are working
    • local: this can be used anywhere; your computer (laptop) or any cluster (slurm, pbspro, etc).
    • slurm: cluster running a slurm job workload manager/scheduler.
    • pbspro: cluster running a pbspro job workload manager/scheduler.
  • containers: there are three containers
    • apptainer: formerly known as singularity. Some clusters might not run it yet.
    • singularity: now renamed apptainer. Most clusters still run it.
    • docker: most clusters do not run docker for security reasons. It can be used on local computers.
  • references: there are two references
    • hg19: this requires only the fasta file be specified.
    • hg38: this requires a fasta and a bam alignment file (see above).

The workflow command line is built as follows.

nextflow run <workflow script> -profile <executor>,<container>,<reference> -w <work directory>

Nextflow can generate many large intermediate files. So it is important to specify an appropriate work directory with the -w option.

Nextflow needs these intermediate files to resume your job in case the workflow terminates before completing. So only delete the work directory when you are sure the workflow has completed and you are satisfied with all the outputs. Otherwise, rerunning it will start from scratch.
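
Putting it all together, a typical invocation on a slurm cluster using singularity and an hg19 reference would look something like the line below. The work directory path is only an illustration; point it at a scratch area with plenty of space.

nextflow run idat2vcf.nf -profile slurm,singularity,hg19 -w /path/to/scratch/work/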


Test


This example will pull and use a plink2 image (which is lightweight) using singularity.

First, make sure nextflow and singularity are loaded. Check how these are loaded on your system.

module load nextflow

module load singularity


Running with all three profile categories

nextflow run test.nf -profile test,local,singularity,hg19 -w "./work/"

NB: For the test, we add a test profile before the three categories we mentioned above.

If all goes well, you should see a message like this

N E X T F L O W  ~  version 21.10.6
Launching `test.nf` [jolly_jennings] - revision: c99dfcf444

        IDAT to VCF: TEST

idat_dir        = YOUR IDAT PARENT DIRECTORY
manifest_bpm    = CHIP-SPECIFIC BPM MANIFEST
manifest_csv    = CHIP-SPECIFIC CSV MANIFEST
cluster_file    = YOUR CLUSTER FILE
fasta_ref       = HUMAN REFERENCE FASTA
output_prefix   = YOUR OUTPUT FILE NAME PREFIX
output_dir      = PATH WHERE YOUR OUTPUT IS STORED
containers_dir  = PATH WHERE CONTAINERS ARE STORED
account         = humgen        # CHANGE TO YOURS
partition       = sadacc-short      # CHANGE TO YOURS

executor >  local (1)
[5f/ddaf29] process > plink (processing ... YOUR IDAT PARENT DIRECTORY) [100%] 1 of 1 ✔
PLINK v2.00a4.7LM 64-bit Intel (12 Sep 2023)   www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3

--pedmap <prefix>  : Specify .ped + .map filename prefix.
--ped <filename>   : Specify full name of .ped file.
--map <filename>   : Specify full name of .map file.

Workflow completed at: 2023-09-16T13:35:26.858+02:00
     Execution status: OK

You might see a few warnings:

  • regarding echo and debug. These are caused by different versions of nextflow and do not pose any issues.
  • regarding singularity cache directory. As long as you set a value for containers_dir in your nextflow.config file, it should be no problem.

If the containers directory is not set, the workflow will create one in your work directory.


Running without selecting an executor

nextflow run test.nf -profile test,singularity,hg19 -w "./work/"

The local executor will be used by default.

The output should be similar to the one above. However, this time the plink2 image will not be re-downloaded since it is already present in the containers directory we set.

Running on a cluster without selecting a container

# first load plink2 on your cluster

# on the UCT HPC cluster, it is
module load software/plink-2.00a

# then run
nextflow run test.nf -profile test,slurm,hg19 -w "./work/"
N E X T F L O W  ~  version 21.10.6
Launching `test.nf` [astonishing_solvay] - revision: c99dfcf444

        IDAT to VCF: TEST

idat_dir        = YOUR IDAT PARENT DIRECTORY
manifest_bpm    = CHIP-SPECIFIC BPM MANIFEST
manifest_csv    = CHIP-SPECIFIC CSV MANIFEST
cluster_file    = YOUR CLUSTER FILE
fasta_ref       = HUMAN REFERENCE FASTA
output_prefix   = YOUR OUTPUT FILE NAME PREFIX
output_dir      = PATH WHERE YOUR OUTPUT IS STORED
containers_dir  = PATH WHERE CONTAINERS ARE STORED
account         = humgen        # CHANGE TO YOURS
partition       = sadacc-short      # CHANGE TO YOURS

executor >  slurm (1)
[3d/e89581] process > plink (processing ... YOUR IDAT PARENT DIRECTORY) [100%] 1 of 1 ✔
PLINK v2.00a3.6LM AVX2 Intel (14 Aug 2022)     www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3

--pedmap <prefix>  : Specify .ped + .map filename prefix.
--ped <filename>   : Specify full name of .ped file.
--map <filename>   : Specify full name of .map file.

Workflow completed at: 2023-09-16T13:36:57.528+02:00
     Execution status: OK

You can put these in your job submission script. See below.

  • If submitting your job to a workload manager like slurm, you must specify your account and partition in the nextflow.config file.

  • But if you have already requested an interactive job with at least as many resources as you specified in your config file, then either select the local executor or do not select an executor at all.

Example on the UCT HPC cluster: test.sbatch

#!/usr/bin/env bash
#SBATCH --account humgen
#SBATCH --partition sadacc-short
#SBATCH --nodes 1
#SBATCH --time 00:05:00
#SBATCH --ntasks 1

module load software/plink-2.00a

nextflow run test.nf -profile test,slurm,hg19 -w "./work/"
  • In my sbatch script, I only requested 1 thread/cpu (--ntasks 1) because I am only running one nextflow command line.

  • This will ensure that the job is submitted quickly, allowing nextflow to request the rest of the resources for each process.

  • We have also set a short time limit (5 minutes, in HH:MM:SS) because this is a test. When running the actual jobs, it is best to allow enough time (more than the maximum time requested in the config file) to avoid premature termination of the jobs.

Using the hg38 reference

command:

nextflow run test.nf -profile test,slurm,hg38 -w "./work/"

output:

N E X T F L O W  ~  version 21.10.6
Launching `test.nf` [loving_elion] - revision: c99dfcf444

        IDAT to VCF: TEST

idat_dir        = YOUR IDAT PARENT DIRECTORY
manifest_bpm    = CHIP-SPECIFIC BPM MANIFEST
manifest_csv    = CHIP-SPECIFIC CSV MANIFEST
cluster_file    = YOUR CLUSTER FILE
fasta_ref       = HUMAN REFERENCE FASTA
bam_alignment   = BAM ALIGNMENT FOR YOUR REFERENCE IN hg38
output_prefix   = YOUR OUTPUT FILE NAME PREFIX
output_dir      = PATH WHERE YOUR OUTPUT IS STORED
containers_dir  = PATH WHERE CONTAINERS ARE STORED
account         = humgen        # CHANGE TO YOURS
partition       = sadacc-short      # CHANGE TO YOURS

executor >  slurm (1)
[49/98b7fe] process > plink (processing ... YOUR ... [100%] 1 of 1 ✔
PLINK v2.00a3.6LM AVX2 Intel (14 Aug 2022)     www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3

--pedmap <prefix>  : Specify .ped + .map filename prefix.
--ped <filename>   : Specify full name of .ped file.
--map <filename>   : Specify full name of .map file.

Workflow completed at: 2023-09-16T14:19:53.277+02:00
     Execution status: OK


Run the workflow with your project


  • Make sure you have provided all necessary parameter values in the nextflow.config file.

  • Singularity appears to be the best option, since docker and apptainer might not be available, and we might not be able to install iaap-cli on the cluster.

If your IDAT files end with .gz (.idat.gz), make sure to unzip them before running the workflow.

gunzip xxxxxxx.idat.gz
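
If you have many compressed IDAT files spread across the sub-folders, you can decompress them all in one go, for example:

# decompress every .idat.gz under the intensities folder (hypothetical path)
find /path/to/intensities -name "*.idat.gz" -exec gunzip {} +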

Locally on a compute node (interactive job)

  • You have requested an interactive job on your cluster with minimal resources.
  • So you still need nextflow to submit each job requesting the resources you specified in the config file.
nextflow run idat2vcf.nf -profile slurm,singularity,hg19 -w "./work/"

  • You have requested an interactive job on your cluster with enough resources.
  • So nextflow will run each process within the resources you were already allocated.
nextflow run idat2vcf.nf -profile local,singularity,hg19 -w "./work/"

Submit as a job on a cluster

Place the first command above in your sbatch job submission script and submit it on your cluster.

#!/usr/bin/env bash
#SBATCH --account humgen
#SBATCH --partition sadacc-short
#SBATCH --nodes 1
#SBATCH --time 06:00:00
#SBATCH --ntasks 1

nextflow run idat2vcf.nf -profile slurm,singularity,hg19 -w /scratch/xxxxxxxxx/xxxxxxxx/gwas/work/
  • The time has been increased to accommodate the project.

  • The work directory has been changed to a location with enough storage capacity. A few commands for monitoring the run are sketched below.
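
After submitting the script, you can keep an eye on progress with your scheduler's usual tools and with nextflow's own log, for example:

# check the status of your submitted jobs (SLURM)
squeue -u "$USER"

# follow the workflow log written in the launch directory
tail -f .nextflow.log

# list previous workflow runs and their status
nextflow log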