Example

Input Data
Basic usage

This tutorial provides an illustrative analysis of somatic mutation data using simulation data available here link with dNdS-Fun. Before running the analysis, please ensure that dNdS-Fun is installed. For installation instructions, refer to this link.

Input Data #

dNdS-Fun supports two types of input data:

Somatic mutation data with functional impact scores (annotated in the sixth column).
Somatic mutation data without functional impact scores. This type of data requires an additional CADD score database, available here link. Place the database in the path specified by our software or provide the correct path for the software to locate it. We support both GRCh37 and GRCh38 human genome; please use the corresponding model to ensure compatibility.

For instructions on annotating the functional impact score in the input data, please refer to the Preparing Reference Data section of this tutorial.

The example data required for this tutorial can be downloaded from the following link. Detailed information about the input data is provided below.

1. Input data format without functional impact scores #

# Load the example cds-exon data
input_dat = readRDS("../example_cds-exon.rds")

head(vcf_cds)

Sample1	chr14	20844257	G	T
Sample1	chr19	21558661	A	G
Sample2	chr1	192335033	C	A
Sample1	chr1	241875099	T	C
Sample2	chr16	48265863	G	A

A tab-delimited text file containing all somatic mutations sequenced from one or more individuals. Each row corresponds to a specific variant, and the five columns are: “IndividualID,” “Chromosome,” “Position,” “Ref,” and “Alt.”

A table-delimited text file contained all somatic mutation sequenced from multiple individuals. Each row corresponds to a specific variant, and the 5 columns are separate “IndividualID, Chromosome, Position, Ref, Alt”.

Note: Our software automatically adds CADD scores, but you can also provide mutation files with CADD scores. See below for the detailed format. If your input data format without functional impact scores, you will need to provide the CADD scores database, which can be downloaded from here (GRCh37) or here (GRCh38). You can use the parameter score_database to provide your CADD scores database path.

2. Split the CADD scores into different chromosomes #

To speed up your analysis and reduce the resource footprint of your tasks, we offer multi-threaded processing. You can split the downloaded CADD scores into 22 chromosomes using the following code, allowing you to analyze them simultaneously with multiple threads.

chr=22 #range[1-22]
wkDir="/your_path/"
mkdir=${wkDir}/chr/
wholefil="whole_genome_SNVs.tsv.gz" # downloaded from CADD scores, put into the current directory

rm ${wkDir}/chr/whole_genome_SNVs.tsv.gz.chr${chr}.gz.rankRawScore.GRCh37
echo -ne "#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED\n" >${wkDir}/chr/whole_genome_SNVs.tsv.gz.chr${chr}.gz.rankRawScore.GRCh37
zcat ${wkDir}/${wholefil} | awk -v var=$chr '{if($1==var)print $0}' | sort | uniq >>${wkDir}/chr/whole_genome_SNVs.tsv.gz.chr${chr}.gz.rankRawScore.GRCh37
bgzip -c ${wkDir}/chr/whole_genome_SNVs.tsv.gz.chr${chr}.gz.rankRawScore.GRCh37 > ${wkDir}/chr/whole_genome_SNVs.tsv.gz.chr${chr}.gz.rankRawScore.GRCh37.gz
tabix -fp vcf ${wkDir}/chr/whole_genome_SNVs.tsv.gz.chr${chr}.gz.rankRawScore.GRCh37.gz

The path ${wkDir}/chr/ can be provided to dNdS-Fun as the parameter score_database. The maximum memory required for this step may reach up to 35GB.

3. Input data format with functional impact scores #

# Load the example cds-exon data
input_dat = readRDS("../example_cds-exon.rds")

head(impScore_cds)

Sample1	chr14	20844257	G	T	0.957112
Sample1	chr19	21558661	A	G	0.544228
Sample2	chr1	192335033	C	A	0.312798
Sample1	chr1	241875099	T	C	0.221462
Sample2	chr16	48265863	G	A	0.973034

The sixth column is the raw CADD score, annotated from the above the CADD database.

Basic usage #

library(dNdSFun)

# Documentations
help(dNdSFun)

1. Global selection estimation of dNdS-Fun #

This section provides an basic usage of dNdS-Fun.

mutsFile:
reg:
dichotomylog:
globaldnds_outFile:
genelevel_outFile:
score:
score_database:
iscv: NULL (The default).
model: 1 or other numberic variables.

For further details about the parameters, please refer to this link.

#  
module load R/4.0.5
library(dNdSFun)

export R_LIBS_USER="/your_path/R_LIB_4.0.5"
Rscript src/cal_CADD_dndsWGS.NEG.R ${mutsFile} data/GRCh37/${reg}.GRCh37.rda data/GRCh37/${dichotomylog} ${reg} ${genelevel_outFile} ${iscv} ${model} ${score} ${score_database}
#or
Rscript src/cal_CADD_dndsWGS.NEG.R ${mutsFile} data/GRCh37/${reg}.GRCh37.rda data/GRCh38/${dichotomylog} ${reg} ${genelevel_outFile} ${iscv} ${model} ${score} ${score_database}

If you provide the somatic mutations with functional impact scores, you can ignore the parameter score and score_database.

2. Global selection estimation with dNdS-Fun #

This section provides a global (genome-wide) selection estimation of somatic mutations using dNdS-Fun.

# Set the gene selection method in dNdS-Fun function 
library(dNdSFun)
readRDS("../outFile_cds-exon.rds")

head(globaldnds_outFile)

region wsel ci_low ci_high pval selection
cds 1.29 1.23 1.37 0.001 positive

The results are stored in CADD_dndsWGSout.

Global selection estimate: CADD_dndsWGSout$globaldnds_outFile
wsel: ω ratio
ci_low: lower 95% confidence of the ω ratio
ci_high: lower 95% confidence of the ω ratio
pval: p-value from the Chi-square test to determine whether the ω ratio differs from 1 (indicating no selection).
selection: predicted selection direction

3. Local selection estimation with dNdS-Fun #

The dNdS-Fun allows users to estimate selection at the local (gene or element-specific) scales. For five types of genomic regions, dNdS-Fun performs separate analyses for each gene. Here, we demonstrate one type of results: coding sequence (cds).

# Set the gene selection method in dNdS-Fun function 
library(dNdSFun)
readRDS("../outFile_cds-exon.rds")

head(genelevel_outFile)

gene_name	n_neutral    n_select    wsel_cv   psel_cv qsel_cv
ENSG00000115524.11::SF3B1   0   15  148.10    9.06e-12    1.71e-07
ENSG00000160201.7::U2AF1    0   7   194.33    9.73e-09    0.00018
ENSG00000171862.5::PTEN     0   6   201.18    4.42e-08    0.00083
ENSG00000163554.7::SPTA1    2   11  18.27     9.67e-07    0.01834