1. Introduction to LDWP
LDWP (short for Linked region detection without pedigree) is a software package for linkage analysis that finds mutation regions when the input individuals are closely related but the pedigree is not known. A typical example: in the pedigree below, the individuals in the dotted rectangle are closely related and their genotype data are known, whereas the genotype data of individuals outside the dotted rectangle are not available and even the pedigree may not be clear. LDWP takes as input the genotype data and the diagnosis (diseased or normal) of each individual in the dotted rectangle, and reports the mutation regions.
This package is implemented in C++, and we provide .exe files for download. Currently the package is available for Windows users only. Click LDWP to download both the DOS and UI versions.
3. How to use
Before you use this package, you should first unzip it.
This package contains 3 directories: DOS, UI and DATA. DOS is a version that runs under DOS (or on the console). UI is a version with a graphical user interface. DATA contains the input data files necessary to run this package.
1. DATA: This directory contains the input files necessary to run this package. Before running this package, you should put the necessary files into the directory containing the executable files.
2. DOS: This directory contains the executable file linkage-formal.exe, which can be run under DOS. To run this program, first put the files haplotype.phased, genotype.txt and children.ped into this directory, and then double-click the file linkage-formal.exe. After about 20 minutes, the computed results will be shown on the console, and they will also be saved in three .txt files.
3. UI: This directory contains the executable file regionDetect-formal.exe with a simple user interface implemented with MFC. Click here to see how to use it.
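Steps 1 and 2 above amount to copying the three input files next to the executable before launching it. The following is a minimal Python sketch of that preparation step, assuming the unzipped package root contains the DATA and DOS directories named above. The helper name stage_inputs is hypothetical and is not part of LDWP.

```python
import shutil
from pathlib import Path

# Input file names taken from the description of the DOS version.
REQUIRED_FILES = ("haplotype.phased", "genotype.txt", "children.ped")


def stage_inputs(package_root, source_dir="DATA", target_dir="DOS"):
    """Copy the required input files from source_dir into target_dir.

    Raises FileNotFoundError naming every missing file, so nothing is
    copied unless the full input set is present.
    """
    root = Path(package_root)
    src, dst = root / source_dir, root / target_dir
    missing = [name for name in REQUIRED_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in REQUIRED_FILES:
        shutil.copy2(src / name, dst / name)  # copy2 also keeps timestamps
        copied.append(dst / name)
    return copied
```

Calling stage_inputs with the folder where the archive was unzipped, and then double-clicking linkage-formal.exe, corresponds to the manual procedure. Passing target_dir="UI" would stage the same files for the UI version, assuming it reads them from its own directory.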
4. Input files
LDWP needs three input files:
children.ped contains basic information for the input individuals. Each row contains information for one individual. There are 3 columns: person-name, gender (0 for male, 1 for female), and disease status (0 for diseased, 1 for healthy). Below is a small example containing three individuals A, B, C.
A 0 1
B 0 1
C 1 1
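Because LDWP expects exactly three whitespace-separated columns per row, it can help to validate children.ped before a 20-minute run. Below is a small Python sketch of such a check. The names Individual and parse_children_ped are illustrative, not part of the package, and the sketch assumes whitespace is the column separator as in the example.

```python
from typing import List, NamedTuple


class Individual(NamedTuple):
    name: str
    gender: int  # 0 = male, 1 = female
    status: int  # 0 = diseased, 1 = healthy


def parse_children_ped(text: str) -> List[Individual]:
    """Parse the contents of children.ped, rejecting malformed rows."""
    individuals = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        name, gender, status = fields[0], int(fields[1]), int(fields[2])
        if gender not in (0, 1) or status not in (0, 1):
            raise ValueError(f"line {line_no}: gender and status must be 0 or 1")
        individuals.append(Individual(name, gender, status))
    return individuals


# The three-individual example from the text.
people = parse_children_ped("A 0 1\nB 0 1\nC 1 1\n")
diseased = [p.name for p in people if p.status == 0]
print(len(people), "individuals,", len(diseased), "diseased")
```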
The file genotype.txt contains the genotype data for all input individuals. The first row is a header containing rsID, snp_position, and the individual names, as used in the pedigree file. Each subsequent row represents a SNP site. The first column is the rsID of the SNP and the second column is its position; each remaining column holds one individual's genotype, e.g., CT, CC, etc. The following is a small example containing 3 individuals, named M, N and W, respectively, with genotype data for five SNP sites.
rsID snp_position M N W
rs10458597 554484 CT TT CC
rs2185539 556738 CC CT CC
rs11240767 718814 CC CC CC
rs12564807 724325 AG AA AA
rs3131972 742584 GG GG GG
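The layout above (a header row, then one row per SNP) can be read with a few lines of Python. This is only a sketch for inspecting or sanity-checking a genotype file before running LDWP. The function name parse_genotype_table is hypothetical, and whitespace-separated columns are assumed, as in the example.

```python
from typing import Dict, List, Tuple

SnpRow = Tuple[str, int, Dict[str, str]]  # (rsID, position, {individual: genotype})


def parse_genotype_table(text: str) -> Tuple[List[str], List[SnpRow]]:
    """Parse the contents of genotype.txt into individual names and SNP rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    if header[:2] != ["rsID", "snp_position"]:
        raise ValueError("unexpected header: " + lines[0])
    names = header[2:]
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError("wrong number of columns: " + line)
        genotypes = dict(zip(names, fields[2:]))
        if any(len(g) != 2 for g in genotypes.values()):
            raise ValueError("each genotype must have two alleles: " + line)
        rows.append((fields[0], int(fields[1]), genotypes))
    return names, rows


example = """rsID snp_position M N W
rs10458597 554484 CT TT CC
rs2185539 556738 CC CT CC
rs11240767 718814 CC CC CC
rs12564807 724325 AG AA AA
rs3131972 742584 GG GG GG
"""
names, rows = parse_genotype_table(example)
# SNPs at which individual M is heterozygous (two different alleles).
het_m = [rsid for rsid, _, g in rows if g["M"][0] != g["M"][1]]
print(names, het_m)
```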
(3) Haplotype information file:
The file "haplotype.phased" contains haplotype data for chromosome 1 of 170 unrelated Japanese in Tokyo and Han Chinese in Beijing, China. This file can be downloaded from the International HapMap Project (http://hapmap.org). The first column is the rsID of the SNP, the second column is its position, and each remaining column is one haplotype of an individual. The following is a small example containing 3 individuals with haplotype data for five SNP sites.
rsID snp_position M_A M_B N_A N_B W_A W_B
rs10458597 554484 C T T T C C
rs2185539 556738 C C C T C C
rs11240767 718814 C C C C C C
rs12564807 724325 A G A A A A
rs3131972 742584 G G G G G G
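The haplotype example and the genotype example describe the same three individuals: each genotype is the pair of alleles carried on an individual's two haplotypes. The Python sketch below makes that relationship explicit by collapsing haplotype columns into genotype columns. It assumes the X_A / X_B column naming used in the example (real HapMap files may name columns differently), and the function name haplotypes_to_genotypes is hypothetical.

```python
def haplotypes_to_genotypes(text: str) -> str:
    """Collapse paired haplotype columns (X_A, X_B) into one genotype column X."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    hap_columns = header[2:]
    if len(hap_columns) % 2 != 0:
        raise ValueError("haplotype columns must come in pairs")
    names = []
    for first, second in zip(hap_columns[0::2], hap_columns[1::2]):
        paired = first.endswith("_A") and second.endswith("_B")
        if not paired or first[:-2] != second[:-2]:
            raise ValueError(f"columns {first} and {second} are not an A/B pair")
        names.append(first[:-2])
    out = [" ".join(["rsID", "snp_position"] + names)]
    for line in lines[1:]:
        fields = line.split()
        alleles = fields[2:]
        if len(alleles) != len(hap_columns):
            raise ValueError("wrong number of columns: " + line)
        # Phase is dropped: alleles are sorted so C/T and T/C both give CT.
        genotypes = ["".join(sorted(pair)) for pair in zip(alleles[0::2], alleles[1::2])]
        out.append(" ".join(fields[:2] + genotypes))
    return "\n".join(out)


phased = """rsID snp_position M_A M_B N_A N_B W_A W_B
rs10458597 554484 C T T T C C
rs2185539 556738 C C C T C C
rs11240767 718814 C C C C C C
rs12564807 724325 A G A A A A
rs3131972 742584 G G G G G G
"""
print(haplotypes_to_genotypes(phased))
```

Applied to the haplotype example, this reproduces the rows of the genotype.txt example given earlier.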
The package will output the predicted linked regions in three files: linked_region_10f.txt, linked_region_10f_3500.txt, and linked_region_10f_extend2000.txt, produced by three different algorithms (Algorithm 2, Algorithm 3 and Algorithm 4, respectively). Each reported region is represented by its starting SNP and its ending SNP. A typical output is as below:
The detectd regions are:
The simulation version contains an extra package that generates the input individuals' genotype data. The program takes as input a pedigree and the two haplotype segments (for a whole chromosome) of each founder in the pedigree. It generates the two haplotype segments of each remaining individual in the pedigree using the standard chi-square model for recombination with m = 4, according to the male/female averaged genetic map for chromosome 1 downloaded from HapMap (http://hapmap.org).
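For readers unfamiliar with the chi-square model, here is a rough Python sketch of one common formulation; it is not LDWP's own implementation. In this formulation, intermediate points fall along the chromosome as a Poisson process with rate 2(m+1) per Morgan, every (m+1)-th point is a chiasma (with a random starting offset), and each chiasma becomes a crossover on the transmitted chromatid with probability 1/2. The function names and the convention that positions are in Morgans (HapMap maps are usually given in centimorgans, so divide by 100) are assumptions of this sketch.

```python
import random


def simulate_crossovers(length_morgans, m=4, rng=None):
    """Sample crossover positions (in Morgans) on one gamete, chi-square model."""
    rng = rng or random.Random()
    rate = 2.0 * (m + 1)                  # intermediate points per Morgan
    next_chiasma = rng.randint(1, m + 1)  # random offset gives a stationary start
    position, count, crossovers = 0.0, 0, []
    while True:
        position += rng.expovariate(rate)  # distance to the next point
        if position >= length_morgans:
            return crossovers
        count += 1
        if count == next_chiasma:          # every (m+1)-th point is a chiasma
            next_chiasma += m + 1
            if rng.random() < 0.5:         # chiasma involves this chromatid
                crossovers.append(position)


def recombine(hap_a, hap_b, positions, crossovers, start_with_a=True):
    """Build a child haplotype by switching parental strand at each crossover.

    positions are the genetic positions (Morgans, ascending) of the SNPs;
    crossovers must be sorted ascending, as returned by simulate_crossovers.
    """
    current, other = (hap_a, hap_b) if start_with_a else (hap_b, hap_a)
    child, idx = [], 0
    for i, pos in enumerate(positions):
        while idx < len(crossovers) and crossovers[idx] < pos:
            current, other = other, current
            idx += 1
        child.append(current[i])
    return child


rng = random.Random(0)
xs = simulate_crossovers(2.5, m=4, rng=rng)
print(len(xs), "crossovers on a 2.5 Morgan chromosome")
```

With m = 0 the same code reduces to the no-interference (Poisson) model; larger m spaces crossovers more evenly while keeping the average at one crossover per Morgan.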
The simulation version needs the following input files:
(1) The file "haplotype.phased" containing the haplotype data for chromosome 1 of 170 unrelated Japanese in Tokyo and Han Chinese in Beijing, China. (mentioned before)
Pedigree.ped describes the pedigree. Each row represents an individual with five fields: the name of the individual, the father's name (-1 indicates unavailable), the mother's name (-1 indicates unavailable), gender (0 for male, 1 for female), and disease status (0 for diseased, 1 for normal).
Below is an example containing three individuals A, B, and C, where A is the father, B is the mother and C is the child.
A -1 -1 0 0
B -1 -1 1 1
C A B 1 1
(3) Genetic map file:
The file "genetic_map.txt" contains the physical loci information for the SNP markers in "genotype.txt". It can also be downloaded from the International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/). The file is required for generating children.
(4) The file children.ped (mentioned before) is also required. The users have to create the file children.ped according to Pedigree.ped.
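Since children.ped shares the name, gender and disease-status columns with Pedigree.ped, one way to create it is to select the input individuals from the pedigree and drop the two parent columns. The Python sketch below does this under that assumption. The function name children_from_pedigree is hypothetical, and which individuals to select is the user's choice.

```python
def children_from_pedigree(pedigree_text, selected):
    """Build children.ped contents (name, gender, status) for selected individuals.

    pedigree_text holds Pedigree.ped rows: name, father, mother, gender, status.
    """
    records = {}
    for line in pedigree_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 5:
            raise ValueError("expected 5 fields: " + line)
        name, _father, _mother, gender, status = fields
        records[name] = (gender, status)
    unknown = [name for name in selected if name not in records]
    if unknown:
        raise KeyError("not in pedigree: " + ", ".join(unknown))
    return "".join(f"{n} {records[n][0]} {records[n][1]}\n" for n in selected)


pedigree = "A -1 -1 0 0\nB -1 -1 1 1\nC A B 1 1\n"
print(children_from_pedigree(pedigree, ["A", "C"]), end="")
```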
To run the simulation version, double-click linkage.exe under the directory DOS. You will then be asked to input the number of all the individuals in the pedigree (e.g., 50), the number of input individuals for LDWP (e.g., 10), and the assumed position of the disease-causing mutation (e.g., 100000). After about 20 minutes, the results will be shown on the console and saved in .txt files in the DOS directory. A typical output is as below:
diseased region diseased_length get_region get_length precision recall
[96100, 102185] 6086 [92500, 102500) 10000 0.6086 1
Here, diseased region indicates the real linked region (see the paper for definition) and get_region indicates the linked region detected by our package.
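The precision and recall columns in the sample row are consistent with the usual overlap-based definitions: precision is the overlap divided by the detected length, and recall is the overlap divided by the real length. The Python sketch below reproduces the sample numbers under that assumption. It also assumes, from the bracket notation and the reported lengths, that the real region is a closed interval and the detected region is half-open. The exact definitions are in the paper, and the function names here are illustrative.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Length of the overlap of half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def precision_recall(true_region, detected_region):
    """Overlap-based precision and recall.

    true_region is closed, [start, end]; detected_region is half-open,
    [start, end), matching the brackets in the sample output.
    """
    t_start, t_end = true_region[0], true_region[1] + 1  # convert to half-open
    d_start, d_end = detected_region
    true_len = t_end - t_start
    detected_len = d_end - d_start
    shared = overlap_length(t_start, t_end, d_start, d_end)
    precision = shared / detected_len if detected_len else 0.0
    recall = shared / true_len if true_len else 0.0
    return precision, recall


# Sample row: real region [96100, 102185], detected region [92500, 102500).
precision, recall = precision_recall((96100, 102185), (92500, 102500))
print(precision, recall)
```

Here the real region lies entirely inside the detected one, so the overlap is 6086 positions, giving 6086 / 10000 = 0.6086 for precision and 6086 / 6086 = 1 for recall.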
You can also run the file regionDetect.exe, which has a graphical user interface, under the directory UI. Click here for instructions.