GenoCheck Documentation


This file describes in detail the error checking scheme that was implemented by M. G. Ehm, R. W. Cottingham, Jr., and M. Kimmel. The error checking system, used in conjunction with FASTLINK, identifies individuals and loci likely to contain errors using a likelihood-based method.


INTRODUCTION

GenoCheck is described in the following papers:

M. G. Ehm, R. W. Cottingham Jr., and M. Kimmel. Error Detection in Genetic Linkage Data Using Likelihood Based Methods. Journal of Biological Systems, Vol. 3, No. 1 (1995) 13-25.

M. G. Ehm, R. W. Cottingham Jr., and M. Kimmel. Error Detection in Genetic Linkage Data Using Likelihood Based Methods. American Journal of Human Genetics, Vol. 58, No. 1 (1996).


This documentation refers to version 1.0 of the error detection scheme described in the papers above. The error detection algorithm, called GenoCheck, uses an altered version of the ILINK program, called ILINKERR, from FASTLINK 2.2.

The occurrence of laboratory typing error in pedigree data for linkage analysis cannot be ignored. When studying linked markers between which crossovers rarely occur, errors in the data will often result in false recombinations. Erroneous recombinations in a dense map are given substantial weight, thereby increasing the estimate of theta, the recombination fraction. In dense maps, theta approaches the error rate and most observed crossovers will be spurious. We present a method for detecting errors in pedigree data, based on an index that is a variant of the likelihood ratio test statistic. The index is used to test, for each individual at each locus, the null hypothesis of no error against the alternative hypothesis of error. High values of the index pinpoint individuals and loci with relatively unlikely genotypes. Power and significance studies using Monte Carlo methods show that the index detects errors for small values of theta with a small false positive rate.


THE PROCESS

When pedigree data are obtained by typing individuals, the observed genotype is equal to the true genotype unless a typing error has occurred. We represent error in pedigree data as incomplete penetrance of genotypes. The observed genotypes are considered phenotypes and may not correspond to the true genotypes due to errors. Therefore, modeling error in pedigree data is easily accomplished using the likelihood method of genetic linkage analysis by altering the penetrance function. Our method is designed to identify individuals and loci likely to contain errors. The method is equivalent to a hypothesis test for error for each individual and locus in the pedigree.
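
As a rough illustration of treating observed genotypes as phenotypes, the sketch below writes a penetrance function P(observed | true) in Python. The uniform misclassification model, the function name error_penetrance, and the example genotype labels are assumptions made for illustration. They are not taken from the GenoCheck source, and the program's actual penetrance values may differ.

```python
def error_penetrance(observed, true, genotypes, error_rate):
    """Return P(observed genotype | true genotype).

    Assumed model (not from GenoCheck): with probability 1 - error_rate the
    observed genotype equals the true one; otherwise the error is spread
    evenly over the remaining genotypes.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    if len(genotypes) < 2:
        raise ValueError("need at least two possible genotypes")
    if observed == true:
        return 1.0 - error_rate
    return error_rate / (len(genotypes) - 1)


# Hypothetical genotypes at a two-allele marker.
GENOTYPES = ["1/1", "1/2", "2/2"]

# Complete penetrance (no errors) is the special case error_rate = 0.
no_error = error_penetrance("1/2", "1/1", GENOTYPES, 0.0)

# With an assumed 1% error rate, a mismatch has small but nonzero probability.
with_error = error_penetrance("1/2", "1/1", GENOTYPES, 0.01)
```

Under this assumed model, setting the error rate to zero recovers the usual case in which an observed genotype that differs from the true genotype has probability zero.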

Each hypothesis test entails:

  1. specifying a penetrance function based on an assumed error rate
  2. calculating the difference between the log-likelihood of the data at the maximum likelihood estimates of theta assuming complete penetrance (i.e. no errors) and the log-likelihood of the data at the maximum likelihood estimates of theta assuming incomplete penetrance (errors possible)
  3. identifying test statistics with relatively large values as indicative of an unlikely genotype since large values are associated with more evidence for errors than for no errors.

The GenoCheck program implements steps (1)-(3). Its output is a file containing the values of the test statistic separated by family and locus and ranked in decreasing order.
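
Steps (2) and (3) can be sketched as follows, assuming the two maximized log-likelihoods for each individual at one locus have already been computed elsewhere. The sign convention (log-likelihood with errors allowed minus log-likelihood with no errors) is an assumption, chosen so that large values mean more evidence for error, as described above. The function names and all numbers are invented for illustration and do not come from ILINKERR.

```python
def error_index(loglik_no_error, loglik_with_error):
    """Difference of maximized log-likelihoods (assumed sign convention).

    Positive values mean the data are more likely when an error is allowed.
    """
    return loglik_with_error - loglik_no_error


def rank_by_index(logliks):
    """Rank individuals by decreasing index.

    logliks maps an individual id to a pair
    (loglik_no_error, loglik_with_error) for one pedigree and one locus.
    """
    scored = [(ind, error_index(l0, l1)) for ind, (l0, l1) in logliks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented log-likelihoods for three individuals at a single locus.
example = {
    "101": (-120.0, -118.5),
    "102": (-120.0, -120.25),
    "201": (-120.0, -119.75),
}

ranked = rank_by_index(example)
for individual, index in ranked:
    print(individual, index)
```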


THE FILES

The following is a list of the files associated with GenoCheck.


Setting up an Error Checking Run

In order to perform error checking on marker data, you must complete the following checklist.

  1. Note that the error checking capability is not available for sex-linked data, mutation data, or sex difference data (male and female theta are assumed to be the same). The program will exit politely with an error message in these situations.
  2. The marker data being tested for errors must be in the affection status format. The program TOAFF will convert any data format to the affection status format. The file "inped" (pedin.dat format) should contain the pedigree data to be converted to affection status. The file "indat" (datain.dat format) should contain the parameter information corresponding to inped. TOAFF requires no parameters. To run TOAFF, type "toaff" on the command line. For convenience, copy the output file "outped" into pedin.dat and "outdat" into datain.dat.
  3. Partition the ordered markers into 2-, 3-, and 4-point analyses. If n is the number of individuals and m is the number of loci to be analyzed jointly, then GenoCheck requires n*m more likelihood evaluations beyond finding the maximum likelihood estimate of the recombination fractions. Therefore, in general, if the recombination fractions can be estimated using 2-point analysis, then error checking is possible using 2-point analysis; likewise, if they can be estimated using 3-point analysis, then error checking is possible using 3-point analysis.
  4. Assume the published order for the markers to be checked for error or find the most likely order.
  5. Create a subdirectory for each error analysis. Each subdirectory should contain a pedin.dat and datain.dat file (markers in the affection status format).
  6. Use lcp to create a script for each run. The guide below will assist you with the options. The output of lcp is a script named pedin.
         Pedigree Options:  General Pedigrees
         General Pedigree Analysis Options:  ILINK
         ILINK - Order Options:  Specific order
         ILINK - Sex Difference Options:  No sex difference
         ILINK - Locus Order Specification:  (Specify the most likely
                                              order with recombination
                                              fractions equal to 0.1
                                              or the published values
                                              if available.)
        
  7. Run suberr. This command uses the file pedin created in step (6) and creates the executable file pedinerr which contains the commands needed to run ILINKERR instead of ILINK.
  8. Run pedinerr. The file PosError will contain the error checking results.
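
The file handling in steps (2) and (5) can be scripted. The Python sketch below copies the TOAFF output files into a fresh analysis subdirectory under the names the later steps expect. The function name stage_analysis_dir and its directory arguments are hypothetical. Only the file names outped, outdat, pedin.dat, and datain.dat come from the checklist above.

```python
import shutil
from pathlib import Path


def stage_analysis_dir(toaff_output_dir, analysis_dir):
    """Copy TOAFF output into an analysis subdirectory.

    outped is copied to pedin.dat and outdat to datain.dat, so that the
    subdirectory holds the two input files named in step (5).
    """
    src = Path(toaff_output_dir)
    dest = Path(analysis_dir)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src / "outped", dest / "pedin.dat")
    shutil.copyfile(src / "outdat", dest / "datain.dat")
    return dest
```

A call such as stage_analysis_dir(".", "run_2point") would prepare one subdirectory, where the directory name run_2point is only an example. Steps (6) to (8) would then be run by hand inside that subdirectory.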


Interpreting an Error Checking Run

In the file PosError, the test statistics are separated by locus within each pedigree. Within each pedigree and locus, each individual is listed with its associated test statistic in order of decreasing test statistic. As briefly described above, test statistics with relatively large values are indicative of an unlikely genotype for that individual at that locus. Test statistics greater than 0.0 are of particular interest. Note that test statistics are not comparable across different pedigrees or loci. In the presence of multiple errors, the program is likely to catch only some errors. Therefore, it is very important to correct any errors found and rerun the program.

The ordered list of individuals within pedigree and locus given in PosError should be thought of as a priority list for retyping. Interpreting an error checking run includes the following steps:

  1. Reread gels and check computer file entries for individuals in the top 20% of the locus lists within each pedigree. If no errors are found and all the test statistics are less than 0 then stop error checking. If there are any errors, correct them, run the analysis again, and go to step 2.
  2. Retype each individual in the top 10% of the locus lists within each pedigree. If there are no errors, then stop error checking. If errors are present, correct them and run the analysis again.
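
The selection of the top 20% or 10% of a locus list can be sketched as below. The sketch assumes the ranked list for one pedigree and locus is already in memory as (individual, statistic) pairs in decreasing order. It does not parse PosError, whose layout is not described here. Rounding up to a whole individual, the function names, and the example values are also assumptions.

```python
def top_percent(ranked, percent):
    """Return the leading entries covering the given integer percentage.

    ranked is a list of (individual, statistic) pairs already sorted by
    decreasing statistic. The count is rounded up (an assumption).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    count = (len(ranked) * percent + 99) // 100  # integer ceiling
    return ranked[:count]


def all_below_zero(ranked):
    """True when every statistic in the list is less than 0."""
    return all(stat < 0 for _, stat in ranked)


# Invented ranked list for one pedigree and one locus.
ranked = [
    ("101", 1.5), ("205", 0.4), ("102", -0.1), ("301", -0.2), ("302", -0.3),
    ("103", -0.5), ("201", -0.6), ("202", -0.8), ("303", -0.9), ("104", -1.2),
]

reread = top_percent(ranked, 20)   # candidates for step 1
retype = top_percent(ranked, 10)   # candidates for step 2
```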



NOTE: To use GenoCheck at NIH, please contact CIT/DCB/BIMAS. GenoCheck requires a customized executable for most datasets. BIMAS will examine your dataset(s) and create the needed GenoCheck executable(s) for you. GenoCheck will run on helix and other UNIX workstations.