Published on *Computer Science Department at Princeton University* (https://www.cs.princeton.edu)

Report ID:

TR-678-03

Authors:

Date:

September 2003

Pages:

11

Download Formats:

This paper is concerned with the reconstruction of perfect phylogenies

from binary character data with missing values, and related problems

of inferring complete haplotypes from haplotypes or genotypes with

missing data. In cases where the problems considered are NP-hard we

assume a rich data hypothesis under which they become

tractable. Natural probabilistic models are introduced for the

generation of character vectors, haplotypes or genotypes with missing

data, and it is shown that these models support the rich data

hypothesis. The principal results include:- A near-linear time algorithm for inferring a perfect phylogeny from

binary character data (or haplotype data) with missing values, under

the rich data hypothesis;- A quadratic-time algorithm for inferring a perfect phylogeny from

genotype data with missing values with high probability, under certain

distributional assumptions;- Demonstration that the problems of maximum-likelihood inference of

complete haplotypes from partial haplotypes or partial genotypes can

be cast as minimum-entropy disjoint set cover problems;- In the case where the haplotypes come from a perfect phylogeny, a

representation of the set cover problem as minimum-entropy covering of

subtrees of a tree by nodes;- An exact algorithm for minimum-entropy subtree covering, and

demonstration that it runs in polynomial time when the subtrees have

small diameter;- Demonstration that a simple greedy approximation algorithm solves

the minimum-entropy subtree covering problem with relative error

tending to zero when the number of partial haplotypes per complete

haplotype is large;- An asymptotically consistent method of estimating the frequencies of

the complete haplotypes in a perfect phylogeny, under an iid model for

the distribution of missing data;- Computational results on real data demonstrating the effectiveness

of a the greedy algorithm for inferring haplotypes from genotypes with

missing data, even in the absence of a perfect phylogeny.

**Links**

[1] https://www.cs.princeton.edu/research/techreps/author/419

[2] https://www.cs.princeton.edu/research/techreps/author/94

[3] ftp://ftp.cs.princeton.edu/techreports/2003/678.ps.gz

[4] ftp://ftp.cs.princeton.edu/techreports/2003/678.pdf