![]()
During the research project, we study another interesting application of compressibility: to measure the "relatedness" between two DNA sequences or two genomes. We formulate a symmetric distance between two sequences using a beautiful theorem from Kolmogorov Complexity.
For two sequences x and y, We define sequence distance by d(x,y)=1-(K(x)-K(x|y))/K(xy), where K(x|y) is the conditional Kolmogorov complexity of x given y which is defined as the length of the shortest program that on input y outputs x. It has been proved that this definition satisfies all the four conditions a distance measurement requires, and that it is a universal distance.
We use Compress(x) to heuristically approximate K(x) and use Compress(x|y) to approximate K(x|y) since Kolmogorov complexity is not computable, where Compress(x) is the length of x after compressed by GenCompress algorithm. We can see that experimental results will also support the theory of sequence distance.
It has not escaped out notice that the distance measure we have postulated can be immediately used to construct evolutionary trees from DNA sequences, especially those that cannot be aligned, such as complete genomes. Here you can find a rRNA tree built from the above experimental results.