17th International Symposium on
Mathematical Theory of Networks and Systems
Kyoto International Conference Hall, Kyoto, Japan, July 24-28, 2006

MTNS 2006 Paper Abstract


Paper MoP10.5

Vidyasagar, Mathukumalli (Tata Consultancy Services)

The 4M (Mixed Memory Markov Model) Algorithm for Stochastic Modelling Over a Finite Alphabet with Applications to Gene-Finding Algorithms

Scheduled for presentation during the Regular Session "Biological Systems" (MoP10), Monday, July 24, 2006, 17:00−17:25, Room 103


This information is tentative and subject to change. Compiled on April 19, 2024

Keywords: Biological systems analysis, Statistical learning, Stochastic systems

Abstract

In this paper, we study the problem of constructing models for a stationary stochastic process $\{ Y_t \}$ assuming values in a finite set $M := \{ 1, \ldots, m \}$. It is assumed that only a \emph{finite length sample path} of the process is known, and not the full statistics of the process. Two kinds of problems are studied, namely: modelling for prediction, and modelling for classification. For the prediction problem, a companion paper shows that the well-known approach of modelling the given process as a multi-step Markov process is in fact the \emph{only} solution satisfying certain nonnegativity constraints. In the present paper, accuracy and confidence bounds are derived for the parameters of this multi-step Markov model. So far as the author is aware, such bounds have not been published previously. For the classification problem, it is assumed that two distinct sets of sample paths of two separate stochastic processes are available -- call them $\{ u_1, \ldots, u_r \}$ and $\{ v_1, \ldots, v_s \}$. The objective here is to develop not one but \emph{two} models, called $C$ and $NC$ respectively, such that the strings $u_i$ have much larger likelihoods under the model $C$ than under the model $NC$, and the opposite is true for the strings $v_j$. A new string $w$ is then classified into the set $C$ or $NC$ according to whether its likelihood is larger under the model $C$ or the model $NC$. For the classification problem, we develop a new algorithm called the 4M (Mixed Memory Markov Model) algorithm, which is an improvement over variable-length Markov models. We then apply the 4M algorithm to the problem of finding genes in the genome. The performance of the 4M algorithm is compared against that of the popular \texttt{Glimmer} algorithm. In most of the test cases studied, the 4M algorithm correctly classifies both coding and non-coding regions more than 90% of the time. Moreover, the accuracy of the 4M algorithm compares well with that of \texttt{Glimmer}. At the same time, the 4M algorithm is amenable to statistical analysis.
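The two-model classification scheme described in the abstract can be illustrated with a minimal sketch. The sketch below trains two fixed-order (order-$k$) Markov models with add-one smoothing and classifies a new string by comparing log-likelihoods; it is an assumed simplification for illustration only, not the 4M algorithm itself, which mixes memory lengths across contexts. The names `train_markov`, `log_likelihood`, and `classify` are hypothetical.

```python
from collections import defaultdict
import math

def train_markov(strings, k, alphabet):
    """Estimate an order-k Markov model from sample paths,
    with add-one smoothing over the given alphabet."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + len(alphabet)
        model[ctx] = {a: (nxt.get(a, 0) + 1) / total for a in alphabet}
    return model

def log_likelihood(s, model, k, alphabet):
    """Log-likelihood of string s under an order-k model; unseen
    contexts fall back to a uniform distribution over the alphabet."""
    uniform = 1.0 / len(alphabet)
    ll = 0.0
    for i in range(k, len(s)):
        p = model.get(s[i - k:i], {}).get(s[i], uniform)
        ll += math.log(p)
    return ll

def classify(w, model_C, model_NC, k, alphabet):
    """Assign w to class C or NC, whichever gives the larger likelihood."""
    ll_C = log_likelihood(w, model_C, k, alphabet)
    ll_NC = log_likelihood(w, model_NC, k, alphabet)
    return "C" if ll_C >= ll_NC else "NC"
```

For the gene-finding application, $C$ would be trained on coding regions and $NC$ on non-coding regions over the alphabet $\{a, c, g, t\}$; a candidate region $w$ is then labelled coding exactly when its log-likelihood under $C$ exceeds that under $NC$.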