Counting Regular Expressions in Degenerated Sequences through Lazy Markov Chain Embedding.

Saved in:
Bibliographic Details
Title: Counting Regular Expressions in Degenerated Sequences through Lazy Markov Chain Embedding.
Authors: Nuel, G.1 gregory.nuel@parisdescartes.fr, Delos, V.1
Source: Annual International Conference on Computational Mathematics, Computational Geometry & Statistics. 2014, p100-108. 9p.
Subjects: Sequential machine theory, Cellular automata, Markov processes, Algorithm research, Embeddings (Mathematics)
Abstract: Nowadays, Next Generation Sequencing (NGS) produce huge number of reads which are combined using multiple alignment tech-niques to produce sequences. During this pro-cess, many sequencing errors are corrected, but the resulting sequences nevertheless contain a marginal level of uncertainty in the form of ~ 0.1% or less of degenerated positions (like the letter 'N' corresponding to any nucleotide). A previous work [7] showed that these degen-erated letters might lead to erroneous counts when performing pattern matching on these se-quences. An algorithm based on Determinis-tic Finite Automata (DFA) and Markov Chain Embedding (MCE) was suggested to deal with this problem. In this paper, we introduce a new version of this algorithm which uses Nondeterministic Fi-nite Automata (NFA) rather than DFA to per-form what we call "lazy MCE". This new ap-proach proves itself much faster than the previ-ous one and we illustrate its usefulness on two NGS datasets and a selection of regular expres-sions. A software implementing this al-gorithm is available: countmotif, http ://www.math-info.univ-paris5. [ABSTRACT FROM AUTHOR]
Copyright of Annual International Conference on Computational Mathematics, Computational Geometry & Statistics is the property of Global Science & Technology Forum and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
Description
Abstract:Nowadays, Next Generation Sequencing (NGS) produce huge number of reads which are combined using multiple alignment tech-niques to produce sequences. During this pro-cess, many sequencing errors are corrected, but the resulting sequences nevertheless contain a marginal level of uncertainty in the form of ~ 0.1% or less of degenerated positions (like the letter 'N' corresponding to any nucleotide). A previous work [7] showed that these degen-erated letters might lead to erroneous counts when performing pattern matching on these se-quences. An algorithm based on Determinis-tic Finite Automata (DFA) and Markov Chain Embedding (MCE) was suggested to deal with this problem. In this paper, we introduce a new version of this algorithm which uses Nondeterministic Fi-nite Automata (NFA) rather than DFA to per-form what we call "lazy MCE". This new ap-proach proves itself much faster than the previ-ous one and we illustrate its usefulness on two NGS datasets and a selection of regular expres-sions. A software implementing this al-gorithm is available: countmotif, http ://www.math-info.univ-paris5. [ABSTRACT FROM AUTHOR]
ISSN:22511911
DOI:10.5176/2251-1911_CMCGS14.28