Public Dataset List

The following repositories and portals host many public datasets:
https://www.openml.org/search?type=data
http://archive.ics.uci.edu/ml/datasets.php
https://www.re3data.org/
https://www.data.gov/
https://www.kdnuggets.com/datasets/index.html
http://dataportals.org/

Using NN to perform Genome Assembly

A Machine Learning Approach to DNA Shotgun Sequence Assembly (2015)
DNA Fragment Assembly Using Neural Prediction Techniques (1999)

The main idea is to use a neural network (NN) for read prediction. Reads that share the same prediction pattern are clustered into groups, and an existing assembler is then run on each group separately.
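The partition-then-assemble idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method of either paper: `predict_pattern` is a hypothetical stand-in that buckets reads by GC content instead of running a trained neural network, and `assembler` is any hypothetical callable that takes a list of reads and returns contigs. Reads are assumed to be non-empty strings.

```python
from collections import defaultdict


def predict_pattern(read):
    """Hypothetical stand-in for a trained NN: label a read by its GC content."""
    gc_fraction = sum(base in "GC" for base in read) / len(read)
    return "high_gc" if gc_fraction >= 0.5 else "low_gc"


def partition_reads(reads, predictor=predict_pattern):
    """Group reads that receive the same predicted pattern."""
    groups = defaultdict(list)
    for read in reads:
        groups[predictor(read)].append(read)
    return dict(groups)


def assemble_partitions(groups, assembler):
    """Run an existing assembler (passed in as a callable) on each group."""
    return {label: assembler(group) for label, group in groups.items()}


reads = ["GGCC", "ATAT", "GCGC"]
groups = partition_reads(reads)
```

Each group is smaller than the full read set, so the downstream assembler works on an easier subproblem; how well this works depends entirely on how well the predictor separates reads.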

Genome assembly by statisticians

GAML: genome assembly by maximum likelihood (2015)
Bayesian Genome Assembly and Assessment by Markov Chain Monte Carlo Sampling (2014)
ILP-based maximum likelihood genome scaffolding (2014)
Toward a statistically explicit understanding of de novo sequence assembly (2013)
CGAL: computing genome assembly likelihoods (2013)
De novo likelihood-based measures for comparing genome assemblies (2013)
ALE: a generic assembly likelihood Read more…

Large genome assembly

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter (2017) departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Read more…
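To illustrate how a Bloom filter can stand in for an explicit de Bruijn graph, here is a minimal sketch. It is not ABySS 2.0's implementation: the class and function names, the BLAKE2b-based hashing, and the default sizes are all assumptions chosen for readability. The k-mers are stored only as set bits, and graph edges are recovered by querying the four possible one-base extensions of a k-mer.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_graph(reads, k, num_bits=1 << 16, num_hashes=3):
    """Insert all k-mers of all reads into a Bloom filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return bloom


def successors(bloom, kmer):
    """Return k-mers that overlap kmer by k-1 bases and appear to be present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


bloom = build_graph(["ACGTAC"], k=4)
next_kmers = successors(bloom, "ACGT")
```

Memory use is fixed by `num_bits` regardless of how many k-mers are inserted, which is the source of the savings; the cost is that `successors` may occasionally report a k-mer that was never inserted, so a real assembler needs some strategy for handling false-positive edges.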

Improve assembly

A comparative evaluation of genome assembly reconciliation tools (2017) benchmarked seven assembly reconciliation tools, namely CISA, GAA, GAM_NGS, GARM, Metassembler, MIX, and ZORRO. Despite the inability of these assembly tools to solve the general assembly reconciliation problem, each tool demonstrated some strengths that could lead to algorithmic advances for this problem. For instance, CISA generally was Read more…

Genome assembly evaluation

SuRankCo: supervised ranking of contigs in de novo assemblies (2015). A machine learning approach to predict quality scores for contigs and to enable the ranking of contigs within an assembly. Information on characteristics of contigs from a de novo assembly is extracted by the SuRankCo-Feature module. These features include common characteristics such as length Read more…