As the correct replication of results is still dependent on matching software versions, we recommend utilizing RepeatFS with software managed by a virtual or container environment, such as Anaconda or Docker. When used in this manner, RepeatFS is an easy-to-use tool for ensuring reproducibility for virtually any type of informatics analysis.
RepeatFS: a file system providing reproducibility through provenance and automation. Anthony Westbrook (Department of Computer Science), Elizabeth Varki, and W Kelley Thomas (Hubbard Center for Genome Studies).
Table 1. Provenance complexity (I/O operations).
Scientific workflow systems can automate and manage complex processes and the huge amounts of data produced by petascale simulations.
Typically, the produced data need to be properly visualized and analyzed by scientists in order to achieve the desired scientific goals. Both run-time and post-run analysis may benefit from, or even require, additional provenance metadata. One of the challenges in this context is tracking the data files, which can be produced in very large numbers during stages of the workflow such as visualization.
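The file-tracking problem can be illustrated with a small sketch (the stage and file names here are hypothetical, and not part of any workflow system): snapshot a directory before and after a workflow stage runs, and the difference is the set of files that stage produced.

```python
import os
import tempfile

def run_stage_and_track(stage, workdir):
    """Run one workflow stage and return the set of files it created."""
    before = set(os.listdir(workdir))
    stage(workdir)
    after = set(os.listdir(workdir))
    return after - before

# Hypothetical stage: a visualization step writing two image frames.
def render_frames(workdir):
    for i in range(2):
        with open(os.path.join(workdir, f"frame_{i}.png"), "wb") as f:
            f.write(b"\x89PNG")  # placeholder bytes, not a real image

with tempfile.TemporaryDirectory() as tmp:
    print(sorted(run_stage_and_track(render_frames, tmp)))  # -> ['frame_0.png', 'frame_1.png']
```

Real provenance frameworks intercept file creation at run time rather than diffing directories, but the bookkeeping problem — associating each produced file with the stage that made it — is the same.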
The Kepler provenance framework collects all or part of the raw information flowing through the workflow graph. This information then needs to be further parsed to extract metadata of interest. In the general case, this is not a trivial task, due to the huge diversity in scientific workflows (command-line or GUI, interactive or batch jobs, solo or collaborative) and in computing environments (from laptops to supercomputers).
A wide variety of software tools has been developed to support reproducible research and provenance tracking in computational research. Each of these tools takes, in general, one of three approaches: literate programming, workflow management systems, or environment capture. Literate programming, introduced by Donald Knuth in the 1980s, interweaves text in a natural language, typically a description of the program logic, with source code. This is obviously useful for scientific provenance tracking, since the code and the results are inextricably bound together.
With most systems it is also possible for the system to automatically include information about software versions, hardware configuration and input data in the final document. This is not to say that the literate programming approach will not prove to be a good one in these scenarios (IPython includes good support for parallelism, for example, and many tools provide caching of results, so that only the code that has changed needs to be re-run), but the current generation of tools is generally more difficult to use in such settings.
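As a sketch of the kind of environment information such systems can record automatically, the following plain-Python snippet (not tied to any particular literate-programming tool) collects software-version and hardware details that could be embedded in the final document:

```python
import platform
import sys

def environment_summary():
    """Collect software-version and hardware details for a report."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),  # hardware architecture, e.g. 'x86_64'
        "executable": sys.executable,
    }

for key, value in environment_summary().items():
    print(f"{key}: {value}")
```

A notebook or report generator would record a dictionary like this alongside the results, so a reader later knows exactly which interpreter and platform produced them.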
Where literate programming focuses on expressing computations through code, workflow management systems express computations in terms of higher-level components, each performing a small, well-defined task with well-defined inputs and outputs, joined together in a pipeline. Obviously, there is code underlying each component, but for the most part this is expected to be code written by someone else; the scientist uses the system mainly by connecting together pre-existing components, or by wrapping existing tools (command-line tools, web services, etc.).
Workflow management systems are popular in scientific domains where there is a certain level of standardization of data formats and analysis methods, for example in bioinformatics and in any field that makes extensive use of image processing. The main disadvantage is that where no pre-existing components, nor easily wrapped tools (command-line tools or web services), exist for a given need, writing the code for a new component can be rather involved and requires detailed knowledge of the workflow system's architecture.
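The component model can be sketched in a few lines of Python (an illustration only; real workflow systems add type checking, scheduling, and provenance capture): each component is a small function with a well-defined input and output, and a pipeline is simply their composition. The read-processing steps below are made-up stand-ins for real bioinformatics components.

```python
from functools import reduce

# Each "component" performs one small, well-defined task.
def trim(reads):
    """Trim each read to its first five bases (stand-in for quality trimming)."""
    return [r[:5] for r in reads]

def deduplicate(reads):
    """Remove duplicate reads."""
    return sorted(set(reads))

def count(reads):
    """Count the remaining reads."""
    return len(reads)

def pipeline(*components):
    """Join components so each one's output feeds the next one's input."""
    return lambda data: reduce(lambda d, step: step(d), components, data)

analyse = pipeline(trim, deduplicate, count)
print(analyse(["ACGTACGT", "ACGTAAAA", "TTTTTTTT"]))  # -> 2
```

Writing a new component here is just writing a new function; the difficulty described above arises because real systems require components to declare their interfaces in the system's own framework rather than as plain functions.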
The simplest approach to environment capture is to capture the entire operating system in the form of a virtual machine (VM) image. When other scientists wish to replicate your results, you send them the VM image together with some instructions; they can then load the image on their own computer or run it in the cloud.
An interesting tool that supports a more lightweight approach (in terms of file size) than capturing an entire VM image, and that furthermore does not require virtualization technology, is CDE.
CDE works only on Linux, but will work with any program launched from the command line. After installing CDE, prepend your usual command-line invocation with the cde command. CDE will run the program as usual, but will also automatically detect all the files (executables, libraries, data files, etc.) that the program depends on and copy them into a package. This package can then be unpacked on any modern x86 Linux machine and the same commands run, using the versions of libraries and other files contained in the package rather than those on the new system.
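CDE discovers these dependencies at the operating-system level by intercepting file accesses. The underlying idea — that a running program's file dependencies can be enumerated automatically — can be sketched at the Python level (an illustration of the concept only, not how CDE itself works):

```python
import sys

def loaded_file_dependencies():
    """List the files backing the modules this program has loaded."""
    paths = set()
    for module in list(sys.modules.values()):
        path = getattr(module, "__file__", None)
        if path:  # built-in modules have no backing file
            paths.add(path)
    return sorted(paths)

print(f"{len(loaded_file_dependencies())} module files currently loaded")
```

Even this trivial program depends on a surprising number of files, which is exactly why manual dependency lists tend to be incomplete and automated capture is valuable.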
An alternative to capturing the entire experiment context (code, data, environment) as a binary snapshot is to capture all the information needed to recreate the context. This approach is taken by Sumatra. Sumatra is developed by the author of this tutorial.
I aim to be objective, but cannot guarantee this! An interesting approach would be to combine Sumatra and CDE, so as to have both information about the code, data, libraries, etc., and a package of the files themselves. I will now give a more in-depth introduction to Sumatra. It can be used directly in your own Python code or as the basis for interfaces that work with non-Python code. Currently there is a command-line interface, smt, which is mainly for configuring your project and launching computations, and a web interface, which is mainly for browsing and inspecting both the experiment outputs and the captured provenance information.
Suppose you already have a simulation project and are using Mercurial for version control. In your working directory (which contains your Mercurial working copy), use the smt init command to start tracking your project with Sumatra.
This creates a sub-directory in which Sumatra stores its project records. To see a list of the simulations you have run, use the smt list command. Each line of output is the label of a simulation record.
Labels should be unique within a project. By default, Sumatra generates a label for you based on the timestamp, but it is also possible to specify your own label, as well as to add other information that might be useful later, such as the reason for performing this particular simulation.
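A timestamp-based labelling scheme along the following lines (the exact format here is illustrative, not necessarily the one Sumatra uses) yields labels that are unique within a project and sort chronologically:

```python
from datetime import datetime

def make_label(now=None):
    """Generate a record label from a timestamp (unique and sortable)."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d-%H%M%S")

print(make_label(datetime(2021, 3, 14, 9, 26, 53)))  # -> 20210314-092653
```

Second-level resolution is enough here because a project launches at most one recorded simulation at a time; a system supporting concurrent launches would need to add a disambiguating suffix.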
After the simulation has finished, you can add further information, such as a qualitative assessment of the outcome or a tag for later searching. Comments and tags added in this way are attached to the most recent simulation record.
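The information attached to each record can be pictured as a small data structure (a simplified illustration of the fields discussed above, not Sumatra's actual schema; the label and reason are made up):

```python
from dataclasses import dataclass, field

@dataclass
class SimulationRecord:
    label: str                  # unique within the project
    reason: str = ""            # why the simulation was run
    outcome: str = ""           # qualitative assessment, added afterwards
    tags: set = field(default_factory=set)

record = SimulationRecord(label="20210314-092653",
                          reason="test the effect of a smaller timestep")
# After the run finishes, annotate the most recent record:
record.outcome = "run completed; results look plausible"
record.tags.add("timestep-study")
print(record.label, sorted(record.tags))
```

Storing the reason at launch time and the outcome afterwards mirrors how the provenance accumulates: part of the record is known before the computation, and part only once it has finished.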