<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">Gates Open Res</journal-id>
            <journal-title-group>
                <journal-title>Gates Open Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2572-4754</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/gatesopenres.12832.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 1 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Finak</surname>
                        <given-names>Greg</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Mayer</surname>
                        <given-names>Bryan</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Fulp</surname>
                        <given-names>William</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Obrecht</surname>
                        <given-names>Paul</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Sato</surname>
                        <given-names>Alicia</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Chung</surname>
                        <given-names>Eva</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Holman</surname>
                        <given-names>Drienna</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Gottardo</surname>
                        <given-names>Raphael</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="corresp" rid="c2">b</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA</aff>
                <aff id="a2">
                    <label>2</label>Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA</aff>
                <aff id="a3">
                    <label>3</label>Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:gfinak@fredhutch.org">gfinak@fredhutch.org</email>
                </corresp>
                <corresp id="c2">
                    <label>b</label>
                    <email xlink:href="mailto:rgottard@fredhutch.org">rgottard@fredhutch.org</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>22</day>
                <month>6</month>
                <year>2018</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2018</year>
            </pub-date>
            <volume>2</volume>
            <elocation-id>31</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>19</day>
                    <month>6</month>
                    <year>2018</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2018 Finak G et al.</copyright-statement>
                <copyright-year>2018</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://gatesopenresearch.org/articles/2-31/pdf"/>
            <abstract>
                <p>A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large &#x2018;omics&#x2019; or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst&#x2019;s usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual&#x2019;s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R&#x2019;s package system combined with a new tool 
                    <italic toggle="yes">DataPackageR</italic>, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the 
                    <italic toggle="yes">DataPackageR</italic> tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Assay data</kwd>
                <kwd>Bioconductor</kwd>
                <kwd>Collaboration</kwd>
                <kwd>Data science</kwd>
                <kwd>Package</kwd>
                <kwd>Reproducibility</kwd>
                <kwd>Rmarkdown</kwd>
                <kwd>Rstats</kwd>
                <kwd>Version control</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="http://dx.doi.org/10.13039/100000865">
                    <funding-source>Gates Foundation</funding-source>
                    <award-id>OPP1151646</award-id>
                </award-group>
                <award-group id="fund-2" xlink:href="http://dx.doi.org/10.13039/100000057">
                    <funding-source>National Institute of General Medical Sciences</funding-source>
                    <award-id>R01GM118417-01A1</award-id>
                </award-group>
                <funding-statement>This work was supported by the Gates Foundation [OPP1032317; to RG]; and the National Institute of General Medical Sciences [R01 GM118417-01A1; to GF].</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>A central idea of reproducible research is that results are published along with underlying data and software code necessary to reproduce and verify the findings. Termed a 
                <italic toggle="yes">research compendium</italic>, this idea has received significant attention in the literature
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-5">5</xref>
                </sup>.</p>
            <p>Many software tools have since been developed to facilitate reproducible data analytics, and scientific journals have increasingly demanded that code and primary data be made publicly available with scientific publications
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-27">27</xref>
                </sup>. Tools like git and GitHub, figshare, and Rmarkdown are increasingly used by researchers to make code, figures, and data open, accessible and reproducible. Nonetheless, in the life sciences, practicing reproducible research with large data sets and complex processing pipelines continues to be challenging.</p>
            <p>Data preprocessing, quality control (QC), data standardization, analysis, and reporting are tightly coupled in most discussions of reproducible research, and indeed, literate programming frameworks such as Sweave and Rmarkdown are designed around the idea that code, data, and research results are tightly integrated
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-25">25</xref>
                </sup>. Tools like Docker, a software container that virtualizes an operating system environment for distribution, have been used to ensure consistent versions of software and other dependencies are used for reproducible data analysis
                <sup>
                    <xref ref-type="bibr" rid="ref-16">16</xref>
                </sup>. The use of R in combination with other publicly available tools has been proposed in the past to build reproducible research compendia
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>,
                    <xref ref-type="bibr" rid="ref-28">28</xref>,
                    <xref ref-type="bibr" rid="ref-29">29</xref>
                </sup>. Many existing tools already implement such ideas. The 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/jdblischak/workflowr#quick-start">
                    <italic toggle="yes">workflowr</italic> package</ext-link> provides mechanisms to turn a data analysis project into a version-controlled, documented website presenting the results. The 
                <italic toggle="yes">drake</italic> package
                <sup>
                    <xref ref-type="bibr" rid="ref-30">30</xref>
                </sup> is a general purpose work-flow manager that implements analytic "plans", caches intermediate data objects, is scalable, and provides tangible evidence of reproducibility by detecting when code, data and results are in sync.</p>
            <p>However, tight coupling of preprocessing and analysis can be challenging for teams analyzing and integrating large volumes of diverse data, where different individuals in the team have different areas of expertise and may be responsible for processing different data sets from a larger study. These challenges are compounded when a processing pipeline is split across multiple teams. A primary problem in data science is the programmatic integration of software tools with dynamic data sources.</p>
            <p>Here, we argue that data processing, QC, and analysis can be treated as modular components in a reproducible research pipeline. For some data types, it is already common practice to factor out the processing and QC from the data analysis. For RNA-seq data, for example, it is clearly impractical and time-consuming to re-run monolithic code that performs alignment, QC, gene expression quantification, and analysis each time the downstream analysis is changed. Our goal is to ensure that downstream data analysis maintains dependencies on upstream raw data and processing but that the processed data can be efficiently distributed to users in an independent manner and updated when there are changes.</p>
            <p>Here, we present how the Vaccine Immunology Statistical Center (VISC) at the Fred Hutchinson Cancer Research Center has addressed this problem and implemented a reproducible research work-flow that scales to medium-sized collaborative teams by leveraging free and open source tools, including R, Bioconductor and git
                <sup>
                    <xref ref-type="bibr" rid="ref-22">22</xref>,
                    <xref ref-type="bibr" rid="ref-31">31</xref>
                </sup>.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Operation</title>
                <p>
                    <italic toggle="yes">DataPackageR</italic> requires an R installation (&#x2265;3.5.0). Associated dependencies are listed in the package&#x2019;s DESCRIPTION file and are automatically installed when using the 
                    <italic toggle="yes">install_github()</italic> API from the 
                    <italic toggle="yes">devtools</italic> package. There are no minimum memory, CPU, or storage requirements apart from what is necessary to perform data processing, which varies on a case-by-case basis.</p>
            </sec>
            <sec>
                <title>Implementation</title>
                <p>Our work-flow is built around the 
                    <italic toggle="yes">DataPackageR</italic> R package, which provides a framework for decoupling data preprocessing from data analysis, while maintaining traceability and data provenance
                    <sup>
                        <xref ref-type="bibr" rid="ref-18">18</xref>
                    </sup>.</p>
                <p>

                    <italic toggle="yes">DataPackageR</italic> builds upon the features already provided by the R package system. R packages provide a convenient mechanism for including documentation as part of the built-in help system and as long-form 
                    <monospace>vignettes</monospace>, for recording version information, and for distributing the entire package. Importantly, R packages often include data stored as R objects, and some packages, particularly under Bioconductor, are devoted solely to the distribution of data sets
                    <sup>
                        <xref ref-type="bibr" rid="ref-22">22</xref>
                    </sup>. The accepted mechanism for such distribution is to store R objects as 
                    <monospace>rda</monospace> files in the 
                    <monospace>data</monospace> directory of the package source tree and to store the source code used to produce those data sets in 
                    <monospace>data-raw</monospace>. The 
                    <italic toggle="yes">devtools</italic> package provides some mechanisms to process the source code into stored data objects
                    <sup>
                        <xref ref-type="bibr" rid="ref-32">32</xref>
                    </sup>.</p>
                <p>Data processing code provided by the user (in the form of Rmd files preferably, and R files optionally) is run and the results are automatically included as package vignettes, with output data sets (specified by the user) included as data objects in the package. Notably, this process, while apparently mirroring much of the existing R package build process, is separate from it, thereby allowing the decoupling of computationally long or expensive data processing from routine package building and installation. This allows 
                    <italic toggle="yes">DataPackageR</italic> to decouple data munging and tidying from data analysis while maintaining data provenance in the form of a vignette in the final package where the end-user can view and track how individual data sets have been processed. This is particularly useful for large or complex data that involve extensive preprocessing of primary or raw data (e.g. alignment of FASTQ files for RNA-seq or gating of flow cytometry data), and where computation may be prohibitively long or involve software dependencies not immediately available to the end-user.</p>
                <p>

                    <italic toggle="yes">DataPackageR</italic> implements these features on top of a variety of 
                    <italic toggle="yes">tidyverse</italic> tools including 
                    <italic toggle="yes">devtools, roxygen2, rmarkdown, utils, yaml</italic>, and <italic toggle="yes">purrr</italic>. The complete list of package dependencies is in the package 
                    <monospace>DESCRIPTION</monospace> file.</p>
            </sec>
            <sec>
                <title>Package structure</title>
                <p>To construct a data package using 
                    <italic toggle="yes">DataPackageR</italic>, the user invokes the 
                    <monospace>datapackage.skeleton()</monospace> API, which behaves like R&#x2019;s 
                    <monospace>
                        <italic toggle="yes">package.skeleton()</italic>
                    </monospace>, creating the necessary directory structure with some modifications. The structure of a 
                    <italic toggle="yes">DataPackageR</italic> skeleton package directory, with other associated files, is shown below:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">package root
|--- datapackager.yml # Configuration file controlling
|                     # the package build process.
|--- DESCRIPTION # Adds a DataVersion  
|                # string to version the 
|                # data set.
|--- NAMESPACE
|--- DATADIGEST # Stores an MD5 hash of each 
|               # data object in the package.
|--- R
|--- Read-and-delete-me.txt    # Further instructions  
|                              # on building the package.
|--- data             # Holds processed, analysis-ready data objects.
|--- data-raw         # User code for data  
|                     # processing is placed here by
|                     # datapackage.skeleton().
|--- documentation.R # Autogenerated roxygen documentation
|                    # for data set objects.
|--- inst
| |___ extdata # (small) raw data files.
| |___ doc   # Processed vignettes are moved here.
|            # Data processing code is accessible in the
|            # final package via the vignette() API.
|--- vignettes # Scripts in data-raw
|              # are processed into vignettes.
|___ man # Autogenerated documentation is processed
         # into Rd files.</styled-content>
                    </preformat>
                </p>
                <p>The 
                    <monospace>datapackage.skeleton()</monospace> API takes several new arguments apart from the package 
                    <italic toggle="yes">name</italic>. First, 
                    <monospace>
                        <italic toggle="yes">code_files</italic>
                    </monospace> takes a vector of paths to 
                    <italic toggle="yes">Rmd</italic> or 
                    <italic toggle="yes">R</italic> scripts that perform the data processing. These are moved into the 
                    <monospace>data-raw</monospace> directory by 
                    <monospace>
                        <italic toggle="yes">datapackage.skeleton()</italic>
                    </monospace>. The argument 
                    <monospace>
                        <italic toggle="yes">r_object_names</italic>
                    </monospace> takes a vector of quoted R object names. These are objects that are to be stored in the final package and it is expected that they are created by the code in the 
                    <italic toggle="yes">R</italic> or 
                    <italic toggle="yes">Rmd</italic> files. These can be 
                    <italic toggle="yes">tidy</italic> data tables, or arbitrary R objects and data structures (e.g. 
                    <italic toggle="yes">S4</italic> objects) that will be consumed by the package end-user. Information about the processing scripts and data objects is stored in a configuration file named 
                    <monospace>datapackager.yml</monospace> in the package root directory and only used by the package build process. The scripts may read raw data from any location, but generally the package maintainer should place it in 
                    <monospace>inst/extdata</monospace> if file size is not prohibitive for distribution.</p>
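                <p>As a rough illustration of the call described above, the sketch below writes a small processing script and passes it to the skeleton function. The package name <monospace>MyDataPackage</monospace>, the script contents, and the <monospace>path</monospace> argument (borrowed from <monospace>package.skeleton()</monospace>) are assumptions made for illustration only. The call is skipped unless an installed <italic toggle="yes">DataPackageR</italic> exports a function under the name used in the text, since released versions may name it differently.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># Hypothetical processing script that creates one object, dataset_one.
script &lt;- file.path(tempdir(), "process_dataset_one.R")
writeLines(
  "dataset_one &lt;- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))",
  script
)

# Run only if an installed DataPackageR exports the function named in the text.
has_skeleton &lt;- requireNamespace("DataPackageR", quietly = TRUE)
if (has_skeleton) {
  has_skeleton &lt;- "datapackage.skeleton" %in% getNamespaceExports("DataPackageR")
}

if (has_skeleton) {
  DataPackageR::datapackage.skeleton(
    name = "MyDataPackage",         # assumed package name
    path = tempdir(),               # assumed argument, as in package.skeleton()
    code_files = script,            # script to be placed in data-raw
    r_object_names = "dataset_one"  # object the script is expected to create
  )
}</styled-content>
                    </preformat>
                </p>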
            </sec>
            <sec>
                <title>The build_package API</title>
                <p>Once code and data are in place, the 
                    <monospace>build_package()</monospace> API invokes the build process. This API is the only way to invoke the execution of code in 
                    <monospace>data-raw</monospace> to produce data sets stored in 
                    <monospace>data</monospace>. It is not invoked through R&#x2019;s standard 
                    <monospace>R CMD build</monospace> or 
                    <monospace>R CMD INSTALL</monospace> commands, thereby decoupling long and computationally intensive processing from the standard build process invoked by end-users. Upon invocation of 
                    <monospace>build_package()</monospace> the 
                    <italic toggle="yes">R</italic> and 
                    <italic toggle="yes">Rmd</italic> files specified in 
                    <monospace>datapackager.yml</monospace> will be compiled into 
                    <italic toggle="yes">package vignettes</italic> and moved into the 
                    <monospace>inst/doc</monospace> directory, data objects will be created and moved into 
                    <monospace>data</monospace>, data objects will be version tagged with their 
                    <italic toggle="yes">checksum</italic> and recorded in the 
                    <italic toggle="yes">DATADIGEST</italic> file in the package root, and a 
                    <italic toggle="yes">roxygen</italic> markup skeleton will be created for each data object in the package.</p>
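                <p>The checksum bookkeeping described above can be illustrated with base R alone. The sketch below is not the <italic toggle="yes">DataPackageR</italic> implementation: it assumes that an MD5 hash of each object&#x2019;s uncompressed serialized bytes is an adequate fingerprint, and the helper name <monospace>digest_objects</monospace> and the example objects are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># Hypothetical helper: MD5 fingerprint of each object in a named list.
# Assumption: the hash is taken over the uncompressed serialized bytes;
# the format and hashing used for DATADIGEST may differ.
digest_objects &lt;- function(objects) {
  vapply(objects, function(obj) {
    tmp &lt;- tempfile(fileext = ".rds")
    on.exit(unlink(tmp), add = TRUE)
    saveRDS(obj, tmp, compress = FALSE)
    unname(tools::md5sum(tmp))
  }, character(1))
}

# Fingerprints recorded at the previous build ...
previous &lt;- digest_objects(list(
  dataset_one = data.frame(x = c(1, 2, 3)),
  dataset_two = letters[1:3]
))

# ... and at the current build, after dataset_one was reprocessed.
current &lt;- digest_objects(list(
  dataset_one = data.frame(x = c(1, 2, 4)),
  dataset_two = letters[1:3]
))

# Objects whose fingerprint changed would warrant a new data version.
changed &lt;- names(current)[current != previous[names(current)]]
print(changed)</styled-content>
                    </preformat>
                </p>
                <p>Comparing fingerprints rather than the objects themselves keeps the record small enough to store in a plain text file alongside the package.</p>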
            </sec>
            <sec>
                <title>YAML configuration</title>
                <p>The 
                    <monospace>datapackager.yml</monospace> configuration file in the package root controls the build process by specifying which 
                    <monospace>R</monospace> and 
                    <monospace>Rmd</monospace> files should be processed and which 
                    <monospace>named R objects</monospace> are expected to be included as data sets in the package. The listing below shows the structure of the YAML configuration file used by 
                    <italic toggle="yes">DataPackageR</italic> to control compilation and inclusion of data objects in the package:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000; font-style:italic">configuration:
  files:                             # files property lists 
    process_dataset_one.Rmd:         # R or Rmd code files 
      name: process_dataset_one.Rmd  # Each file has a name
      enabled: yes                   # The enabled property specifies
                                     # if the file should be processed
    process_dataset_two.Rmd:
      name: process_dataset_two.Rmd
      enabled: yes
  objects:                           # A list of the data objects created
  - dataset_one                      # by processing the files. 
  - dataset_two
  - dataset_three</styled-content>
                    </preformat>
                </p>
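                <p>To make the role of each property concrete, the following sketch parses a configuration like the one above with the <italic toggle="yes">yaml</italic> package and lists the files a build would process. It only reads the file format; it does not use the <italic toggle="yes">DataPackageR</italic> configuration API, and disabling the second file is an assumption made for illustration.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">library(yaml)  # CRAN package for reading and writing YAML

# Configuration text mirroring the listing above; the second file is
# disabled here purely for illustration.
config_text &lt;- "
configuration:
  files:
    process_dataset_one.Rmd:
      name: process_dataset_one.Rmd
      enabled: yes
    process_dataset_two.Rmd:
      name: process_dataset_two.Rmd
      enabled: no
  objects:
  - dataset_one
  - dataset_two
  - dataset_three
"
config &lt;- yaml.load(config_text)

# Files flagged for processing, and the objects expected from them.
files &lt;- config$configuration$files
enabled &lt;- vapply(files, function(f) isTRUE(f$enabled), logical(1))
files_to_process &lt;- names(files)[enabled]
expected_objects &lt;- unlist(config$configuration$objects)

print(files_to_process)
print(expected_objects)</styled-content>
                    </preformat>
                </p>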
                <p>The API for interacting with this file is outlined in 
                    <xref ref-type="table" rid="T1">Table 1</xref>.</p>
                <table-wrap id="T1" orientation="portrait" position="float">
                    <label>Table 1. </label>
                    <caption>
                        <title>The API for interacting with a YAML config file used by DataPackageR allows the user to add and remove data objects and code files, toggle compilation of files, and read and write the configuration to the data package.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">API call</th>
                                <th align="left" colspan="1" rowspan="1">Property</th>
                                <th align="left" colspan="1" rowspan="1">Return value</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_add_files()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">file</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_remove_files()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">file</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_add_objects()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">objects</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_remove_objects()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">objects</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_find()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">config file</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_write()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">config file</td>
                                <td align="left" colspan="1" rowspan="1">null</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_enable_compile()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">enable</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>yml_disable_compile()</monospace>
                                </td>
                                <td align="left" colspan="1" rowspan="1">enable</td>
                                <td align="left" colspan="1" rowspan="1">config object</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
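                <p>As an illustration, the calls in 
                    <xref ref-type="table" rid="T1">Table 1</xref> might be combined as in the following minimal sketch. It assumes the argument order 
                    <monospace>(config, names)</monospace> and the in-memory constructor 
                    <monospace>construct_yml_config()</monospace> described in the package documentation; these may differ between package versions, so the help pages of the installed version should be consulted. The added file and object names are hypothetical.</p>
                <p>
                    <preformat>library(DataPackageR)

# Build a configuration in memory (assumed helper; for an existing package,
# yml_find() on the package path would read its YAML config file instead).
config = construct_yml_config(
  code = c("process_dataset_one.Rmd", "process_dataset_two.Rmd"),
  data = c("dataset_one", "dataset_two", "dataset_three")
)

# Register a new processing file and the object it is expected to create.
# Both names are hypothetical.
config = yml_add_files(config, "process_dataset_four.Rmd")
config = yml_add_objects(config, "dataset_four")

# Skip a long-running file on the next build without removing it.
config = yml_disable_compile(config, "process_dataset_two.Rmd")

# Drop an object that is no longer produced.
config = yml_remove_objects(config, "dataset_three")

# Write the configuration into a package root (a temporary directory here).
pkg_root = tempfile("pkgroot")
dir.create(pkg_root)
yml_write(config, path = pkg_root)</preformat>
                </p>
                <p>Each editing call returns the modified config object, so nothing is saved to disk until 
                    <monospace>yml_write()</monospace> is called.</p>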
            </sec>
            <sec>
                <title>Dataset versioning</title>
                <p>During the build, the 
                    <monospace>DATADIGEST</monospace> file is auto-generated. This file contains an 
                    <monospace>md5</monospace> hash of each data object stored in the package, as well as an overall data set version string. These hashes are checked when the package is rebuilt. A mismatch indicates that the processed data have changed, either because the primary data have changed or because the processing code has changed to update the data set. In these cases, the hash in 
                    <monospace>DATADIGEST</monospace> for the changed object is updated and the minor version of the 
                    <monospace>DataVersion</monospace> string in the 
                    <monospace>DESCRIPTION</monospace> file is automatically incremented. The 
                    <monospace>DataVersion</monospace> for a package can be checked with the 
                    <monospace>dataVersion()</monospace> API, allowing end-users to produce reports based on the expected version of a data set (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>).</p>
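                <p>The hash check and version increment described above can be sketched in base R. This is an illustration only, not the implementation used by 
                    <italic toggle="yes">DataPackageR</italic>: the helper names 
                    <monospace>object_md5()</monospace>, 
                    <monospace>bump_minor()</monospace> and 
                    <monospace>update_digest()</monospace> are hypothetical, and the sketch assumes that hashing an uncompressed serialized copy of each object with 
                    <monospace>tools::md5sum()</monospace> is an acceptable fingerprint.</p>
                <p>
                    <preformat># Hypothetical helper: md5 fingerprint of one R object, obtained by
# serializing it (uncompressed) to a temporary file and hashing the file.
object_md5 = function(obj) {
  path = tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(obj, path, compress = FALSE)
  unname(tools::md5sum(path))
}

# Hypothetical helper: increment the minor field of "major.minor.patch".
bump_minor = function(version) {
  parts = as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  parts[2] = parts[2] + 1L
  paste(parts, collapse = ".")
}

# Hypothetical helper: recompute the hashes of a named list of objects and
# compare them with the stored ones. The version is kept when all hashes
# match and its minor field is incremented otherwise.
update_digest = function(stored, objects) {
  fresh = vapply(objects, object_md5, character(1))
  if (identical(stored$md5, fresh)) {
    return(stored)
  }
  list(version = bump_minor(stored$version), md5 = fresh)
}

# Initial build: record a hash per object and a starting version.
objects = list(dataset_one = data.frame(x = 1:3))
stored = list(version = "0.1.0",
              md5 = vapply(objects, object_md5, character(1)))

# Rebuild with identical data: the version is unchanged.
same = update_digest(stored, objects)
print(same$version)

# Rebuild after the data changed: the minor version is incremented.
objects$dataset_one$x[1] = 10L
changed = update_digest(stored, objects)
print(changed$version)</preformat>
                </p>
                <p>In this sketch the first rebuild keeps version 0.1.0, while the second yields 0.2.0 because the hash of 
                    <monospace>dataset_one</monospace> no longer matches the stored value.</p>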
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>A schematic overview of the components in our reproducible data packaging work-flow that decouples data processing from data analysis.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://gatesopenresearch-files.f1000.com/manuscripts/13908/16d6e274-badd-44e4-be79-b9df0e52e603_figure1.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Data documentation</title>
                <p>
                    <italic toggle="yes">DataPackageR</italic> ensures that documentation is available for each data object included in a package by automatically creating a 
                    <italic toggle="yes">roxygen</italic> markup stub for each object that can then be filled in by the user. Undocumented objects are explicitly excluded from the final package.</p>
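                <p>For illustration, a documentation stub for an object named 
                    <monospace>dataset_one</monospace> might resemble the sketch below. The exact fields that 
                    <italic toggle="yes">DataPackageR</italic> generates are an assumption here, and the descriptive text is a placeholder; only the tags themselves are standard 
                    <italic toggle="yes">roxygen</italic> tags.</p>
                <p>
                    <preformat>#' Descriptive title for dataset_one (placeholder, to be written by the user)
#'
#' A short description of what the data set contains and how it was derived.
#'
#' @name dataset_one
#' @docType data
#' @format A data.frame; the columns should be described here by the user.
#' @source Produced by process_dataset_one.Rmd from the raw data.
"dataset_one"</preformat>
                </p>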
                <p>Packages can be readily distributed in source or tarball form (together with the processed data sets under 
                    <monospace>/data/</monospace> and raw data sets under 
                    <monospace>/inst/extdata</monospace>). Within VISC we leverage 
                    <italic toggle="yes">git</italic> and 
                    <italic toggle="yes">GitHub</italic> to provide version control of data package source code. With 
                    <italic toggle="yes">DataPackageR</italic>, the data processing is decoupled from the usual build process and does not need to be run by the end-user each time a package is downloaded and installed. Documentation in the form of 
                    <italic toggle="yes">Rd</italic> files, one for each data object in the package, as well as 
                    <italic toggle="yes">html vignettes</italic> describing the data processing, is included in the final package. These describe the data sets as well as how the data were transformed, filtered, and otherwise processed from their raw state.</p>
            </sec>
        </sec>
        <sec>
            <title>Use cases</title>
            <p>

                <italic toggle="yes">DataPackageR</italic> was developed as a lightweight alternative to existing reproducible work-flow tools (e.g. 
                <italic toggle="yes">Galaxy</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref-33">33</xref>
                </sup>), or to fully fledged database solutions that are often beyond the scope of most short-term projects. 
                <italic toggle="yes">DataPackageR</italic> plugs easily into any existing R-based data analysis work-flow, since existing data processing code only needs to be formatted as R or, ideally, Rmarkdown files. It is particularly suited for long-running or complex data processing tasks, or tasks that depend on large data sets that may not be available to the end user (e.g. FASTQ alignment or raw flow cytometry data processing). Such tasks do not fit well into the standard 
                <monospace>R CMD build</monospace> paradigm, for example as vignettes or as 
                <italic toggle="yes">.R</italic> files under 
                <monospace>/data</monospace>, since these would be invoked each time an end user builds a package from source. We desire, however, to maintain a link between the processed data sets and the processing code that generates them. We note that 
                <italic toggle="yes">DataPackageR</italic> is distinct from other reproducible research frameworks such as 
                <italic toggle="yes">workflowr</italic> or 
                <italic toggle="yes">drake</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref-30">30</xref>
                </sup>, in that it is designed to 
                <italic toggle="yes">reproducibly prepare data for analysis</italic>, using an 
                <italic toggle="yes">existing code base</italic>, with little additional effort. The product of 
                <italic toggle="yes">DataPackageR</italic> is nothing more than an R package that can be used by anyone. The resulting data packages are meant to be shared, to serve as the basis for further analysis (
                <xref ref-type="fig" rid="f1">Figure 1</xref>) and distributed as part of publications. These downstream analyses may leverage any of the existing work-flow management tools. Our goal is that data sets forming the basis of scientific findings can be confidently shared in their processed form, which is often much smaller and easier to distribute.</p>
            <p>Within the VISC, a team of analysts, statistical programmers, and data managers works collaboratively to analyze pre-clinical data arising from multiple trials, each with multiple assays. The challenge of ensuring that the entire team works from the same version of a frequently changing data set motivated the development of 
                <italic toggle="yes">DataPackageR</italic>. The tool is routinely used to process and standardize trial data for submission to The Collaboration for AIDS Vaccine Discovery (CAVD) Data Space.</p>
            <p>We demonstrate how 
                <italic toggle="yes">DataPackageR</italic> is used to process and package multiple types of assay data from an animal trial of an experimental HIV vaccine.</p>
            <sec>
                <title>Data</title>
                <p>We demonstrate the use of 
                    <italic toggle="yes">DataPackageR</italic> for processing data from a vaccine study, named 
                    <italic toggle="yes">MX1</italic>, designed to examine the antibody responses to heterologous N7 Env prime-boost immunization in macaques. The study had four treatment groups plus a control arm, with six animals per group. Samples were collected at three time points: t1: baseline, post-prime 2, post-boost 1, t2: post-boost 2, t3: post-boost 3. Six assays were run at each time point, using either serum samples or peripheral blood mononuclear cells (PBMCs). The assays were: 1) enzyme-linked immunosorbent assay (ELISA), an immunological assay that enables detection of antibodies, antigens, proteins and/or glycoproteins (serum); 2) a neutralizing antibody (Ab) assay (serum); 3) a binding antibody multiplex assay (BAMA) to assess antibody response breadth (serum); 4) a BAMA assay to permit epitope mapping (serum); 5) an antibody-dependent cellular cytotoxicity (ADCC) assay (serum); and 6) an intracellular cytokine staining assay to assess cellular responses (PBMCs).</p>
                <p>The raw data and environment to reproduce the processing with 
                    <italic toggle="yes">DataPackageR</italic> are distributed as a Docker image on 
                    <monospace>hub.docker.com</monospace> as 
                    <monospace>gfinak/datapackager:latest</monospace>. We have restricted the number of FCS files distributed in the container to limit the size of the image and speed up processing of FCM data for demonstration purposes.</p>
            </sec>
            <sec>
                <title>Flow cytometry and other assay data</title>
                <p>Flow cytometry (FCM) is a high-content, high-throughput assay for which VISC leverages specialized data processing and analytics tools. Raw FCS files and manual gate information in the form of FlowJo (FlowJo LLC, Ashland, OR) workspace files are uploaded directly to VISC by the labs. The raw data are processed with open-source Bioconductor software (
                    <italic toggle="yes">flowWorkspace</italic>) to import and reproduce the manual gating, extract cell subpopulation statistics, and access the single-cell event-level data required for downstream modeling of T-cell polyfunctionality and immunogenicity
                    <sup>
                        <xref ref-type="bibr" rid="ref-34">34</xref>&#x2013;
                        <xref ref-type="bibr" rid="ref-36">36</xref>
                    </sup>. Tables of extracted cell populations, cell counts, proportions, and fluorescence intensities are included in study packages, together with an Rmarkdown vignette describing the data processing. Due to the size of the raw FCS files, they are imported for processing from a location external to the data package source tree, so that the raw files are not part of the final package, but vignettes outlining the data processing are automatically included.</p>
                <p>The remaining assay data are of reasonable size and are provided as raw data in tabular (CSV) form. They are imported into the package, then processed and standardized from the 
                    <monospace>inst/extdata</monospace> package directory. Users can run and connect to an RStudio instance in the container, where code and data to build the 
                    <italic toggle="yes">MX1</italic> data package reside.</p>
            </sec>
        </sec>
        <sec>
            <title>Summary</title>
            <p>Reproducibility is increasingly emphasized for scientific publications. We describe a new utility R package, 
                <italic toggle="yes">DataPackageR</italic>, which helps automate and track the processing and standardization of diverse data sets into 
                <italic toggle="yes">analysis-ready data packages</italic> that can be easily distributed for analysis and publication. 
                <italic toggle="yes">DataPackageR</italic>, when paired with a version control system such as 
                <italic toggle="yes">git</italic>, decouples data processing from data analysis while tracking changes to data sets, ensuring data objects are documented, and keeping a record of data processing pipelines as vignettes within the data package. The principle behind the tool is that it remains a lightweight and non-intrusive framework that easily plugs into most R-based data analytic work-flows. It places few restrictions on the user code; therefore, most existing scripts can be ported to use the package. The VISC has been using 
                <italic toggle="yes">DataPackageR</italic> for a number of years to perform reproducible end-to-end analysis of animal trial data, and the package has been used to publicly share data sets for a number of published manuscripts
                <sup>
                    <xref ref-type="bibr" rid="ref-35">35</xref>,
                    <xref ref-type="bibr" rid="ref-37">37</xref>,
                    <xref ref-type="bibr" rid="ref-38">38</xref>
                </sup>.</p>
        </sec>
        <sec>
            <title>Software and data availability</title>
            <p>Source code available from: 
                <ext-link ext-link-type="uri" xlink:href="http://github.com/RGLab/DataPackageR">http://github.com/RGLab/DataPackageR</ext-link>
            </p>
            <p>Archived source code as at time of publication: 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.1292095">https://doi.org/10.5281/zenodo.1292095</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-39">39</xref>
                </sup>
</p>
            <p>Reproducible examples: 
                <ext-link ext-link-type="uri" xlink:href="http://hub.docker.com/r/gfinak/datapackager/">http://hub.docker.com/r/gfinak/datapackager/</ext-link>
            </p>
            <p>License: MIT license</p>
            <p>Partial data for the MX1 study to demonstrate processing using 
                <italic toggle="yes">DataPackageR</italic> are available as a Docker image from 
                <monospace>gfinak/datapackager:latest</monospace>. The processed MX1 data are available on the CAVD DataSpace data sharing and discovery tool at 
                <ext-link ext-link-type="uri" xlink:href="http://dataspace.cavd.org/">http://dataspace.cavd.org</ext-link> under the study identifier CAVD 451. At the time of publication the complete data set is available to CAVD members only.</p>
        </sec>
    </body>
    <back>
        <ack>
            <title>Acknowledgements</title>
            <p>The authors wish to thank the members of the VISC and the CAVD Data Space (CDS) for their contributions to testing and feedback on the software. We also acknowledge the Collaboration for AIDS Vaccine Discovery (CAVD), the Comprehensive Antibody Vaccine Immune Monitoring Consortium (CAVIMC) and the Comprehensive Cellular Vaccine Immune Monitoring Consortium (CCVIMC), as well as Dr. Shiu-Lok Hu.</p>
        </ack>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Baggerly</surname>
                            <given-names>KA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Coombes</surname>
                            <given-names>KR</given-names>
                        </name>
</person-group>:
                    <article-title>What information should be required to support clinical "omics" publications?</article-title>
                    <source>

                        <italic toggle="yes">Clin Chem.</italic>
</source>
                    <year>2011</year>;<volume>57</volume>(<issue>5</issue>):<fpage>688</fpage>&#x2013;<lpage>690</lpage>.
                    <pub-id pub-id-type="pmid">21364027</pub-id>
                    <pub-id pub-id-type="doi">10.1373/clinchem.2010.158618</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lang</surname>
                            <given-names>DT</given-names>
                        </name>
</person-group>:
                    <article-title>Statistical analyses and reproducible research.</article-title>
                    <italic toggle="yes">Bioconductor Project Working Papers. </italic>
                    <year>2004</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://lcolladotor.github.io/courses/Courses/R/resources/StatisticalAnalysesAndReproducibility.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Marwick</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Boettiger</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Mullen</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>Packaging data analytical work reproducibly using R (and friends)</article-title>.
                    <source>

                        <italic toggle="yes">PeerJ Preprints</italic>
</source>; PeerJ Inc.,<year>2018</year>.
                    <pub-id pub-id-type="doi">10.7287/peerj.preprints.3192v2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Stodden</surname>
                            <given-names>V</given-names>
                        </name>
</person-group>:
                    <article-title>Enabling reproducible research: Open licensing for scientific innovation.</article-title>
                    <source>

                        <italic toggle="yes">International Journal of Communications Law and Policy.</italic>
</source>
                    <year>2009</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://web.stanford.edu/~vcs/papers/ERROLSI03092009.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Stodden</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Borwein</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bailey</surname>
                            <given-names>DH</given-names>
                        </name>
</person-group>:
                    <article-title>Publishing standards for computational science: "Setting the default to reproducible"</article-title>.<year>2013</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://carma.newcastle.edu.au/jon/SIAMNews.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Lortie</surname>
                            <given-names>CJ</given-names>
                        </name>
</person-group>:
                    <article-title>A review of R for data science: Key elements and a critical analysis</article-title>.
                    <source>

                        <italic toggle="yes">PeerJ Preprints</italic>
</source>; PeerJ Inc.,<year>2017</year>.
                    <pub-id pub-id-type="doi">10.7287/peerj.preprints.2873v1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wickham</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Grolemund</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>R for data science: Import, tidy, transform, visualize, and model data</article-title>. O&#x2019;Reilly Media, Inc.,<year>2016</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://shop.oreilly.com/product/0636920034407.do">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Huang</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gottardo</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Comparability and reproducibility of biomedical data.</article-title>
                    <source>

                        <italic toggle="yes">Brief Bioinform.</italic>
</source>
                    <year>2013</year>;<volume>14</volume>(<issue>4</issue>):<fpage>391</fpage>&#x2013;<lpage>401</lpage>.
                    <pub-id pub-id-type="pmid">23193203</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bib/bbs078</pub-id>
                    <pub-id pub-id-type="pmcid">3713713</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Buck</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>Solving reproducibility.</article-title>
                    <source>

                        <italic toggle="yes">Science.</italic>
</source>
                    <year>2015</year>;<volume>348</volume>(<issue>6242</issue>):<fpage>1403</fpage>.
                    <pub-id pub-id-type="pmid">26113692</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.aac8041</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Peng</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>The reproducibility crisis in science: A statistical counterattack.</article-title>
                    <source>

                        <italic toggle="yes">Significance.</italic>
</source>
                    <year>2015</year>;<volume>12</volume>(<issue>3</issue>):<fpage>30</fpage>&#x2013;<lpage>32</lpage>.
                    <pub-id pub-id-type="doi">10.1111/j.1740-9713.2015.00827.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Morrison</surname>
                            <given-names>SJ</given-names>
                        </name>
</person-group>:
                    <article-title>Time to do something about reproducibility.</article-title>
                    <source>

                        <italic toggle="yes">eLife.</italic>
</source>
                    <year>2014</year>;<volume>3</volume>:<fpage>e03981</fpage>.
                    <pub-id pub-id-type="pmid">25493617</pub-id>
                    <pub-id pub-id-type="doi">10.7554/eLife.03981</pub-id>
                    <pub-id pub-id-type="pmcid">4260475</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yaffe</surname>
                            <given-names>MB</given-names>
                        </name>
</person-group>:
                    <article-title>Reproducibility in science.</article-title>
                    <source>

                        <italic toggle="yes">Sci Signal.</italic>
</source>
                    <year>2015</year>;<volume>8</volume>(<issue>371</issue>):<fpage>eg5</fpage>.
                    <pub-id pub-id-type="pmid">25852185</pub-id>
                    <pub-id pub-id-type="doi">10.1126/scisignal.aaa5764</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Begley</surname>
                            <given-names>CG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ioannidis</surname>
                            <given-names>JP</given-names>
                        </name>
</person-group>:
                    <article-title>Reproducibility in science: Improving the standard for basic and preclinical research.</article-title>
                    <source>

                        <italic toggle="yes">Circ Res.</italic>
</source>
                    <year>2015</year>;<volume>116</volume>(<issue>1</issue>):<fpage>116</fpage>&#x2013;<lpage>126</lpage>.
                    <pub-id pub-id-type="pmid">25552691</pub-id>
                    <pub-id pub-id-type="doi">10.1161/CIRCRESAHA.114.303819</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Stodden</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Leisch</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Peng</surname>
                            <given-names>RD</given-names>
                        </name>
</person-group>:
                    <article-title>Implementing reproducible research</article-title>. CRC Press,<year>2014</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://books.google.co.in/books/about/Implementing_Reproducible_Research.html?id=JcmSAwAAQBAJ&amp;redir_esc=y">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Freedman</surname>
                            <given-names>LP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Inglese</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>The increasing urgency for standards in basic biologic research.</article-title>
                    <source>

                        <italic toggle="yes">Cancer Res.</italic>
</source>
                    <year>2014</year>;<volume>74</volume>(<issue>15</issue>):<fpage>4024</fpage>&#x2013;<lpage>4029</lpage>.
                    <pub-id pub-id-type="pmid">25035389</pub-id>
                    <pub-id pub-id-type="doi">10.1158/0008-5472.CAN-14-0925</pub-id>
                    <pub-id pub-id-type="pmcid">4975040</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Boettiger</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>An introduction to docker for reproducible research.</article-title>
                    <source>

                        <italic toggle="yes">Oper Syst Rev.</italic>
</source>
                    <year>2015</year>;<volume>49</volume>(<issue>1</issue>):<fpage>71</fpage>&#x2013;<lpage>79</lpage>.
                    <pub-id pub-id-type="doi">10.1145/2723872.2723882</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>McNutt</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>Journals unite for reproducibility.</article-title>
                    <source>

                        <italic toggle="yes">Science.</italic>
</source>
                    <year>2014</year>;<volume>346</volume>(<issue>6210</issue>):<fpage>679</fpage>.
                    <pub-id pub-id-type="pmid">25383411</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.aaa1724</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Peng</surname>
                            <given-names>RD</given-names>
                        </name>
</person-group>:
                    <article-title>Reproducible research in computational science.</article-title>
                    <source>

                        <italic toggle="yes">Science.</italic>
</source>
                    <year>2011</year>;<volume>334</volume>(<issue>6060</issue>):<fpage>1226</fpage>&#x2013;<lpage>1227</lpage>.
                    <pub-id pub-id-type="pmid">22144613</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.1213847</pub-id>
                    <pub-id pub-id-type="pmcid">3383002</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Reproducible research: A bioinformatics case study.</article-title>
                    <source>

                        <italic toggle="yes">Stat Appl Genet Mol Biol.</italic>
</source>
                    <year>2005</year>;<volume>4</volume>(<issue>1</issue>): Article2.
                    <pub-id pub-id-type="pmid">16646837</pub-id>
                    <pub-id pub-id-type="doi">10.2202/1544-6115.1034</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Peng</surname>
                            <given-names>RD</given-names>
                        </name>
</person-group>:
                    <article-title>Reproducible research and Biostatistics.</article-title>
                    <source>

                        <italic toggle="yes">Biostatistics.</italic>
</source>
                    <year>2009</year>;<volume>10</volume>(<issue>3</issue>):<fpage>405</fpage>&#x2013;<lpage>408</lpage>.
                    <pub-id pub-id-type="pmid">19535325</pub-id>
                    <pub-id pub-id-type="doi">10.1093/biostatistics/kxp014</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Mesirov</surname>
                            <given-names>JP</given-names>
                        </name>
					</person-group>:
                    <article-title>Computer science. Accessible reproducible research.</article-title>
                    <source>
						
                        <italic toggle="yes">Science.</italic>
					</source>
                    <year>2010</year>;<volume>327</volume>(<issue>5964</issue>):<fpage>415</fpage>&#x2013;<lpage>416</lpage>.
                    <pub-id pub-id-type="pmid">20093459</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.1179653</pub-id>
                    <pub-id pub-id-type="pmcid">3878063</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>RC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Carey</surname>
                            <given-names>VJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bates</surname>
                            <given-names>DM</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Bioconductor: open software development for computational biology and bioinformatics.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2004</year>;<volume>5</volume>(<issue>10</issue>):<fpage>R80</fpage>.
                    <pub-id pub-id-type="pmid">15461798</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2004-5-10-r80</pub-id>
                    <pub-id pub-id-type="pmcid">545600</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Gottardo</surname>
                            <given-names>R</given-names>
                        </name>
					</person-group>:
                    <article-title>Promises and Pitfalls of High-Throughput Biological Assays.</article-title>
                    <source>
						
                        <italic toggle="yes">Methods Mol Biol.</italic>
					</source>
                    <year>2016</year>;<volume>1415</volume>:<fpage>225</fpage>&#x2013;<lpage>243</lpage>.
                    <pub-id pub-id-type="pmid">27115636</pub-id>
                    <pub-id pub-id-type="doi">10.1007/978-1-4939-3572-7_12</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Allaire</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Rmarkdown: Dynamic documents for R</article-title>.<year>2015</year>.</mixed-citation>
            </ref>
            <ref id="ref-25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Xie</surname>
                            <given-names>Y</given-names>
                        </name>
					</person-group>:
                    <article-title>Knitr: A comprehensive tool for reproducible research in R.</article-title>
                    <source>
						
                        <italic toggle="yes">Implement Reprod Res.</italic>
					</source>
                    <year>2014</year>;<volume>1</volume>:<fpage>20</fpage>.
                    <ext-link ext-link-type="uri" xlink:href="https://books.google.co.in/books?hl=en&amp;lr=&amp;id=WVTSBQAAQBAJ&amp;oi=fnd&amp;pg=PA3&amp;ots=qSxB6aFgOX&amp;sig=HUXpUeXqVWctlChMt8UIfZYk23c&amp;redir_esc=y#v=onepage&amp;q&amp;f=false">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Baumer</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Udwin</surname>
                            <given-names>D</given-names>
                        </name>
					</person-group>:
                    <article-title>R markdown.</article-title>
                    <source>
						
                        <italic toggle="yes">WIREs Comput Stat.</italic>
					</source>
                    <year>2015</year>;<volume>7</volume>(<issue>3</issue>):<fpage>167</fpage>&#x2013;<lpage>177</lpage>.
                    <pub-id pub-id-type="doi">10.1002/wics.1348</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-27">
                <label>27</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Ram</surname>
                            <given-names>K</given-names>
                        </name>
					</person-group>:
                    <article-title>Git can facilitate greater reproducibility and increased transparency in science.</article-title>
                    <source>
						
                        <italic toggle="yes">Source Code Biol Med.</italic>
					</source>
                    <year>2013</year>;<volume>8</volume>(<issue>1</issue>):<fpage>7</fpage>.
                    <pub-id pub-id-type="pmid">23448176</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1751-0473-8-7</pub-id>
                    <pub-id pub-id-type="pmcid">3639880</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-28">
                <label>28</label>
                <mixed-citation publication-type="journal">
                    <collab>The rOpenSci Project</collab>:
                    <article-title>Use of an R package to facilitate reproducible research</article-title>.<year>2015</year>.</mixed-citation>
            </ref>
            <ref id="ref-29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <collab>The rOpenSci Project</collab>:
                    <article-title>A guide to reproducible research</article-title>.<year>2015</year>.</mixed-citation>
            </ref>
            <ref id="ref-30">
                <label>30</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Landau</surname>
                            <given-names>WM</given-names>
                        </name>
					</person-group>:
                    <article-title>The drake R package: A pipeline toolkit for reproducibility and high-performance computing.</article-title>
                    <source>
						
                        <italic toggle="yes">JOSS.</italic>
					</source>
                    <year>2018</year>;<volume>3</volume>(<issue>21</issue>):<fpage>550</fpage>.
                    <pub-id pub-id-type="doi">10.21105/joss.00550</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-31">
                <label>31</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Ihaka</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>
					</person-group>:
                    <article-title>R: A language for data analysis and graphics.</article-title>
                    <source>
						
                        <italic toggle="yes">J Comput Graph Stat.</italic>
					</source>
                    <year>1996</year>;<volume>5</volume>(<issue>3</issue>):<fpage>299</fpage>&#x2013;<lpage>314</lpage>.
                    <pub-id pub-id-type="doi">10.2307/1390807</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-32">
                <label>32</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Wickham</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chang</surname>
                            <given-names>W</given-names>
                        </name>
					</person-group>:
                    <article-title>Devtools: Tools to make developing R packages easier</article-title>.<year>2018</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org/web/packages/devtools/devtools.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-33">
                <label>33</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Goecks</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Nekrutenko</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Taylor</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2010</year>;<volume>11</volume>(<issue>8</issue>):<fpage>R86</fpage>.
                    <pub-id pub-id-type="pmid">20738864</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2010-11-8-r86</pub-id>
                    <pub-id pub-id-type="pmcid">2945788</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-34">
                <label>34</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Jiang</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Andre</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>FlowWorkspace: A new R package for importing flow cytometry data into bioconductor from flowJo</article-title>. Fred Hutchinson Cancer Research Center; Poster B232, CYTO 2011 XXVI Congress of the International Society for Advancement of Cytometry, Baltimore Convention Center, Baltimore, Maryland, USA, May 21&#x2013;25, <year>2010</year>.</mixed-citation>
            </ref>
            <ref id="ref-35">
                <label>35</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Lin</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ushey</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>COMPASS identifies t-cell subsets correlated with clinical outcomes.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Biotechnol.</italic>
					</source>
                    <year>2015</year>;<volume>33</volume>(<issue>6</issue>):<fpage>610</fpage>&#x2013;<lpage>616</lpage>.
                    <pub-id pub-id-type="pmid">26006008</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3187</pub-id>
                    <pub-id pub-id-type="pmcid">4569006</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-36">
                <label>36</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>McDavid</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chattopadhyay</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Mixture models for single-cell assays with applications to vaccine studies.</article-title>
                    <source>
						
                        <italic toggle="yes">Biostatistics.</italic>
					</source>
                    <year>2014</year>;<volume>15</volume>(<issue>1</issue>):<fpage>87</fpage>&#x2013;<lpage>101</lpage>.
                    <pub-id pub-id-type="pmid">23887981</pub-id>
                    <pub-id pub-id-type="doi">10.1093/biostatistics/kxt024</pub-id>
                    <pub-id pub-id-type="pmcid">3862207</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-37">
                <label>37</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>McDavid</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Yajima</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2015</year>;<volume>16</volume>:<fpage>278</fpage>.
                    <pub-id pub-id-type="pmid">26653891</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-015-0844-5</pub-id>
                    <pub-id pub-id-type="pmcid">4676162</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-38">
                <label>38</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bolton</surname>
                            <given-names>DL</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>McGinnis</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Combined single-cell quantitation of host and SIV genes and proteins 
                        <italic toggle="yes">ex vivo</italic> reveals host-pathogen interactions in individual cells.</article-title>
                    <source>
						
                        <italic toggle="yes">PLoS Pathog.</italic>
					</source>
                    <year>2017</year>;<volume>13</volume>(<issue>6</issue>):<fpage>e1006445</fpage>.
                    <pub-id pub-id-type="pmid">28654687</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.ppat.1006445</pub-id>
                    <pub-id pub-id-type="pmcid">5507340</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-39">
                <label>39</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Finak</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Obrecht</surname>
                            <given-names>P</given-names>
                        </name>
					</person-group>:
                    <article-title>RGLab/DataPackageR v0.13.2 (Version v0.13.2).</article-title>
                    <source>
						
                        <italic toggle="yes">Zenodo.</italic>
					</source>
                    <year>2018</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.1292312">Data Source</ext-link>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report26541">
        <front-stub>
            <article-id pub-id-type="doi">10.21956/gatesopenres.13908.r26541</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Lun</surname>
                        <given-names>Aaron T.L.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r26541a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-3564-4813</uri>
                </contrib>
                <aff id="r26541a1">
                    <label>1</label>Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>25</day>
                <month>6</month>
                <year>2018</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2018 Lun ATL</copyright-statement>
                <copyright-year>2018</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport26541" related-article-type="peer-reviewed-article" xlink:href="10.12688/gatesopenres.12832.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In this article, Finak et al. describe a new package for pre-processing and sharing of data within the R programming language. Their 
                <bold>DataPackageR</bold> package provides a standardized framework for traceable pre-processing and version control of R data objects, facilitating reproducible research across a diverse team of collaborators. The manuscript is clear and concise, and the function and benefits of the software are clear. Nonetheless, I have some comments (listed below) that the authors might consider to improve both the software and the manuscript. 
                <list list-type="order">
                    <list-item>
                        <p>The processed data are stored as 
                            <italic>RData</italic> files in the 
                            <italic>data/</italic> directory and distributed as part of the constructed data package. These files can become somewhat large (&gt; 100 MB), which in and of itself is not a problem; indeed, having to deal with large data files may not be avoidable in some contexts. However, the text on page 5 and Figure 1 suggest that the user should place the generated data files under version control via Git prior to distribution. With Git, every clone of the package will contain every version of all data files, inflating the size of the repository for download and on disk. This may become prohibitive for practical use (e.g., GitHub forbids uploads of files above a certain size), and feels unnecessary given that only one version of the data will be active at any one time. Perhaps the authors would consider supporting the versioning and acquisition of 
                            <italic>RData</italic> files from a separate location, mimicking the behaviour of (or directly using) Bioconductor's 
                            <bold>ExperimentHub</bold> where large data files are downloaded and cached locally as needed?</p>
                    </list-item>
                    <list-item>
                        <p>The YAML header seems to be limited to R scripts or 
                            <italic>Rmd</italic> files for data pre-processing. However, as the authors would appreciate, a lot of pre-processing is performed on the command line, e.g., with aligners and related software. It would be awkward (and involve extra work) to have to wrap these scripts in R/
                            <italic>Rmd</italic> files in order for them to be executed by 
                            <bold>DataPackageR</bold>. To provide a concrete example: I would like 
                            <bold>DataPackageR</bold> to directly call my existing Bash scripts for alignment of RNA-seq data, followed by execution of 
                            <italic>Rmd</italic> reports for assigning reads to genes to get a count table that is saved as an 
                            <italic>RData</italic> object.</p>
                    </list-item>
                    <list-item>
                        <p>Does the YAML header permit dependencies between scripts? Say I have a long pre-processing pipeline that is split across multiple scripts for convenience. Is the order of files in the header reflected in the order of execution? And are they executed in the same location, so that intermediate files produced by one script (that are not intended to be included in the final package) can be picked up as inputs to the following script?</p>
                    </list-item>
                    <list-item>
                        <p>The MX1 use case on pages 6-7 seems under-described - I don't get an intuition as to why 
                            <bold>DataPackageR</bold> was beneficial for this project. It would be demonstrative to have some specific examples of how the version control and change tracking were advantageous in a collaborative environment. For example, how many 
                            <italic>DataVersion</italic> bumps were performed throughout the course of the project, and why? Were there any issues arising from changes to the processed data, and how were these resolved?</p>
                    </list-item>
                </list>
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
</article>
