Assembled genomic and tissue-specific transcriptomic data resources for two genetically distinct lines of Cowpea ( Vigna unguiculata (L.) Walp)

Cowpea ( Vigna unguiculata (L.) Walp) is an important legume crop for food security in areas of low-input and smallholder farming throughout Africa and Asia. Genetic improvements are required to increase yield and resilience to biotic and abiotic stress and to enhance cowpea crop performance. An integrated cowpea genomic and gene expression data resource has the potential to greatly accelerate breeding and the delivery of novel genetic traits for cowpea. Extensive genomic resources for cowpea have been absent from the public domain; however, a recent early release reference genome for IT97K-499-35 ( Vigna unguiculata v1.0, NSF, UCR, USAID, DOE-JGI, http://phytozome.jgi.doe.gov/) has now been established in a collaboration between the Joint Genome Institute (JGI) and University California (UC) Riverside. Here we release supporting genomic and transcriptomic data for two transformable cowpea varieties, IT97K-499-35 and IT86D-1010. The transcriptome resource includes six tissue-specific datasets for each variety, with particular emphasis on reproductive tissues that extend and support the V. unguiculata v1.0 reference. Annotations have been included in our resource to allow direct mapping to the v1.0 cowpea reference. The resource described here is supported by downloadable raw and assembled sequence data.


Amendments from Version 1
All of the reviewer comments were very helpful towards improving the clarity, accuracy and utility of the data presented and all were addressed in the text as detailed below. In addition the data repository supporting the paper has been extended to provide the quality metrics across all the raw DNA and RNA sequence files used in this data collection.
• Changes in Abstract: º We more clearly stated here that both cowpea varieties sequenced are transformable.
• Changes in the Introduction: º We clarified the contributions to the generation of the Vigna unguiculata v1.0 reference assembly to be "generated by UC Riverside and subsequently annotated by the Joint Genome Institute". º We deposited an additional data file in the online repository describing a range of quality metrics associated with all raw sequence files to assist in assessment of the quality of the raw sequence used in the paper. We added text in the methods to direct the reader to this file.
• Changes in Data Set Validation: º We improved the text to clarify the trend observed for 70-90% overlap thresholds of assembled contigs to the reference genome.
º We improved the text to clarify when denovo assembled transcript contigs were being counted, in contrast to predicted gene models in the genome reference.

REVISED
of multiple varieties with improved yield and stress tolerance. However, further improvement is required as many varieties in use exhibit low yield, disease susceptibility, and are prone to abiotic stress (Hall, 2012). Reproductive characteristics have been revisited in cowpea recently and developmental calendars developed for two cowpea varieties developed by IITA, IT86D-1010 and IT97K-499-35 together with supporting developmental experimental tools to support seed yield improvements (Salinas-Gamboa et al., 2016). One approach to increase yield aims to alter sexual reproductive development in high yielding hybrids to an asexual mode in order to assess if it is feasible to save hybrid cowpea seed each growing season (Salinas-Gamboa et al., 2016; Capturing Heterosis OPP1076280). Technological advances in genetic profiling and DNA sequencing approaches over the last decade have facilitated the recent establishment of genomic resources for cowpea (Muñoz-Amatriaín et al., 2017). These data resources have the potential to rapidly accelerate cowpea crop improvement through molecular assisted breeding, characterisation of population diversity and various genomic editing technologies.
The cowpea genome (2n=22) has an estimated size of 620 megabases (Mb) (Chen et al., 2007) . Despite the substantial contribution and utility of these resources, they did not represent a complete contiguous sequence or 'reference' assembly of the cowpea genome.

Introduction
Cowpea (Vigna unguiculata (L.) Walp) is a versatile grain legume crop, also cultivated for vegetative consumption and animal fodder. The grain provides a rich source of protein (25% by weight) for human consumption. Cowpea was domesticated in sub-Saharan Africa and is relatively resilient to heat and drought stress. It has the ability to fix atmospheric nitrogen, and cowpea is often intercropped with cereals or used in crop rotations. Cowpea is grown frequently on subsistence and smallholder farms in mixed crop-livestock systems, particularly in low-input farming systems in the semi-arid regions of West and Central Africa, South America, and Asia (Singh, 2014). Cowpea is a vital component for nutrient security in global agricultural communities.
Cowpea crop improvement has been led by the International Institute of Tropical Agriculture (IITA) through the generation In this publication, we describe and release survey genome assemblies and tissue-specific transcriptome assemblies derived from IT86D-1010 and IT97K-499-35 to supplement and extend the existing cowpea sequence resources. These cowpea varieties, of different pedigrees, are transformable using Agrobacteriummediated gene insertion (Popelka et al., 2006). They therefore represent important genetic resources for investigating and substantiating gene function. In addition, their genomic and transcriptomic characterisation will enable identification and testing of cell-type specific promoters and genic tools that should facilitate the examination and synthesis of reproductive pathways to improve seed yield in cowpea. We have therefore developed transcriptomic resources to characterise expressed genes in leaf and importantly floral tissues undergoing male and female gametogenic development, and early seed initiation.
The survey genome assembly of IT97K-499-35 supports the reference genome assembly, Vigna unguiculata v1.0, of IT97K-499-35; however, the IT86D-1010 data resource is the first public genome-scale resource for this variety. Additional cowpea transcriptome resources are provided for leaf and reproductive tissues for both IT86D-1010 and IT97K-499-35. In accordance with the policies of early release genomes, an extensive comparative analysis of data provided here with the reference assembly (Vigna unguiculata v1.0) is not provided. However, we have annotated our transcriptomic and genomic contig data with coordinates of the v1.0 reference, based on IT97K-499-35, to further facilitate integration of publicly available cowpea genome and transcriptome resources.
Transcriptomes of multiple tissues derived from IT97K-499-35 have been generated and previously published (Yao et al., 2016; http://vugea.noble.org). Tissues previously profiled were predominately vegetative and included leaf, stem, root and flower from 5-week-old plants, empty seed pods at 6, 10 and 16 days after pollination and seeds at 8, 10, 14 and 18 days after pollination (DAP) (Yao et al., 2016). In this publication, we provide the first transcriptomic characterisation in both IT97K-499-35 and IT86D-1010 for floral tissues undergoing male and female gametogenic development, and early seed initiation.
The work described in this publication provides a unique and valuable extension to emerging genomic and transcriptomic resources in cowpea. These foundational resources will enable identification and testing of cell-type specific promoters and genic tools that should facilitate the examination and synthesis of reproductive pathways to improve seed yield in cowpea. All transcriptomic and genomic resources are provided with coordinate-based annotation to the IT97K-499-35 reference genome (V. unguiculata v1.0) providing integration of these resources to assist coordinated scientific progression of the cowpea research community.

Plant materials and tissue collection
Cowpea lines IT86D-1010 and IT97K-499-35 were originally sourced from the International Institute of Tropical Agriculture (IITA) and their pedigrees are provided in Supplementary  Figure 1. Lines have been maintained in CSIRO for more than 10 generations (not through single seed descent). Material (IT97K-499-35) used in the generation of the reference assembly (Vigna unguiculata v1.0) was sourced from Mike Timko at the School of Medicine, University of Virginia, who had previously received the material from IITA. UC Riverside took the IT97K-499-35 line through single seed descent and confirmed 100% homozygosity before bulking. Analysis is underway to compare the CSIRO lines and UC Riverside lines to quantitatively assess genetic similarity of the independently sourced seed stocks. The plants were grown as described by Salinas-Gamboa et al. (2016). Young unexpanded leaves were collected for DNA and total RNA extraction for both lines. The reproductive calendars developed for these varieties by Salinas-Gamboa et al. (2016) were used to harvest a set of five reproductive tissue types from both IT86D-1010 and IT97K-499-35. Anther tissues containing developing male gametophytes at pollen mother cell, tetrad and mature bicellular pollen stages were pooled to form a pooled male gametophyte (PMG) sample for both lines. In addition, ovules were extracted from both lines from floral buds to provide individual tissue samples containing differentiated megaspore mother cells (MMCs), female meiotic tetrads (FMT), and mature female gametophytes (MFG) at anthesis. Finally, early developing seeds (ES) were collected post-fertilization containing a mixture of zygotes and early globular embryos with proliferating endosperm.
Nucleic acid extraction and sequencing DNA and RNA extractions were carried out using a Qiagen maxi DNA kit and Qiagen RNeasy plant mini kit, respectively, as per the manufacturer's instructions. Illumina library preparation and sequencing of DNA and RNA was undertaken by the Australian Genome Research Facility (AGRF) with 2 × 100 bp standard insert paired-end sequencing using a Hiseq 2500 system. Shotgun sequencing libraries from single IT86D-1010 and IT97K-499-35 genomic DNA samples were prepared using the Illumina TruSeq Nano DNA Library Prep Kit as per the manufacturer's instructions. Whereas the Illumina TruSeq Stranded mRNA Library Prep Kit was used to prepare poly(A) mRNA sequencing libraries from total RNA samples as per the manufacturer's instructions. Three replicate libraries were prepared and sequenced for each of the RNA samples except for the IT97K-499-35 MMC and FMT samples, where two replicates were sequenced.

Sequence analysis
Raw genomic DNA sequencing reads from IT86D-1010 and IT97K-499-35 were separately assembled into contigs using Biokanga (version 4.3.6) in a multi-step process. First, raw reads were run through 'biokanga filter', where common adapter, primer, and vector contaminants were identified and trimmed. Redundant copies of identical paired-end read pairs were removed, and pairs with no sequence overlap to other raw sequence were also removed as they provided no value to the assembly. Filtered paired-end reads were then assembled into contigs, using 'biokanga assemb', with default parameterisation that allows 1 base substitution per 100bp of sequence overlap. Resulting contigs were run through a second 'reassembly' step with 'biokanga assemb', allowing up to 5 base substitutions per 100bp of sequence overlaps to provide reduction in redundant sequence representations. Finally, 'biokanga scaffold' added ordering to some contigs, by identifying raw paired-end pairs that match to ends of different contigs under assumptions of sequencing insert fragment size of 110-1500bp. Raw tissue-specific RNASeq reads were separately assembled into transcriptome contigs using Biokanga, with the same multi-step process as used for the genomic DNA reads (above), without the reassembly step to retain putative transcript isoforms. Summary quality metrics for all DNA and RNA sequence reads are provided in the data repository associated with this paper as Supplementary Data 1. These metrics include read length, average GC content and average quality score across the length of the read, the read midpoint and read end. To complete sequence alignment analysis, the genomic DNA sequencing reads and the tissue-specific RNASeq reads from IT86D-1010 and IT97K-499-35 were pre-processed by 'biokanga filter' as described above, prior to alignment with the genomic sequence assemblies of IT86D-1010 and IT97K-499-35 and to the Vigna unguiculata v1.0 reference genome sequences. The software 'biokanga align' was used for these alignments and unique-best alignments for each paired-end sequence with an insert fragment size of 100-1000bp to a genomic sequence were reported, with at most 3 base substitutions per 100bp. Autoend-trimming (read chimera detection) was permitted to 50bp where required. Detection and reporting of SNPs between DNA or RNA sequencing reads and assembled genomes was enabled where there was coverage of at least 5 reads.

Dataset validation
Genomic sequence data for IT86D-1010 and IT97K-499-35 A total of 527 and 303 million pair-end DNA sequence reads from IT86D-1010 and IT97K-499-35, were generated, respectively. These were assembled into 39,123 contigs for IT86D-1010 and 57,690 contigs for IT97K-499-35 with average lengths of 15.6 and 9.8 kilobases (kb), respectively ( Table 1). The contig assemblies generated were able to incorporate 68-73% of the raw DNA reads generated ( Table 2). The majority (>87%) of the assembled genomic contigs from IT86D-1010 and IT97K-499-35 could be mapped to the V. unguiculata v1.0 reference genome (Table 3) with a minimum of 70% contig overlap. When the required contig overlap increased to 90%, contig mapping to the reference decreased to an average of 63% across assembled datasets. Possible causes for lack of mapping at higher stringencies are loss of contiguous alignment or loss of fidelity of assembly towards the end of the survey assembly contigs. In-silico gene prediction identified approximately 60,000 putative coding sequences in both IT86D-1010 and IT97K-499-35 and nearly 70% of these could be annotated to published protein sequences within the NCBI nr public database (Table 4).
Leaf and reproductive cell-type and seed transcriptomes and genomic comparisons RNA sequencing of the six tissue transcriptomes for each variety generated read counts varying from 125 to 265 million pair-end sequences. These could be assembled into transcript   sets varying in size between 35,000 to 74,000 transcript contigs averaging 1 kilobase in length (Table 5 and Table 6). In both cowpea varieties, leaf transcriptomes were the largest in terms of de novo assembled contig numbers and the anther transcriptomes were the smallest. In subsequent analyses RNA sequence read alignment to predicted gene models within the assembled genome resources were used to compare expression counts across tissues. The assembled genome resources for both cowpea varieties provided good coverage for the analysis of RNA sequence reads as approximately 70% of reads across all tissues  could be aligned uniquely to all three genomic resources. Transcriptomes derived from IT86D-1010 displayed slightly greater alignment to the IT86D-1010 genomic resource, than the corresponding comparisons for IT97K-499-35 (Table 7). The majority of transcript contigs (80 to 88%) across all tissues in both cultivars could be mapped to the V. unguiculata v1.0 reference genome with a minimum of 70% contig coverage ( Table 3). The remaining unmapped percentage could represent a range of scenarios including IT86D-1010 specific contigs, missing regions in the V. unguiculata v1.0 reference genome, tissue-specific extensions to the IT97K-499-35 resource or misassembled transcript contigs. Predicted gene models were considered expressed if they accrued at least 20 uniquely aligning RNASeq reads. In all tissues, approximately 30% of predicted gene models  (Table 8) showed expression and 6% of predicted gene models displayed strong tissue-specific expression. We found that on average 90% of IT86D-1010 transcript contigs could be mapped within a IT86D-1010 genomic contig and that the median genomic contig size was 67 kb relative to median transcript contig size of 1.3 kb. This indicates that this resource contains substantial amounts of genomic sequence context around expressed genes in these tissues. This will be important for future 1.

2.
3. I am somewhat confused by the consistency of the proportion of total gene models that accrue RNA sequencing reads (Table 8) versus the statement on page 5 that leaf transcriptomes had the largest number of contigs and anthers the smallest. I must not be grasping the counting methods. Please clarify.

Open Peer Review
Plant materials and tissue collection. Were these two lines/accessions propagated by single seed descent, such that one would expect every plant within each line/accession to be genetically identical or nearly so? Or was more than on plant taken forward at each generation? Single seed descent for many generations accomplishes homozygosity, providing a single haplotype, which simplifies sequence assembly and alignments. If records are available to address this, then please add that information. Table 3. The values in the 90% column all are considerably lower than the 50% and 70% columns. Is there a simple explanation for this?
We received IT97K-499-35 from Mike Timko, who had previously received it from IITA. We took it through three rounds of SSD before bulking and, fortunately, found that the plant used to establish the bulk seed was 100% homozygous. Please adjust the wording on p.4 to indicate that Mike Timko was our source of this accession.
JGI annotated the pseudomolecules that were developed at UC Riverside.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound?

Gates Open Research
Are the datasets clearly presented in a useable and accessible format?

Yes
No competing interests were disclosed. The Data Note "Assembled genomic and tissue-specific transcriptomic data ressources for two genetically disctinct lines of Cowpea ( (L.) Walp)" by A. Spriggs and co-workers Vigna unguiculata describes the obtention of genomic and transcriptomic data from two Cowpea varieties, IT97K-499-35 and IT86D-1010.
While an unclustered genome assembly is already publicly available for Cowpea ( v1.1; Vigna unguiculata https://phytozome.jgi.doe.gov/pz/portal.html), the genomic dataset presented in this Data Note for two distinct varieties of Cowpea increases the sequence availability for this species.
The transcriptomics data obtained mainly focusses on reproductive tissues, including anthers and ovules from dissected floral buds at different stages. The five reproductive stages used for RNAseq are pooled male gametophyte (PMG), megaspore mother cell (MMC), female meiotic tetrad (FMT), mature female gametophyte (MFG) and early developing seeds (ES) containing young developing embryos. In addition to these, young unexpanded leaves were also used. All those samples were sequenced in triplicates for both varieties of cowpea, except MMC and FMT from IT97K-499-35 where only two replicates were sequenced. These six tissue specific datasets per variety will be a very useful ressource for differencial expression analysis in reproductive studies of Cowpea.
The rationale to create the datasets is clearly and well explained in the introduction section. The methods and protocols used could be expanded as detailed below. The datasets are accessible though CSIRO Data Access portal, and are presented appropriately in the article as data analysis summaries in tables.
Comments: Table 5 and 6 show "details of the assembled tissue-specific polyA RNA sequences" from both varieties, but the nucleic acid extraction section of the Methods does not mention polyA RNA isolation. If polyA RNA was obtained from each tissue, the protocol used for it should be detailed in the methods.
There are no details about the genomic and cDNA library constructions. Even if those were done commercially by AGRF, details on how the libraries were made should be provided.
The analysis of the raw sequences was done using Biokanga, a CSIRO developed suite of Next Generation Sequencing analysis tools. Some information about the quality control analysis of the reads obtained for each library should be presented (Basic statistics, Per base and per sequence reads obtained for each library should be presented (Basic statistics, Per base and per sequence quality scores, GC content, Miscalled bases…) in order to evaluate the quality of the raw data.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format? Yes No competing interests were disclosed.

Competing Interests:
Referee Expertise: Plant developmental biology, molecular biology I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.