Submit data to NCBI’s Sequence Read Archive (SRA)

I was recently contacted by an SRA curator and asked to submit the raw PacBio datasets that go with genomes already deposited at NCBI. I went through with the submission, and here I share what I did and how it went.

Be prepared to answer a lot of questions about your project and to write short paragraphs describing the project or specific samples. The more information you add, the better for the community.

When submitting reads to the archive, you will need to reference a BioSample that is itself linked to a BioProject. In my case those had already been created because the genomes were already on NCBI, so I just added the identifiers when prompted.

The “How to submit to SRA” link will take you to the SRA’s tutorial.

Log in to the SRA website using your NCBI account. If you do not have one, create one.

Create a new submission. All data grouped under the same submission will share the same release date. You can set your data to be released on a particular date; the earliest option is the following day at midnight. I am not sure how long you can keep your data private, but the prefilled release date is set to 1 year from the day you start the submission.

Once your submission is created, you can create a new experiment. Each experiment is linked to a sample, and you can have multiple experiments linked to the same BioSample. For this step you will need:

The name of the platform, an alias, a title, a BioProject identifier and a BioSample identifier. You will also need to specify a name for the library construction, a strategy (from a list of predefined terms), and the Source and Selection method used.

Optionally, you can add links to other databases or websites for the particular experiment. Pipeline information can also be entered but is not required. Once you have entered the required data, you can save, and your experiment will show up in the submission table.
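Since the form asks for quite a few values, it can help to collect them in one place before you start typing. The sketch below is only an organizing aid, not anything the SRA provides: the field names are hypothetical labels for the required items listed above, and the helper just reports which ones are still empty.

```python
# Hypothetical labels for the required experiment fields described above.
REQUIRED_FIELDS = [
    "platform",
    "alias",
    "title",
    "bioproject_id",
    "biosample_id",
    "library_name",
    "library_strategy",
    "library_source",
    "library_selection",
]


def missing_fields(experiment):
    """Return the required fields that are absent or blank in `experiment`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = experiment.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

Calling `missing_fields` on a partly filled dictionary gives you a to-do list of what still needs to be looked up before you open the form.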

Add a run to the experiment by clicking “new run”, and give it an alias so you can remember which dataset you are uploading. Select the data type (in my case PacBio_HDF5).

For an HDF5 submission, you need to submit 4 files: 3 bax.h5 files and 1 bas.h5 file. I compressed the files using gzip, but it did not reduce the file size much (maybe I will skip this next time). You can find the h5 files in the Analysis_Results folder of the PacBio data that I downloaded from the sequencing center.
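With several runs in one folder it is easy to miss a file, so a quick check that each run has its 3 bax.h5 files and 1 bas.h5 file can save a failed submission. This is a minimal sketch that assumes the files are named `<movie>.<part>.bax.h5` and `<movie>.bas.h5`; adjust the parsing if your sequencing center uses a different naming scheme. The function names are made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def group_pacbio_files(folder):
    """Group .bax.h5 and .bas.h5 files in `folder` by their shared prefix."""
    runs = defaultdict(lambda: {"bax": [], "bas": []})
    for path in sorted(Path(folder).iterdir()):
        name = path.name
        if name.endswith(".bax.h5"):
            # Assumed naming: <movie>.<part>.bax.h5, so drop the part number too.
            prefix = name[: -len(".bax.h5")].rsplit(".", 1)[0]
            runs[prefix]["bax"].append(name)
        elif name.endswith(".bas.h5"):
            prefix = name[: -len(".bas.h5")]
            runs[prefix]["bas"].append(name)
    return dict(runs)


def incomplete_runs(runs):
    """Return the prefixes that do not have exactly 3 bax.h5 and 1 bas.h5."""
    return [
        prefix
        for prefix, files in runs.items()
        if len(files["bax"]) != 3 or len(files["bas"]) != 1
    ]
```

Running `incomplete_runs(group_pacbio_files("Analysis_Results"))` should return an empty list when every run in the folder is complete.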

I have not submitted any other file format so far. There are descriptions on the SRA website for what they are expecting for each format or platform.

Before you upload the files to the SRA ftp, you will need to enter each file name along with the MD5 checksum for that file. The checksum lets the SRA staff confirm that the file they received is identical to the one you intended to send, that is, complete and not corrupted by an interrupted connection or anything else that can go wrong when transferring 1 or 2 GB files. The description and help text on the new run page are fairly clear.

On my Mac, the command to get the checksum is

> md5 <filename>
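If you are not on a Mac, or have many files to process, the same checksums can be computed with Python's standard `hashlib` module. The following is a minimal sketch; the function names are my own, and the files are read in chunks so that multi-gigabyte h5 files do not have to fit in memory.

```python
import hashlib
import os


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_table(paths):
    """Map each file name (without its directory) to its MD5 digest."""
    return {os.path.basename(p): md5_checksum(p) for p in paths}
```

The dictionary returned by `checksum_table` gives you the file name and checksum pairs to copy into the run form.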

Once the 4 filenames had been added with their corresponding checksums, I needed to actually send the files over to the SRA. This is what gave me the most trouble. My ftp/sftp settings must have been configured in a way that kept me from using the command line tool. I ended up using FileZilla with the appropriate credentials on port 21 (the port number can change depending on your settings, but this worked for me). I created a directory named after my BioProject ID so I wouldn’t just dump data in a top level directory. I pushed my 4 files over and waited. The transfer rate was very fast (1.5Mb/s) most of the time but dropped to 100Kb/s for a little while. I would recommend letting this run overnight or on the weekend. Make sure your computer does not go to sleep after long periods of inactivity and that your internet connection stays alive.
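If you prefer a script to a graphical client, the same steps (create a directory, move into it, push the files) can be sketched with Python's standard `ftplib`. This is not what I used, and it is untested against the SRA server: the host, user, password and directory name in the comments are placeholders, and the function assumes you pass it a connection that is already logged in.

```python
import os
from ftplib import FTP, error_perm


def upload_files(ftp, remote_dir, paths):
    """Upload `paths` into `remote_dir` over an already logged-in FTP connection."""
    try:
        ftp.mkd(remote_dir)  # create the per-project directory
    except error_perm:
        pass  # assumption: the error means the directory already exists
    ftp.cwd(remote_dir)
    for path in paths:
        with open(path, "rb") as handle:
            ftp.storbinary("STOR " + os.path.basename(path), handle)


# Example usage with placeholder values (replace with your own credentials):
#
# with FTP() as ftp:
#     ftp.connect("ftp.example.org", 21)
#     ftp.login("my_user", "my_password")
#     upload_files(ftp, "PRJNA000000", ["run1.bas.h5", "run1.1.bax.h5"])
```

Keeping the connection setup outside `upload_files` makes it easy to reuse one login for several runs.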

Repeat for any other runs and samples you have. My 6 samples with 12 runs (about 40 GB of data) took 4-5 hours of upload time, plus a couple of hours to learn the submission form and to gather and enter the required data.

Once the data is uploaded, the SRA system will link the files to the runs you just added to your experiment. You can check the status of the file links here.

The following websites/papers were very helpful.

Section 12 (Data Submission) of the Swab to Genome paper describes the BioSample and BioProject creation.


Guillaume Jospin

Guillaume Jospin is a bioinformatics engineer in Jonathan Eisen's lab.