Using the GenBank Submission Tool Tutorial

Learn how to correctly format sequences and alignments for submission to GenBank using the Geneious GenBank Submission tool, including adding the required GenBank meta-data and editing annotations so they contain the correct qualifiers.

Introduction

Most scientific journals require researchers to upload DNA and protein sequences cited in articles to public databases as part of the publication process. The main databases, which collectively form the International Nucleotide Sequence Database Collaboration (INSDC), are GenBank, administered by the National Center for Biotechnology Information (NCBI) in the USA (http://www.ncbi.nlm.nih.gov/genbank), the European Nucleotide Archive (ENA), hosted by the European Bioinformatics Institute (EMBL-EBI) in Europe (http://www.ebi.ac.uk), and the DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp).

These databases exchange information daily so it is only necessary to submit the sequence to one database. Once a sequence is submitted it is assigned an accession number, which allows other researchers to find and retrieve the sequence when the journal article is published.

The GenBank submission tool allows you to upload your sequences directly to GenBank from within Geneious Prime, retaining the annotations and features that will appear on the GenBank record.

INSTRUCTIONS
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.

What you can submit

The GenBank submission tool implements BankIt and is therefore designed only for simple submissions of a small number of sequences from mRNA, genomic DNA, organelles, ncRNA, plasmids, and some viral or phage genomes.

Sets of sequences from ribosomal RNA (rRNA), rRNA-ITS, metazoan mitochondrial COX1, Influenza, Norovirus, Dengue or SARS-CoV-2 cannot be submitted via Geneious. These should be submitted through the GenBank Submission Portal.

Small genomes, such as chloroplasts, mitochondria, plasmids, phages, and viruses can be submitted through Geneious. However larger genomes (e.g. those requiring Locus_tag or BioProject registration such as bacterial and eukaryotic chromosomes) cannot be submitted through Geneious.

Expressed sequence tags (ESTs), sequence tagged sites (STSs), genome survey sequences (GSSs), High-Throughput Genomic (HTG) sequences, Whole Genome Shotgun (WGS) sequences and Transcriptome Shotgun Assembly (TSA) sequences also cannot be submitted through Geneious. These need to be submitted using the appropriate channels on the NCBI website.

Raw sequencing reads from next-generation sequencing platforms should be submitted through the Sequence Read Archive (SRA).

The following data is not accepted by GenBank:

Non-contiguous sequences
Primer sequences
Protein sequences with no underlying nucleotide submission
Sequences containing a mix of genomic and mRNA sequence
Sequences without a physical counterpart (consensus sequences)
Sequences with length less than 200 nucleotides

For more information on submission types see the NCBI website.
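The length rule in the list above is easy to check before opening the submission tool. The following is a minimal sketch, not part of Geneious: the function name `too_short` and the input format (a dictionary of sequence name to sequence string, for example read from an exported FASTA file) are assumptions made for illustration. Alignment gap characters are ignored when counting.

```python
MIN_LENGTH = 200  # minimum length given in the list above

def too_short(sequences, min_length=MIN_LENGTH):
    """Return the names of sequences shorter than min_length.

    sequences: dict mapping sequence name -> nucleotide string.
    Gap characters ("-") are not counted, so aligned sequences can be passed in.
    """
    return [
        name
        for name, seq in sequences.items()
        if len(seq.replace("-", "")) < min_length
    ]
```

Any name returned by this check would need to be removed from the submission or replaced with a longer sequence.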

Exercise 1: Submission of a protein coding gene

1a. Adding GenBank fields to your document

The sequence Sppu-UZ is a partial sequence of a Major Histocompatibility Complex gene. It was isolated from the genomic DNA of Sphenodon punctatus (tuatara), a reptile native to New Zealand.

This portion of the tutorial will take you through the steps required to prepare the annotated gene sequence, Sppu-UZ, for submission to GenBank.

This first step involves adding general information about the sequence, including a description of the sequence, information about the source organism, tissue type, geographic location, sampling dates, etc. All GenBank submissions require this information.

To add this information, select the sequence in the Document table, then click the Info button in the document viewer panel. In the Properties tab you can add information about your sequence which can then be mapped to a GenBank field in the submission tool.

You will see that Name:, Description: and Molecule Type: are already entered. Add the organism to this document by clicking on Organism: and typing “Sphenodon punctatus”.

You can add additional information to map to GenBank fields by clicking Add meta data and choosing GenBank Submission as the meta-data type.

The default fields allow you to add information about the specimen from which the sequence was derived. These fields do not all have to be filled in if they are not relevant to your sample, and you can add additional fields as required.

For this sequence we want to add a sequence ID and some information on the sample that the sequence was derived from, but we don’t require information on the sampling locality. The Sequence ID should be a unique identifier that allows each sequence to be identified at all steps in the submission process before a unique accession number is assigned. It must not contain any spaces. In this case we will use the allele name, so click on Sequence ID and type Sppu-UZ03. Under Specimen Voucher type NZFT1234. This is the reference number for the blood sample that the sequence was derived from.
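When preparing many sequences, the two Sequence ID rules can be checked in bulk. This is a sketch under stated assumptions: it treats the rule above as "no whitespace characters", which may be narrower or broader than what the submission tool actually enforces, and `check_sequence_ids` is a hypothetical helper, not a Geneious function.

```python
from collections import Counter

def check_sequence_ids(ids):
    """Return a list of human-readable problems found in a list of Sequence IDs."""
    problems = []
    counts = Counter(ids)  # keeps first-seen order of the IDs
    for seq_id, count in counts.items():
        if not seq_id or any(ch.isspace() for ch in seq_id):
            problems.append(f"{seq_id!r}: empty or contains whitespace")
        if count > 1:
            problems.append(f"{seq_id!r}: used {count} times")
    return problems
```

An empty list means every ID is non-empty, free of whitespace and unique within the batch.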

To add information on the tissue type we will add an additional field. Click Edit meta-data types, ensure GenBank Submission is selected, and click the + sign next to the Collected By text. This will bring up a blank box as in the screenshot below. Type “Tissue type” in here and click OK.

Then in the Info window click Tissue type and enter “Blood”. Leave the other fields blank and click Save.

1b. Formatting annotations for protein coding genes

GenBank records for protein-coding genes require information on the coding region, intron/exon boundaries, and protein translation for that gene. In this exercise you will learn how to correctly format annotations containing this information.

Switch back to the Sequence view of the Sppu-UZ sequence. You’ll see that this sequence already has CDS and exon annotations. For submission to GenBank, protein-coding genes also require a “gene” annotation. Add this annotation by selecting the entire sequence, then clicking the Add Annotation button in the toolbar to bring up the annotation dialog.

Under Name type “Sppu-UZ”, and select gene as the Annotation type. Gene annotations require a gene qualifier for submission to GenBank: click on Add next to the Properties tab, and type “gene” next to Name: and “Sppu-UZ” next to Value:.

We will also add the name of the allele here (this is optional, but good practice if you know the allele name). Click Add in the Properties tab again, and type “allele” for Name and “Sppu-UZ*03” for Value.

These sequences represent only a fragment of the Sppu-UZ gene, so we need to indicate that the gene annotation represents a partial feature. To do this select the Interval (1->1690) and click Edit. Check Truncated left end and Truncated right end and click OK. Click OK again to go back to the sequence view.

We now need to add the appropriate qualifiers to the CDS and exon annotations. If your sequence contains more than one gene, it is good practice to add a “gene” qualifier to each CDS and exon annotation so that you can easily see which gene they are from. In this example we only have one gene, so it is not strictly necessary, but we will add it anyway. Select the CDS and both exon annotations by holding down the Control (Windows) or Command (Mac) key and clicking on the colored bars for the annotations. Click Edit Annotations and add a gene : Sppu-UZ qualifier under Properties as you did above for the gene annotation. Click OK. This will add this qualifier to all the annotations you have selected.

CDS annotations also require a transl_table qualifier, representing the genetic code used in translation (see NCBI genetic codes for details), a codon_start qualifier, representing the frame of the translation from 1 to 3, and a product qualifier, describing the protein name. Note that you do not need to add the actual protein translation, as this is worked out by GenBank on the basis of the transl_table and codon_start qualifiers. To add these qualifiers, click on the CDS annotation and click Edit Annotations again. Add the following under Properties as you did for the gene qualifier above:

Name: transl_table; Value: 1
Name: codon_start; Value: 3
Name: product; Value: MHC class I antigen

Note that these qualifier names are case-sensitive. Double check that these qualifiers are typed exactly as shown, otherwise they will generate errors during the submission process. See the troubleshooting section at the end of this tutorial for more details on errors.

Click OK.

The Exon annotations each require a number qualifier. Select the exon 2 annotation and click Edit Annotation. Click Add next to the Properties, and add Name = number and Value = 2. Click OK. Add number : 3 to the Exon 3 annotation in the same way. When you have finished adding all the qualifiers to the annotations, click OK, then Save.

In summary, for a protein-coding gene, the required annotations and qualifiers are:

Gene Annotation:

A gene qualifier e.g. gene : Sppu-UZ

CDS Annotation:

A transl_table qualifier e.g. transl_table : 1 (Valid transl_table values are given here)
A codon_start qualifier, e.g. codon_start : 3
A product qualifier, e.g. product : MHC class I antigen

Exon annotations are optional, but if they are present they must include the qualifier “number”.

You can add additional qualifiers in the Properties section of the Edit Annotations window if you wish. A list of valid annotation types and qualifiers is given here.
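The summary above can be expressed as a small lookup table. The sketch below is an illustration only: `REQUIRED_QUALIFIERS` and `missing_qualifiers` are hypothetical names, and the table covers just the three annotation types discussed in this exercise. Matching is deliberately case-sensitive, so a qualifier typed as "Product" is reported as missing.

```python
# Required qualifier names per annotation type, as summarized above.
REQUIRED_QUALIFIERS = {
    "gene": {"gene"},
    "CDS": {"transl_table", "codon_start", "product"},
    "exon": {"number"},
}

def missing_qualifiers(annotation_type, qualifiers):
    """Return the sorted required qualifier names absent from an annotation.

    qualifiers: dict of qualifier name -> value, as typed under Properties.
    Annotation types not in the table are treated as having no requirements.
    """
    required = REQUIRED_QUALIFIERS.get(annotation_type, set())
    return sorted(required - set(qualifiers))
```

For example, a CDS carrying transl_table, codon_start and a capitalized "Product" would come back with `["product"]`, which points at the kind of typo described in the troubleshooting section.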

1c. Filling out the “Submit to GenBank” form

Our sequence is now ready to submit to GenBank. Select the sequence and go to Tools → Submit to GenBank. Type in a Submission name (e.g. Tutorial 1), and check Save a local file (.tar). This will save your submission to your hard drive rather than submitting it to GenBank.

Now click Edit Publisher Details and add your name and address details. You also need to add the title of the journal article that this sequence will appear in under Reference title. If your paper is not yet published, check the “Unpublished” button. It doesn’t matter if you don’t yet know the name of your journal article – an approximate name is OK, as this will be updated when the accession numbers for the sequence are published in an article. For the purposes of this tutorial, type “TEST” next to Reference Title and click OK.

We now need to map the fields that we set up on our document in Exercise 1a onto the GenBank fields. Firstly, enter a project name, such as “tuatara MHC” – this is for your reference purposes. Next, click on the arrows next to Specimen Voucher to bring up a drop-down menu. You will see that the GenBank Submission fields that we added in Exercise 1a appear as options here. Select Specimen Voucher (GenBank Submission).

Select the following to populate the other fields: Organism = Organism; Genetic code = Standard; Sequence ID = Sequence ID (GenBank Submission). For Molecule Type select “Genomic DNA”, and for Genomic Location select “Genomic”. The other fields (Identified by, Collected by, Country, Collection Date etc.) can be left displaying None. We also need to add the Tissue type field that we added to our document in Exercise 1a. Check Include extra fields and click Choose…, then enter Field Name Tissue_type and Field Value Tissue type (GenBank Submission). Click OK.

To add the gene, CDS and exon annotations into our submission, check the Include Features/Annotations box. Finally, make sure the Include Primers box is unchecked, as we are not submitting primers with this sequence. Your Submit to GenBank dialog box should now look like this:

You can now click OK to test the submission, but before you do this double-check that you have chosen Save a local file, not Upload submission, as we do not want to submit the tutorial sequence to GenBank! After you click OK, you will get a discrepancy report. This is given for every submission and does not necessarily mean your submission contains errors. If there are errors that are likely to interfere with the submission you will see a Validation Errors/Warnings window instead, detailing errors that should be attended to before the submission proceeds.

If you wish to see a preview of what your submission will look like in GenBank format, click the GenBank Preview tab above the Discrepancy report.

Click Save Tar File to save the .tar file to your desktop. You have now completed a test submission of a protein coding gene. If you wish to submit real data to GenBank, simply check the Upload submission box rather than Save a local file.
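If you want to see what the saved archive contains before uploading real data, it can be listed with Python's standard tarfile module. This is a sketch only: `list_tar_members` is a hypothetical helper and the path in the usage note is a placeholder. The tutorial does not specify which files Geneious writes into the archive, so the function simply reports whatever is there.

```python
import tarfile

def list_tar_members(path):
    """Return (name, size in bytes) for every entry in a .tar archive."""
    with tarfile.open(path) as archive:
        return [(member.name, member.size) for member in archive.getmembers()]
```

Calling `list_tar_members("Tutorial 1.tar")` (a placeholder file name) returns one tuple per file in the saved submission.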

Exercise 2: Submission of an alignment of non-coding mtDNA sequences

In this exercise we will prepare an alignment of mtDNA sequences to submit to GenBank. In the document table there is an alignment file and six individual sequence files containing the sequences that are aligned. When submitting an alignment, you must have the original sequence files linked to your alignment, otherwise you will get an error message upon submission stating “Sequence x is lacking a reference”. Click on the alignment. Check that each sequence has a blue arrow to the left of it as in the screenshot below. This shows that the sequences in the alignment are linked to their source files.

2a. Formatting Annotations

This alignment contains tRNA sequences and the non-coding D-loop region of the mitochondrial genome. Formatting these annotations is somewhat simpler than formatting protein-coding gene annotations: for tRNA and rRNA genes you only need a “product” qualifier giving the name of the gene. You can add the product qualifiers by doing a batch edit of annotations across the alignment.

The easiest way to bulk-select annotations is via the Annotations Table. Select the Annotations tab at the top of the sequence viewer to bring up the table, and ensure all annotations are displayed (click the Type button and choose “Show All”). Then sort the table by the “Name” column by clicking on the Name column header. Then select all of the tRNA-Phe annotations by holding down the shift key, and click Edit Annotation. Under Properties click Add and enter “product” next to Name, and tRNA-Phe next to Value. Click OK twice to go back to the annotation table.

Do the same thing for the other two tRNA annotations (tRNA-Pro and tRNA-Ser), giving them the appropriate product names. We do not need to add any qualifiers to the D-loop sequence as it is non-coding. Save your alignment and click Yes when asked if you want to apply changes to the original sequences.

2b. Adding GenBank fields to your document

GenBank fields such as Sequence ID and Specimen Voucher should be added to the individual sequence documents rather than the alignment, as they are unique to each sequence. The required fields have already been added to the individual sequence documents for this example. Click on one of the sequences and select the Info tab. For these sequences, the sampling location is given in the Description field, and the ID for the blood sample from which the sequence was isolated is in the Specimen Voucher field. The collection date and organism (Sphenodon punctatus) have also been added. These fields have been added to all the sequences present in the alignment.

Select the alignment and select Tools→Submit to GenBank. Give your submission a name (e.g. Tutorial 2) and ensure “Save a local file” is checked as we do not want to submit the sequences from this tutorial to GenBank. Keep the same Publisher Details as for Exercise 1, with your name and address details entered and “TEST” entered as the Reference Title.

This alignment represents a population study, as it comprises sequences of the same mtDNA region isolated from 6 different individuals of the same species. We can indicate this by choosing Population Study under Submission Type.

Enter a project name, such as “tuatara mtDNA”, then map your document fields to GenBank fields as follows:

- For Specimen voucher, select Specimen Voucher (GenBank Submission)

- For Molecule type select Genomic DNA

- For Genetic Location select Mitochondrion

- For Sequence ID select Name

- For Collection Date select Collection Date (GenBank Submission)*

- For Organism select Organism

- For Genetic code select Vertebrate Mitochondrial

*Note that Collection Date must be a recognized date field, and only these fields will show up in the drop-down options.

We also wish to add Isolation source and Isolate name as fields, so that information on the sampling location and sample name is given along with the sequence.

Map these fields by checking Include extra fields, then clicking Choose. Add the field Isolation Source, mapped to Description, and the field Isolate, mapped to Name. Note that Sequence ID is also mapped to Name. The Sequence ID is only used to identify the sequence during the submission process, so for the ID to appear in the GenBank flat file it must also be mapped to another field, here Isolate.

To add the sequence annotations into our submission, check the Include Features/Annotations box. Make sure Add Gene & CDS features using fields is unchecked, as these sequences do not have protein-coding genes. Include quality scores, Include structured comments and Include Primers should also be unchecked.

Your Submit to GenBank dialog box should now look like this:

If you wish to see a preview of what your submission will look like in GenBank format, click the GenBank Preview tab above the Discrepancy report.

Click Continue and save the .tar file to your desktop. You have now completed a test submission of a mitochondrial DNA alignment. If you wish to submit real data to GenBank, simply check the Upload submission box rather than Save a local file.

Troubleshooting Common Problems

How do I set the release date for my submission?

It's not possible to set this via the Geneious plugin. However, GenBank will email you around 2 working days after your submission with a preview of your files, and you can notify them at that point of the date you want the sequences to be released.

This file cannot be submitted. Tbl2asn produced a submission file with no sequences.
Tbl2asn Output
[NULL_Caption] This copy of tbl2asn is more than a year old. Please download the current version.
[tbl2asn 25.6] SeqID lcl|xxxx is present on multiple Bioseqs in record

The first part of the error (this copy of tbl2asn is more than a year old) can be ignored: it does not prevent submission, and you do not need to download a new version of tbl2asn. The actual error is listed on the last line and indicates that an invalid field has been chosen as the Sequence ID (“xxxx” stands for the invalid field in your error message). You will get this error if two or more of the sequences you are submitting have the same Sequence ID. The Sequence ID must contain a different value for each sequence in your submission, so that each sequence can be identified during the submission process before a unique accession number is assigned. Check that you have chosen an appropriate document field for the Sequence ID in the GenBank submission setup window.

Inconsistent taxnames

This error can occur when the organism name specified in the Submit to GenBank window does not match the organism label in the source annotation on your sequence. This often occurs if you have inadvertently transferred a “Source” annotation from another sequence to your sequence. This source annotation may not be visible if the check box next to this annotation type is not ticked in the Annotations panel. To fix this error, you will either need to delete the source annotation or correct the organism label on the source annotation so that it matches your organism name exactly.

Illegal start codon used

NCBI will automatically generate the amino acid translation for any CDS annotation using the transl_table and codon_start qualifiers on the annotation. This error message occurs when this translation does not begin with a start codon (translated as M, methionine) and you have not specified that the CDS is partial. To correct this, you will need to change the transl_table and/or codon_start qualifier using the Edit Annotations window so that the first codon is a start codon. Alternatively, if this CDS is a partial CDS sequence, open the Edit Annotations window and double click on the annotation interval. In the window that pops up, specify that the 5′ end of the annotation is truncated.

Missing stop codon

This error message occurs when the automatic translation of a CDS annotation does not end with a stop codon and the CDS is not labelled as partial. Check that the frame and translation table for this annotation are correct, and, if not, change these properties. Alternatively, if this CDS is a partial CDS sequence, open the Edit Annotations window and double click on the annotation interval. In the window that pops up, specify that the 3′ end of the annotation is truncated.
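The start codon and stop codon checks described above can be approximated on the nucleotide sequence of a CDS annotation. The sketch below is not how NCBI or Geneious perform validation. It makes several labeled assumptions: only the standard genetic code (transl_table 1) is handled, only ATG is accepted as a start codon, and the function name `cds_end_problems` and its parameters are hypothetical.

```python
# Stop codons of the standard genetic code (transl_table 1) only; assumption.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def cds_end_problems(cds, codon_start=1, partial_5=False, partial_3=False):
    """Report start/stop codon problems for one CDS.

    cds: nucleotide string covered by the CDS annotation, 5' to 3'.
    codon_start: 1, 2 or 3, the base at which the first full codon begins.
    partial_5 / partial_3: True if that end is marked as truncated.
    """
    frame = cds.upper()[codon_start - 1:]
    usable = len(frame) - len(frame) % 3      # drop a trailing partial codon
    first_codon = frame[:3]
    last_codon = frame[usable - 3:usable]
    problems = []
    if not partial_5 and first_codon != "ATG":  # simplification: ATG only
        problems.append("illegal start codon")
    if not partial_3 and last_codon not in STOP_CODONS:
        problems.append("missing stop codon")
    return problems
```

A CDS that is truncated at both ends, like the fragment in Exercise 1, returns an empty list once both partial flags are set, which mirrors the effect of ticking the truncated-end boxes in the interval editor.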

Some mitochondrial genomes contain CDSs that do not include a DNA-encoded stop codon. Instead, the stop codon is created after transcription by polyadenylation: the terminal ‘T’ or ‘TA’ of the transcript combines with the poly-A tail to form a ‘TAA’ stop codon. These non-standard CDSs require an extra property, transl_except, on the CDS annotation to be acceptable for submission. In the Value for this property, pos:XXXX gives the position of the terminal ‘T’ nucleotide of the CDS, and aa:term indicates that the codon is completed to a stop codon by A’s from the poly-A tail. If the CDS ends in ‘TA’, the Value would be (pos:1..2, aa:term), where 1 is the sequence position of the ‘T’ nucleotide and 2 is the position of the ‘A’.
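To avoid typing errors in the transl_except Value, it can be generated from the positions. This sketch reproduces the spelling used in this tutorial, (pos:..., aa:term); note that the INSDC feature table documentation writes the terminator as aa:TERM without a space, and which spellings Geneious accepts is not confirmed here. The function name `transl_except_term` is hypothetical.

```python
def transl_except_term(first_pos, last_pos=None):
    """Build a transl_except Value for an incomplete stop codon.

    first_pos: sequence position of the terminal 'T'.
    last_pos: position of the following 'A' if the CDS ends in 'TA', else None.
    """
    if last_pos is None or last_pos == first_pos:
        return f"(pos:{first_pos}, aa:term)"
    if last_pos - first_pos != 1:
        # An incomplete stop codon spans one base ('T') or two bases ('TA').
        raise ValueError("an incomplete stop codon covers at most 2 bases")
    return f"(pos:{first_pos}..{last_pos}, aa:term)"
```

For the ‘TA’ example in the text, `transl_except_term(1, 2)` gives the Value shown above.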

Missing qualifiers or features

Annotation qualifiers should be written in lowercase letters, otherwise they will not be recognised. For example, if you write the “product” qualifier, which is required for CDS and other amino acid feature (e.g. tRNA) annotations, with a capital “P” (Product) you will get error messages such as:
Missing encoded amino acid qualifier in tRNA feature, or
No protein BioSeq given FEATURE:CDS and
Expected CDS Product absent

“Sequence x is lacking a reference” How do I create a reference?

This error message occurs if you are submitting an alignment to GenBank and one (or more) of the sequences in your alignment is not linked to an original sequence document. This will occur if you have imported the alignment into Geneious, but may also occur if you have edited a sequence in the alignment and chosen not to apply the changes to the original sequence when saving (this will break the link to the original sequence).

Sequences in the alignment that are linked to an original sequence document will have a blue arrow to the left of the sequence name in the alignment. Clicking this link will take you to the linked original sequence document. For sequences in the alignment that are missing a linked original sequence document, it is possible to create a reference sequence without realigning the sequences. To do this, select the sequence name in the alignment (this will select the entire sequence) and click the Extract button. This will create a new sequence document. Any additional information you need to add to the submission (for example collection data) should be added to this new document.