D. erecta Annotation Procedure

Tools used for genome annotation

Download the project package from here.

Typical Eukaryotic Gene Structure:

Procedure of genome annotation:

Identify the likely ortholog in D. mel using blastp on FlyBase.
Use the D. mel database to find the gene model of the ortholog and the protein sequence of each exon.
Use blastx to locate the exons: search them one by one, find the conserved regions, and note their positions and frames.
Based on the locations and frames of the conserved regions, together with other evidence, create a gene model: identify the exact base location (start and stop) of each CDS (coding exon) for each isoform.
Confirm your model using the Gene Model Checker and the genome browser.

More detailed procedure:

You are provided a zip file named "derecta_3Lextended_Jan2008_fosmid9.zip". Unzip this file to obtain a folder named "derecta_3Lextended_Jan2008_fosmid9". This folder has two subfolders, "analysis" and "src", and three files: two project report files and one README file, which describes the contents of the subfolders in more detail.
Go to the subfolder "src" and find the sequence file named "fosmid9.fasta.masked". This is a plain text file. Use this masked genomic sequence whenever the genomic sequence is needed.
Go to the GEP Genome Browser, select "D. erecta" for the genome and "Jan.2008 (GEP/3L extended)" for the assembly, type "fosmid9" in the "position or search term" box, then click the "submit" button.
In the genome browser, under "Mapping and Sequencing Tracks", set "Base Position" to "full" and click the "refresh" button. Under "Genes and Gene Prediction Tracks", set "Genscan Genes" to "full" and click the "refresh" button. Then click one of the genes predicted by Genscan (for instance, the one named fosmid9.6) and obtain the predicted protein sequence on the next page.
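
When a coordinate range seen in the browser needs to be checked against the raw sequence, the masked FASTA file can also be read directly. The following is a minimal Python sketch and not part of the GEP tools. The helper names `read_fasta` and `subseq` are hypothetical, the file path in the comment is an assumption about where the package was unzipped, and positions are assumed to be numbered from 1 as in the browser.

```python
import io

def read_fasta(handle):
    """Return {record_name: sequence} from an open FASTA text handle."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

def subseq(seq, start, end):
    """Extract positions start..end (1-based, inclusive)."""
    return seq[start - 1:end]

# Toy record standing in for the real file. For the project data one would
# pass open("src/fosmid9.fasta.masked") instead (path is an assumption).
demo = io.StringIO(">toy description\nACGTNNNN\nATGGTAAG\n")
toy = read_fasta(demo)["toy"]
print(subseq(toy, 9, 11))  # prints ATG
```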

Predicted Gene (fosmid9.6)
Use blastp on the FlyBase website: choose the "Annotated Protein" database and search the predicted protein obtained in the previous step against the annotated D. mel proteins. Find the best match and determine the gene name in D. mel. Adjust the blast parameters when needed.

BLASTP on Flybase:

BLASTP result for the predicted gene (fosmid9.6):
Search for this gene in the GEP Gene Record Finder to find the D. mel gene details (exon amino acid sequences, gene structure, etc.).

In this fosmid9.6 example, the D.mel gene symbol is "Syx7" and the gene record is shown below:

This gene has two isoforms: Syx7-RA and Syx7-RB. Each isoform has 5 exons (2_307, 3_307, 4_307, 5_307, 6_307). In this particular example, the two isoforms have identical gene structure in the coding region. For some genes, different isoforms have different gene structures, so you should be aware of which isoform you are working on (by default, the first isoform is selected).
When you work on your gene, it is also a good idea to take a screenshot of this gene record and put it into a Word document. Then click the "Export Sequences for Selected Isoform to FASTA" tab to get the exon sequences of this isoform.
Use blastx on the NCBI website to search the masked genomic sequence ("fosmid9.fasta.masked") against the amino acid sequence of each exon in each D. mel isoform, with the genomic sequence as the query and each exon as the subject. Make sure you change the expect value to 0.1 and turn off the "low complexity region" filter. Copy and paste the resulting alignments into the Word document.
BLASTX search:

BLASTX search result summary:

Alignment generated from BLASTX search:

Clean up the blastx output by removing insignificant matches, and draw a picture of your gene structure based on the gene record from the Gene Record Finder and the blastx output. The procedure is as follows:
Assume the isoform you are working on has five exons.
If "Frame" in blastx output is +1, +2, or +3

If "Frame" in blastx output is -1, -2, or -3

The arrow indicates the direction in which you should read the sequence. In this particular example, the picture for the positive frames will be used.
Determine the precise exon boundaries by using signals (ATG, GT, AG, TAA, TAG, TGA), phase, conservation, and frame information in the genome browser.
If "Frame" in blastx output is +1, +2, or +3

If "Frame" in blastx output is -1, -2, or -3

In the genome browser, you determine the correct boundaries of each exon in the neighborhood of the matched regions given in the blastx alignment. The boundaries are determined by searching for the start codon (ATG), splice sites (GT, AG), stop codons (TAA, TAG, TGA), and matching phases. One specific example illustrating the whole procedure is given below.
- First, you work on the start of the first exon (matched to exon Syx7:2_307 in D.mel).
```
>lcl|34025 Syx7:2_307
Length=40

 Score = 77.4 bits (189),  Expect = 2e-18
 Identities = 40/41 (98%), Positives = 40/41 (98%), Gaps = 1/41 (2%)
 Frame = +1

Query  34975  MDLQHMENGLSGGGGGGGLSEIDFQRLAQIIATSIQKVQQN  35097
              MDLQHMENGLSGGGGGG LSEIDFQRLAQIIATSIQKVQQN
Sbjct  1      MDLQHMENGLSGGGGGG-LSEIDFQRLAQIIATSIQKVQQN  40
```
  According to the blastx alignment, the translation frame is +1 and the matched region in the genomic sequence is 34975-35097, which covers the D. mel exon from beginning to end (1-40).
  
  In the genome browser, you want the start of the exon displayed in the middle of the window. The first exon of this gene is expected to begin near the position matched to the beginning of the D. mel exon (Syx7:2_307), that is, around 34975. To center this position, enter a range of +/-10 around it (34965-34985). Use the same strategy to center every other exon boundary.
  
  Before you determine the exon boundary, check a few items in the genome browser. For each gene, this check only needs to be done once.
  - First, after you enter the range (34965-34985) in the "position/search" box and click the "jump" button, you should see four new rows that were not shown when the whole fosmid was displayed. The first row is the DNA sequence of the fosmid in this region. The second, third, and fourth rows are the three possible translations of this DNA sequence in the three reading frames. If you do not see these rows, go to "Mapping and Sequencing Tracks", set "Base Position" to "full", then click the "refresh" button in that track.
  - Second, check the arrow (--->) at the top left corner of the browser.
    - When the frame in the blastx alignment is +1, +2, or +3, this arrow should point to the right. In this case, the second to fourth new rows in the genome browser correspond to frames +1, +2, and +3, respectively.
    - When the frame in the blastx alignment is -1, -2, or -3, this arrow should point to the left. In this case, the three translation rows correspond to frames -1, -2, and -3, respectively.
    To switch the direction of the arrow, click on the arrow.
  In this particular example, the arrow at the top left corner of the browser should point to the right, since the frame is +1 according to the blastx alignment. For the same reason, you look for signals in the first translation row to determine the boundaries of the first exon.
  
  Now you can look for the start codon (ATG) around 34975, since this is the first exon of this gene. In the genome browser, the start codon is shown as a green box with a white "M" in the middle. Remember that you are looking for the signal in the first translation row (frame +1). In that row, the start codon starts at 34975. Therefore, the 5' end boundary of the first exon is 34975.
  
  Then you can identify the 3' end boundary of the first exon. According to the blastx alignment, this boundary is in the neighborhood of 35097. Enter a range (35087-35107) to display this position in the middle of the browser. You are still searching in the first translation row (frame +1), now for a splice signal (GT) around 35097. When multiple GTs are available in the region, preference should be given to the one closest to the matched position (35097). This rule applies to all the other splice sites. Here there is a splice signal (GT) right after 35098. Since the exon boundary is right before the splice site, 35098 is the 3' end boundary of the first exon.
  
  The next step is to determine the phase of the following exon. Since there is one additional nucleotide (G) after the last complete codon at the end of the first exon, the phase of the following exon is 1, and you expect to see two nucleotides before the first complete codon in the second exon.
  
  Therefore, you have the desired information for the first exon:
  
  Start: 34975
  End: 35098
  Phase: 1
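  
  The arithmetic behind these three numbers can be sketched in a few lines of Python. This is an illustration only, assuming a forward-strand gene and 1-based coordinates; the function names are hypothetical and are not part of any GEP tool.

```python
def browser_window(position, flank=10):
    """Range to enter in the browser so that `position` sits in the middle."""
    return position - flank, position + flank

def forward_frame(codon_start):
    """blastx frame (1, 2, or 3) of a forward-strand codon that starts
    at the given 1-based position."""
    return (codon_start - 1) % 3 + 1

def leftover_at_end(exon_end, frame):
    """Nucleotides left after the last complete codon when an exon read
    in `frame` ends at `exon_end` (the 'Phase' recorded above)."""
    return (exon_end - (frame - 1)) % 3

print(browser_window(34975))      # (34965, 34985)
print(forward_frame(34975))       # 1
print(leftover_at_end(35098, 1))  # 1
```

  The same frame formula reproduces the frames reported by blastx for the other exons in this example (35165 gives +2 and 36213 gives +3).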
- You apply a similar procedure to the second exon. The only difference is that you no longer look for a start codon. At both ends, you look for splice sites: 'AG' at the 5' end boundary and 'GT' at the 3' end boundary. You also need to make sure that the phase matches that of its adjacent exons.
  
  For the second exon, you have the following alignment from blastx search:
```
>lcl|34026 Syx7:3_307
Length=22

 Score = 47.8 bits (112),  Expect = 1e-09
 Identities = 22/22 (100%), Positives = 22/22 (100%), Gaps = 0/22 (0%)
 Frame = +2

Query  35165  STMQRMVNQLNTPQDSPELKKQ  35230
              STMQRMVNQLNTPQDSPELKKQ
Sbjct  1      STMQRMVNQLNTPQDSPELKKQ  22
```
  Based on this alignment, you first enter a range (35155-35175) in the genome browser to display the neighborhood of the matched position (35165). Then you look for the splice site signal (AG) around 35165. There is a splice signal (AG) at 35161-2. Since the frame is +2, you read the second translation row, and you see two nucleotides (T and G) before the first complete codon of this exon (TCC in this example). This is exactly what you expect, since the previous exon has one extra nucleotide (G) after its last complete codon (phase 1). These three nucleotides (G from the previous exon and TG from this exon) form a complete codon. Therefore, the 5' end boundary of the second exon is 35163.
  
  Then you can investigate the 3' end of the second exon. Again, you first enter a range (35220-35240) according to the blastx alignment. In the neighborhood of 35230, you find one splice signal (GT) at 35233-4. Since the frame is +2, you have two nucleotides (C and T) left after the last complete codon in this exon. Therefore, the 3' end boundary of the second exon is 35232 and the following exon is in phase 2.
  
  Here you record the desired information for the second exon:
  
  Start: 35163
  End: 35232
  Phase: 2
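  
  The phase check at an exon junction comes down to counting nucleotides on both sides of the intron. The sketch below assumes forward-strand frames and 1-based coordinates; `leading_bases`, `leftover_at_end`, and `junction_ok` are hypothetical helper names used only for illustration.

```python
def leading_bases(exon_start, frame):
    """Nucleotides at the start of an exon before its first complete codon."""
    return (frame - exon_start) % 3

def leftover_at_end(exon_end, frame):
    """Nucleotides at the end of an exon after its last complete codon."""
    return (exon_end - (frame - 1)) % 3

def junction_ok(prev_end, prev_frame, next_start, next_frame):
    """True if the nucleotides split across the intron add up to 0 or 3,
    that is, if they form no codon or exactly one complete codon."""
    total = leftover_at_end(prev_end, prev_frame) + leading_bases(next_start, next_frame)
    return total in (0, 3)

# Second exon of the example: two leading bases (TG), two leftover bases (CT).
print(leading_bases(35163, 2), leftover_at_end(35232, 2))  # 2 2
# Junction between the first and second exon: G + TG forms one codon.
print(junction_ok(35098, 1, 35163, 2))                      # True
```

  Moving the start of the second exon by one base (35164) would make `junction_ok` return False, which is the kind of mismatch the phase check is meant to catch.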
- Third Exon
```
>lcl|34027 Syx7:4_307
Length=48

 Score =  100 bits (249),  Expect = 2e-25
 Identities = 48/48 (100%), Positives = 48/48 (100%), Gaps = 0/48 (0%)
 Frame = +2

Query  35516  HQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQ  35659
              HQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQ
Sbjct  1      HQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQ  48
```
  To determine the boundaries of all the other internal exons, the procedure is exactly the same as the one for the second exon.
  
  Based on this alignment, you first enter a range (35506-35526) in the genome browser to display the neighborhood of the matched position (35516). Then you look for the splice site signal (AG) around 35516. There is a splice signal (AG) at 35513-4. Since the frame is +2, you read the second translation row, and you see one nucleotide (C) before the first complete codon (CAC in this example) in this exon. This is exactly what you expect, since the previous exon has two extra nucleotides (C and T) after its last complete codon (phase 2). These three nucleotides (C and T from the previous exon and C from this exon) form a complete codon. Therefore, the 5' end boundary of this exon is 35515.
  
  Then you can investigate the 3' end of the third exon. Again, you first enter a range (35649-35669) according to the blastx alignment. In the neighborhood of 35659, you find one splice signal (GT) at 35660-1. Since the frame is +2, you can see that this exon ends with a complete codon (CAG). Therefore, the 3' end boundary of this exon is 35659 and the following exon is in phase 0.
  
  Here you record the desired information for the third exon:
  
  Start: 35515
  End: 35659
  Phase: 0
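  
  Because every internal exon is handled the same way, the search for an acceptor site can be sketched as a small scan: list each AG near the matched position, keep the ones whose following base gives the expected number of leading nucleotides, and prefer the closest. This is a hedged illustration on a toy sequence, assuming the forward strand, 1-based coordinates, and canonical AG acceptors only; the function names are hypothetical.

```python
def find_all(seq, motif):
    """1-based start positions of every occurrence of motif in seq."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i + 1)
        i = seq.find(motif, i + 1)
    return hits

def acceptor_candidates(seq, match_start, frame, needed_leading, flank=10):
    """Candidate exon start positions (the base right after an AG) within
    `flank` of match_start, closest first. needed_leading is the number of
    bases expected before the first complete codon (0, 1, or 2)."""
    out = []
    for ag in find_all(seq.upper(), "AG"):
        exon_start = ag + 2
        if abs(exon_start - match_start) > flank:
            continue
        if (frame - exon_start) % 3 == needed_leading:
            out.append(exon_start)
    return sorted(out, key=lambda s: abs(s - match_start))

# Toy sequence with AG at positions 6-7 and 13-14 (not the real fosmid).
toy = "TTTTTAGTTTTTAGCATTTTTT"
print(acceptor_candidates(toy, match_start=16, frame=1, needed_leading=1))  # [15]
```

  For the third exon above, the same test is (2 - 35515) % 3 == 1: starting at 35515 in frame +2 leaves one leading base, which matches the two bases left over from the second exon.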
- Fourth Exon
```
>lcl|34028 Syx7:5_307
Length=139

 Score =  264 bits (674),  Expect = 4e-74
 Identities = 135/139 (97%), Positives = 138/139 (99%), Gaps = 0/139 (0%)
 Frame = +2

Query  35741  AVQRKTADIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQ  35920
              +VQRKTADIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQ
Sbjct  1      SVQRKTADIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQ  60

Query  35921  QQLQTQMQEQVDLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQ  36100
              QQ+QTQM+EQ DLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQ
Sbjct  61     QQMQTQMEEQADLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQ  120

Query  36101  TSIFVSQGTENLRKASSYR  36157
              TSIFVSQGTENLRKASSYR
Sbjct  121    TSIFVSQGTENLRKASSYR  139
```
  Based on this alignment, you first enter a range (35731-35751) in the genome browser to display the neighborhood of the matched position (35741). Then you look for the splice site signal (AG) around 35741. There is a splice signal (AG) at 35739-35740. Since the frame is +2, you read the second translation row, and you see that this exon starts with a complete codon. This is exactly what you expect, since the previous exon ends with a complete codon. Therefore, the 5' end boundary of this exon is 35741.
  
  Then you can investigate the 3' end of this exon. You first enter a range (36147-36167) according to the blastx alignment. In the neighborhood of 36157, you find one splice signal (GT) at 36158-9. Since the frame is +2, you can see that this exon ends with a complete codon (AGG). Therefore, the 3' end boundary of this exon is 36157 and the following exon is in phase 0.
  
  Here you record the desired information for the fourth exon:
  
  Start: 35741
  End: 36157
  Phase: 0
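  
  When an alignment spans several rows, as it does for this exon, the matched range runs from the first number on the first Query row to the last number on the last Query row. The Python sketch below pulls these values out of pasted blastx text. It assumes the plain-text layout shown in this document, the function name `parse_hsp` is hypothetical, and the match lines between Query and Sbjct are omitted from the sample for brevity.

```python
import re

def parse_hsp(text):
    """Extract the frame and the query/subject ranges from one blastx
    text alignment, which may span several Query/Sbjct rows."""
    frame = int(re.search(r"Frame\s*=\s*([+-]\d)", text).group(1))
    q = re.findall(r"^Query\s+(\d+)\s+\S+\s+(\d+)\s*$", text, flags=re.M)
    s = re.findall(r"^Sbjct\s+(\d+)\s+\S+\s+(\d+)\s*$", text, flags=re.M)
    return {
        "frame": frame,
        "query": (int(q[0][0]), int(q[-1][1])),
        "subject": (int(s[0][0]), int(s[-1][1])),
    }

hsp = """\
 Frame = +2

Query  35741  AVQRKTADIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQ  35920
Sbjct  1      SVQRKTADIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQ  60

Query  35921  QQLQTQMQEQVDLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQ  36100
Sbjct  61     QQMQTQMEEQADLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQ  120

Query  36101  TSIFVSQGTENLRKASSYR  36157
Sbjct  121    TSIFVSQGTENLRKASSYR  139
"""
result = parse_hsp(hsp)
print(result)
```

  Here the subject range 1-139 covers the whole D. mel exon (Length=139), which is why the exon boundaries are expected right at the ends of the query range.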
- Fifth Exon
```
>lcl|34029 Syx7:6_307
Length=32

 Score = 59.3 bits (142),  Expect = 5e-13
 Identities = 31/32 (97%), Positives = 31/32 (97%), Gaps = 0/32 (0%)
 Frame = +3

Query  36213  NKVRKKKLILVGILSAVLLAIILILVFQFKN*  36308
              NKVRKKKLILVGILSAVLLAIILILVFQFKN 
Sbjct  1      NKVRKKKLILVGILSAVLLAIILILVFQFKNX  32
```
  Based on this alignment, you first enter a range (36203-36223) in the genome browser to display the neighborhood of the matched position (36213). Then you look for the splice site signal (AG) around 36213. There is a splice signal (AG) at 36211-2. Since the frame is +3, you read the third translation row, and you see that this exon starts with a complete codon. This is exactly what you expect, since the previous exon ends with a complete codon. Therefore, the 5' end boundary of this exon is 36213.
  
  Then you can investigate the 3' end of this exon. You first enter a range (36298-36318) according to the blastx alignment. Since this is the last exon in this gene, in the neighborhood of 36308 you look for a stop codon (TAA/TAG/TGA), and you find one at 36306-8 in the third translation row (since the frame is +3). Therefore, the 3' end boundary of this exon is 36305 and the stop codon is located at 36306-8.
  
  Here you record the desired information for the fifth exon:
  
  Start: 36213
  End: 36305
  Stop Codon: 36306-8
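  
  Before entering the five exons into a checker, a quick sanity check on the recorded coordinates is possible: the exons should be in order without overlap, and the total coding length should be a multiple of three. The following Python sketch uses the coordinates recorded above; the function name `check_model` and the returned field names are hypothetical.

```python
# CDS coordinates recorded above (1-based, inclusive, forward strand).
cds = [(34975, 35098), (35163, 35232), (35515, 35659),
       (35741, 36157), (36213, 36305)]

def check_model(cds):
    """Basic consistency checks on a list of (start, end) coding exons."""
    lengths = [end - start + 1 for start, end in cds]
    ordered = all(a[1] < b[0] for a, b in zip(cds, cds[1:]))
    total = sum(lengths)
    return {
        "lengths": lengths,
        "ordered": ordered,
        "total": total,
        "whole_codons": total % 3 == 0,
        "residues": total // 3,
    }

report = check_model(cds)
print(report["lengths"])   # [124, 70, 145, 417, 93]
print(report["total"], report["residues"])  # 849 283
```

  The 849 coding nucleotides correspond to 283 amino acids, not counting the stop codon at 36306-8.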
If the gene is on the reverse strand (frame -1, -2, or -3), make sure the arrow at the top left corner of the browser points to the left. When you do the annotation, process the exons from right to left, and read the signals from right to left. Apart from this, all steps are the same as described above.
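
If you inspect the forward-strand text of the FASTA file rather than the browser, the signals of a reverse-strand gene appear as their reverse complements. The short Python sketch below lists them; it is an illustration only, and `reverse_complement` is a hypothetical helper name.

```python
# Translation table for complementing DNA, keeping N and case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# How each reverse-strand signal looks when read on the forward strand.
for signal in ("ATG", "GT", "AG", "TAA", "TAG", "TGA"):
    print(signal, "->", reverse_complement(signal))
```

For example, a reverse-strand start codon (ATG) reads as CAT on the forward strand, and a donor site (GT) reads as AC.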
Sometimes you have a gene with a single exon. In this case, you only need to identify the start codon (ATG) and the stop codon (TAA/TAG/TGA) to determine the boundaries of the exon.
Forward strand:

Reverse strand:
Sometimes your sequence contains only part of a gene rather than the complete gene. In that case, some of the blastx alignments may be poor. Annotate only the exons with good matches and ignore the ones with poor matches.
Sometimes you cannot find a significant match when you search the predicted gene against the D. mel protein database. In this case, ignore that predicted gene and move on to a different one. Most likely it is a wrong prediction and not worth further investigation.

After the locations of all the exons of one isoform are determined, enter the gene model into the Gene Model Checker to verify it. If the gene model passes the checks, three files are generated (nucleotide sequence of the gene, protein sequence of the gene, gene annotation) and the gene model can be viewed in the genome browser. Collect all these results and put them into your annotation project report.

Click the magnifier on the last line of the checklist. A UCSC genome browser feature view pops up, and the gene you annotated is shown as "Custom Gene Model" in the genome browser.

Click the "Download" tab at the top right corner to download three files: a GFF file (a standard gene annotation format), the nucleotide sequence of the gene, and the peptide sequence of the gene. Download these three files and put them into your project report.

track name="CustomModel" description="Custom Gene Model" color=200,0,0 visibility=2
fosmid9	GEP	CDS	34975	35098	.	+	0	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	CDS	35163	35232	.	+	2	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	CDS	35515	35659	.	+	1	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	CDS	35741	36157	.	+	0	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	CDS	36213	36305	.	+	0	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	stop_codon	36306	36308	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	exon	34975	35098	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	exon	35163	35232	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	exon	35515	35659	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	exon	35741	36157	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
fosmid9	GEP	exon	36213	36305	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
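
Note that the eighth column of the CDS rows above is not the same number as the "Phase" recorded for each exon during the walkthrough. The walkthrough counts the nucleotides left over at the 3' end of an exon, whereas this column appears to follow the usual GFF/GTF convention of counting the nucleotides at the 5' end of each CDS that precede its first complete codon. The Python sketch below derives both sets of values from the coordinates; the function names are hypothetical and the interpretation of the column is an assumption based on that convention.

```python
cds = [(34975, 35098), (35163, 35232), (35515, 35659),
       (35741, 36157), (36213, 36305)]

def gff_phases(cds):
    """Column-8 values for CDS rows listed in transcription order:
    bases to skip at the start of each CDS to reach a complete codon."""
    phases, done = [], 0
    for start, end in cds:
        phases.append((3 - done % 3) % 3)
        done += end - start + 1
    return phases

def leftover_phases(cds):
    """Walkthrough-style values: bases left after the last complete
    codon at the 3' end of each exon."""
    phases, done = [], 0
    for start, end in cds:
        done += end - start + 1
        phases.append(done % 3)
    return phases

print(gff_phases(cds))       # [0, 2, 1, 0, 0]
print(leftover_phases(cds))  # [1, 2, 0, 0, 0]
```

The first list matches the eighth column of the five CDS rows, and the second matches the phases recorded for the first four exons in the walkthrough.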

>Syx7-PA_transcript
ATGGACTTGCAGCATATGGAGAATGGCCTAAGTGGCGGGGGCGGAGGGGGTGGTCTTAGC
GAAATAGATTTCCAAAGGCTGGCCCAGATTATAGCCACCAGCATCCAGAAGGTGCAGCAG
AATGTGTCCACGATGCAGCGCATGGTCAATCAACTAAACACGCCCCAGGATTCCCCGGAG
CTAAAAAAGCAACTCCACCAAATAATGACCTACACCAACCAGCTAGTGACCGACACAAAC
AATCAAATCAACGAGGTGGACAAGTGCAAGGAGCGCCATCTGAAGATCCAGCGGGATAGG
CTCGTGGACGAGTTCACGGCGGCACTGACCGCCTTCCAGGCCGTCCAGCGCAAAACGGCG
GACATAGAGAAGACGGCGTTGCGGCAGGCGCGCGGAGATAGCTACAACATCGCCCGTCCA
CCCGGCTCATCGCGTACCGGCAGCTCCAACAGCAGCGCCAGCCAGCAGGACAACAACTCA
TTCTTTGAGGACAACTTCTTCAATCGCAAATCAAACCAGCAACAACTGCAGACTCAGATG
CAGGAGCAGGTGGACCTGCAGGCCCTCGAGGAACAAGAGCAGGTCATCCGGGAGCTTGAG
AACAACATCGTGGGCGTGAACGAGATATACAAAAAGCTGGGCGCCCTGGTCTACGAACAG
GGACTGACGGTGGACTCCATCGAGTCGCAGGTGGAACAGACTAGCATTTTCGTCTCACAG
GGCACGGAAAATCTGCGCAAGGCGAGCTCTTACAGGAACAAAGTGCGAAAGAAGAAGCTG
ATTTTGGTGGGCATCCTGAGCGCCGTGCTGCTGGCCATAATCTTGATACTCGTCTTTCAG
TTCAAGAAC

>Syx7-PA_peptide
MDLQHMENGLSGGGGGGGLSEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE
LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQAVQRKTA
DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQLQTQM
QEQVDLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ
GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN
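
The relationship between the two sequences above can be checked by translating the transcript with the standard genetic code. The Python sketch below does this for the first and last lines of the transcript only; it is an illustration, not the method used by the Gene Model Checker, and `translate` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate a coding sequence codon by codon. Codons that are not in
    the table (for example, containing N) become 'X'; a trailing partial
    codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))

first_line = "ATGGACTTGCAGCATATGGAGAATGGCCTAAGTGGCGGGGGCGGAGGGGGTGGTCTTAGC"
last_line = "TTCAAGAAC"
print(translate(first_line))  # MDLQHMENGLSGGGGGGGLS
print(translate(last_line))   # FKN
```

Both results agree with the beginning and the end of the peptide sequence shown above.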

Take the peptide sequence and search it against the D. mel protein database using BLASTP on NCBI (choose the database "Reference protein (refseq_protein)" and enter "7227" in the "Organism" field), then save the best alignment in your project report.

>ref|NP_730632.1| syntaxin 7, isoform A [Drosophila melanogaster]
 ref|NP_730633.1| syntaxin 7, isoform B [Drosophila melanogaster]
Length=282

 GENE ID: 36173 Syx7 | Syntaxin 7 [Drosophila melanogaster]
(Over 10 PubMed links)

 Score =  489 bits (1260),  Expect = 6e-139, Method: Compositional matrix adjust.
 Identities = 277/283 (98%), Positives = 280/283 (99%), Gaps = 1/283 (0%)

Query  1    MDLQHMENGLSGGGGGGGLSEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE  60
            MDLQHMENGLSGGGGGG  SEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE
Sbjct  1    MDLQHMENGLSGGGGGGL-SEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE  59

Query  61   LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQAVQRKTA  120
            LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQ+VQRKTA
Sbjct  60   LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQSVQRKTA  119

Query  121  DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQLQTQM  180
            DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQ+QTQM
Sbjct  120  DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQMQTQM  179

Query  181  QEQVDLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ  240
            +EQ DLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ
Sbjct  180  EEQADLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ  239

Query  241  GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN  283
            GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN
Sbjct  240  GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN  282