D. erecta Annotation Procedure

Tools used for genome annotation

Download the project package from here.

Typical Eukaryotic Gene Structure:

Procedure of genome annotation:

  1. Identify the likely ortholog in D. mel using blastp on FlyBase.
  2. Use the D. mel database to find the gene model of the ortholog and identify the protein sequence for each exon.
  3. Use BLASTX to locate the exons: search them one by one, find the conserved regions, and note the position and frame of each.
  4. Based on the locations and frames of the conserved regions, as well as other evidence, create a gene model; identify the exact base location (start and stop) of each CDS (coding exon) for each isoform.
  5. Confirm your model using the Gene Model Checker and the genome browser.

More detailed procedure:

  1. You are provided a zip file named "derecta_3Lextended_Jan2008_fosmid9.zip". Unzip this file to obtain a folder named "derecta_3Lextended_Jan2008_fosmid9". This folder has two subfolders, "analysis" and "src", and three files: two project report files and one README file, which describes the contents of the subfolders in more detail.

  2. Go to the subfolder "src" and locate the sequence file named "fosmid9.fasta.masked". This is a plain text file. Use this masked genomic sequence whenever the genomic sequence is needed.
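
If you want to inspect the sequence file outside the browser, the following is a minimal sketch of reading a single-record FASTA file. It is not part of the GEP workflow: the function name `read_fasta` and the toy record are assumptions made for illustration. How masked bases are written in this particular file (lowercase letters or N) is not stated here, so the sketch simply counts both.

```python
import io

def read_fasta(handle):
    """Return (header, sequence) for the first record of a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop when a second record begins
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)

# Toy record standing in for the real file (hypothetical content).
toy = io.StringIO(">toy_fosmid\nACGTNNNNacgt\nGGCC\n")
name, seq = read_fasta(toy)
n_count = sum(1 for base in seq if base in "Nn")
lower_count = sum(1 for base in seq if base.islower())
print(name, len(seq), n_count, lower_count)
```

For the real file, pass an open file handle instead of the toy stream, for example `with open("fosmid9.fasta.masked") as fh: name, seq = read_fasta(fh)`.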

  3. Go to the GEP Genome Browser, select "D. erecta" for the genome and "Jan.2008 (GEP/3L extended)" for the assembly, type "fosmid9" in the "position or search term" box, then click the "submit" button.
  4. In the genome browser, under "Mapping and Sequencing Tracks", set "Base Position" to "full" and click the "refresh" button. Under "Genes and Gene Prediction Tracks", set "Genscan Genes" to "full" and click the "refresh" button. Click one of the genes predicted by Genscan (for instance, the one named fosmid9.6) to obtain the predicted protein sequence on the next page.

    Predicted Gene (fosmid9.6)
  5. Use blastp on the FlyBase website with the "Annotated Protein" database to search the predicted protein obtained in the previous step against the annotated D. mel proteins. Find the best match and determine the name of the gene in D. mel. Adjust the blast parameters when needed.

    BLASTP on Flybase:

    BLASTP result for the predicted gene (fosmid9.6):

  6. Search for this gene in the GEP Gene Record Finder to find the D. mel gene details (exon amino acid sequences, gene structure, etc.).

    In this fosmid9.6 example, the D.mel gene symbol is "Syx7" and the gene record is shown below:

    This gene has two isoforms: Syx7-RA and Syx7-RB. Each isoform has 5 exons (2_307, 3_307, 4_307, 5_307, 6_307). In this particular example, the two isoforms have identical gene structure in the coding region. For some genes, different isoforms have different gene structures, so be aware of which isoform you are working on (by default, the first isoform is selected).
    When you work on your gene, it is also a good idea to take a screenshot of this gene record and put it into a Word document. Then click the "Export Sequences for Selected Isoform to FASTA" tab to get the exon sequences of this isoform.

  7. Using blastx on the NCBI website, search the masked genomic sequence ("fosmid9.fasta.masked") against the amino acid sequence of each exon in each D. mel isoform, with the genomic sequence as the query and each exon as the subject. Make sure you change the expect value to 0.1 and check off the "low complexity region" option. Copy and paste the obtained alignment into the Word document.
    BLASTX search:

    BLASTX search result summary:

    Alignment generated from BLASTX search:

    Clean up the blastx output by removing insignificant matches, and draw a picture of your gene structure based on the gene record from gene record finder and the blastx output. Here is the procedure:
    Assume you have five exons in the isoform of the gene you are working on.

    If "Frame" in blastx output is +1, +2, or +3


    If "Frame" in blastx output is -1, -2, or -3

    The arrow indicates the direction you should read the sequence. In this particular example, the picture for positive frame will be used.
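
The frame reported by blastx follows from the query coordinates of a hit, which can help when you compare hits by hand. The sketch below is an illustration, not a GEP tool: the function name `blastx_frame` and the reverse-strand toy numbers are assumptions. It assumes 1-based coordinates and that a reverse-strand hit lists its query coordinates from high to low, as blastx does.

```python
def blastx_frame(hit_start, hit_end, query_length):
    """Reading frame implied by the query coordinates of a blastx hit.

    hit_start <= hit_end: forward strand, frame is +1, +2 or +3.
    hit_start >  hit_end: reverse strand, frame is -1, -2 or -3,
    counted from the last base of the query sequence.
    """
    if hit_start <= hit_end:
        return (hit_start - 1) % 3 + 1
    return -((query_length - hit_start) % 3 + 1)

# The start codon of the example model begins at base 34975 (forward strand).
# The query length is not used for forward-strand hits, so 0 is a placeholder.
print(blastx_frame(34975, 35098, 0))

# Hypothetical reverse-strand hit on a 100-base query, running from 98 down to 50.
print(blastx_frame(98, 50, 100))
```

Hits that belong to the same forward-strand gene should all have positive frames and appear in increasing coordinate order; hits for a reverse-strand gene should all have negative frames.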

  8. Determine the precise exon boundaries by using signals (ATG, GT, AG, TAA, TAG, TGA), phase, conservation, and frame information in the genome browser.

    If "Frame" in blastx output is +1, +2, or +3

    If "Frame" in blastx output is -1, -2, or -3

    In the genome browser, you will determine the correct boundaries of each exon in the neighborhood of the matched regions given in the blastx alignment. The boundaries are determined by searching for the start codon (ATG), splice sites (GT, AG), stop codons (TAA, TAG, TGA), and matched phases. One specific example to illustrate the whole procedure is given below.
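
As a supplement to reading the signals in the browser, the same checks can be sketched in code for a forward-strand gene. Everything here is a hypothetical illustration: the toy sequence, its coordinates, and the function name `flanking_signals` are assumptions, and the stop codon is kept outside the last CDS, as in the Gene Model Checker output later in this document.

```python
def flanking_signals(genome, cds_start, cds_end):
    """Return the 2 bases before and the 2 bases after a forward-strand CDS.

    Coordinates are 1-based and inclusive, as in the genome browser.
    """
    before = genome[max(0, cds_start - 3):cds_start - 1].upper()
    after = genome[cds_end:cds_end + 2].upper()
    return before, after

# Hypothetical toy sequence: two coding exons separated by one intron.
exon1, intron, exon2, stop = "ATGGCTG", "GTAAGTTTTCAG", "AA", "TGA"
genome = "CC" + exon1 + intron + exon2 + stop + "CC"

cds = [(3, 9), (22, 23)]  # 1-based CDS coordinates in the toy sequence
first_start, last_end = cds[0][0], cds[-1][1]

start_codon = genome[first_start - 1:first_start + 2]  # expected: ATG
stop_codon = genome[last_end:last_end + 3]              # expected: TAA, TAG or TGA
donor = flanking_signals(genome, *cds[0])[1]            # expected: GT
acceptor = flanking_signals(genome, *cds[1])[0]         # expected: AG
coding_length = sum(end - start + 1 for start, end in cds)

print(start_codon, donor, acceptor, stop_codon, coding_length % 3)
```

A coding length that is not a multiple of 3 usually means that one of the chosen splice sites does not match the phase of its neighbor.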

  9. If the gene is on the reverse strand (frame -1, -2, or -3), make sure the arrow at the top left corner of the browser points to the left. When you do the annotation, process the exons from right to left, and read the signals from right to left as well. Apart from this, all the other steps are the same as described above.
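
If you look at the forward-strand text of the sequence file rather than the browser view, the signals of a reverse-strand gene appear reverse-complemented. The short sketch below shows what to look for; the function names are assumptions introduced for illustration only.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a DNA string (A<->T, C<->G, N unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]

def to_reverse_coordinate(position, sequence_length):
    """Convert a 1-based forward-strand position to reverse-complement numbering."""
    return sequence_length - position + 1

# How each signal of a reverse-strand gene reads on the forward strand.
signals = ["ATG", "GT", "AG", "TAA", "TAG", "TGA"]
as_seen_on_forward = {s: reverse_complement(s) for s in signals}
print(as_seen_on_forward)
```

For example, the start codon of a reverse-strand gene reads CAT on the forward strand, and its splice donor GT reads AC.
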
  10. Sometimes you have a gene with one single exon. In this case, you only need to identify start codon (ATG) and stop codon (TAA/TAG/TGA) to determine the boundary of the exon.
    Forward strand:


    Reverse strand:


  11. Sometimes your sequence contains only part of a gene rather than the complete gene. In this case, you may notice that some of the blastx alignments are poor. Annotate only the exons with good matches and ignore the ones with bad matches.
  12. Sometimes you cannot find a significant match when you search the predicted gene against the D. mel protein database. In this case, ignore that predicted gene and move on to a different one. Most likely it is a wrong prediction and not worth further investigation.
  13. After the locations of all the exons of one isoform are determined, enter the gene model into the Gene Model Checker to check its correctness. If the gene model passes the test, three files will be generated (the nucleotide sequence of the gene, the protein sequence of the gene, and the gene annotation) and the gene model can be viewed in the genome browser. Collect all these results and put them into the annotation project report.


    Click the magnifier on the last line of the checklist. A UCSC genome browser feature view will pop up, and the gene you annotated will be shown as "Custom Gene Model" in the genome browser.


    Click the "Download" tab at the top right corner to download three files: a GFF file (a standard gene annotation file), the nucleotide sequence of the gene, and the peptide sequence of the gene. Put these three files into your project report.

    track name="CustomModel" description="Custom Gene Model" color=200,0,0 visibility=2
    fosmid9	GEP	CDS	34975	35098	.	+	0	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	CDS	35163	35232	.	+	2	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	CDS	35515	35659	.	+	1	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	CDS	35741	36157	.	+	0	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	CDS	36213	36305	.	+	0	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	stop_codon	36306	36308	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	exon	34975	35098	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	exon	35163	35232	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	exon	35515	35659	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	exon	35741	36157	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
    fosmid9	GEP	exon	36213	36305	.	+	.	gene_id "Syx7-PA"; transcript_id "Syx7-PA";
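
The phase column (column 8) of the CDS rows above can be recomputed from the coordinates alone, which is a quick way to sanity-check a model before submitting it. This is a sketch under two assumptions: the gene is on the forward strand, and phase is taken to mean the number of bases to skip at the start of a CDS to reach the first complete codon. The function name `expected_phases` is hypothetical.

```python
# (start, end, phase) copied from the CDS rows of the GFF output above.
cds_rows = [
    (34975, 35098, 0),
    (35163, 35232, 2),
    (35515, 35659, 1),
    (35741, 36157, 0),
    (36213, 36305, 0),
]

def expected_phases(intervals):
    """Recompute the phase of each CDS from the coding bases that precede it."""
    phases, total = [], 0
    for start, end in intervals:
        phases.append((3 - total % 3) % 3)
        total += end - start + 1
    return phases, total

phases, coding_length = expected_phases([(s, e) for s, e, _ in cds_rows])
print(phases)                             # [0, 2, 1, 0, 0]
print(coding_length, coding_length // 3)  # 849 283
```

The recomputed phases match the GFF rows, and the 849 coding bases correspond to a 283-residue protein. For a reverse-strand gene the same calculation would start from the CDS with the highest coordinates.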
    

    >Syx7-PA_transcript
    ATGGACTTGCAGCATATGGAGAATGGCCTAAGTGGCGGGGGCGGAGGGGGTGGTCTTAGC
    GAAATAGATTTCCAAAGGCTGGCCCAGATTATAGCCACCAGCATCCAGAAGGTGCAGCAG
    AATGTGTCCACGATGCAGCGCATGGTCAATCAACTAAACACGCCCCAGGATTCCCCGGAG
    CTAAAAAAGCAACTCCACCAAATAATGACCTACACCAACCAGCTAGTGACCGACACAAAC
    AATCAAATCAACGAGGTGGACAAGTGCAAGGAGCGCCATCTGAAGATCCAGCGGGATAGG
    CTCGTGGACGAGTTCACGGCGGCACTGACCGCCTTCCAGGCCGTCCAGCGCAAAACGGCG
    GACATAGAGAAGACGGCGTTGCGGCAGGCGCGCGGAGATAGCTACAACATCGCCCGTCCA
    CCCGGCTCATCGCGTACCGGCAGCTCCAACAGCAGCGCCAGCCAGCAGGACAACAACTCA
    TTCTTTGAGGACAACTTCTTCAATCGCAAATCAAACCAGCAACAACTGCAGACTCAGATG
    CAGGAGCAGGTGGACCTGCAGGCCCTCGAGGAACAAGAGCAGGTCATCCGGGAGCTTGAG
    AACAACATCGTGGGCGTGAACGAGATATACAAAAAGCTGGGCGCCCTGGTCTACGAACAG
    GGACTGACGGTGGACTCCATCGAGTCGCAGGTGGAACAGACTAGCATTTTCGTCTCACAG
    GGCACGGAAAATCTGCGCAAGGCGAGCTCTTACAGGAACAAAGTGCGAAAGAAGAAGCTG
    ATTTTGGTGGGCATCCTGAGCGCCGTGCTGCTGGCCATAATCTTGATACTCGTCTTTCAG
    TTCAAGAAC
    

    >Syx7-PA_peptide
    MDLQHMENGLSGGGGGGGLSEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE
    LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQAVQRKTA
    DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQLQTQM
    QEQVDLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ
    GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN
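
To see how the peptide follows from the transcript, the translation can be reproduced with the standard genetic code. The sketch below is illustrative only: `translate` and `CODON_TABLE` are names introduced here, and only the first 20 codons and the last 3 codons of the transcript above are used, written with spaces between codons for readability.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(nt):
    """Translate a coding sequence in frame 1; unknown codons become X."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))

# First 20 codons of Syx7-PA_transcript.
first_codons = ("ATG GAC TTG CAG CAT ATG GAG AAT GGC CTA "
                "AGT GGC GGG GGC GGA GGG GGT GGT CTT AGC").replace(" ", "")
# Last 3 codons of Syx7-PA_transcript.
last_codons = "TTCAAGAAC"

print(translate(first_codons))
print(translate(last_codons))
```

The two results should agree with the beginning and the end of the peptide sequence above.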
    

    Take the obtained peptide sequence and search it against the D. mel protein database using BLASTP on NCBI (choose the database "Reference protein (refseq_protein)" and enter "7227" in the "Organism" option), then save the best alignment in your project report.

    >ref|NP_730632.1| syntaxin 7, isoform A [Drosophila melanogaster]
     ref|NP_730633.1| syntaxin 7, isoform B [Drosophila melanogaster]
    Length=282
    
     GENE ID: 36173 Syx7 | Syntaxin 7 [Drosophila melanogaster]
    (Over 10 PubMed links)
    
     Score =  489 bits (1260),  Expect = 6e-139, Method: Compositional matrix adjust.
     Identities = 277/283 (98%), Positives = 280/283 (99%), Gaps = 1/283 (0%)
    
    Query  1    MDLQHMENGLSGGGGGGGLSEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE  60
                MDLQHMENGLSGGGGGG  SEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE
    Sbjct  1    MDLQHMENGLSGGGGGGL-SEIDFQRLAQIIATSIQKVQQNVSTMQRMVNQLNTPQDSPE  59
    
    Query  61   LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQAVQRKTA  120
                LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQ+VQRKTA
    Sbjct  60   LKKQLHQIMTYTNQLVTDTNNQINEVDKCKERHLKIQRDRLVDEFTAALTAFQSVQRKTA  119
    
    Query  121  DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQLQTQM  180
                DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQ+QTQM
    Sbjct  120  DIEKTALRQARGDSYNIARPPGSSRTGSSNSSASQQDNNSFFEDNFFNRKSNQQQMQTQM  179
    
    Query  181  QEQVDLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ  240
                +EQ DLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ
    Sbjct  180  EEQADLQALEEQEQVIRELENNIVGVNEIYKKLGALVYEQGLTVDSIESQVEQTSIFVSQ  239
    
    Query  241  GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN  283
                GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN
    Sbjct  240  GTENLRKASSYRNKVRKKKLILVGILSAVLLAIILILVFQFKN  282
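
The summary line of the alignment (identities and gaps) can be recounted from the aligned strings, which is useful when you want to compare several candidate models. The following is a minimal sketch; the function name `alignment_stats` is an assumption, and only the first 24 columns of the alignment above are used as input.

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if "-" in (q, s))
    return identities, gaps, len(query)

# First 24 columns of the Query/Sbjct rows above.
query_part = "MDLQHMENGLS" + "GGGGGGG" + "LSEIDF"
sbjct_part = "MDLQHMENGLS" + "GGGGGG" + "L-SEIDF"
print(alignment_stats(query_part, sbjct_part))

# Percentages as reported by BLAST, from the totals in the summary line.
print(round(100 * 277 / 283), round(100 * 280 / 283), round(100 * 1 / 283))
```

The only differences in these first columns are one mismatch and one gap, where the D. erecta model has an extra residue in the glycine run.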