
Analysis and functional annotation of expressed sequence tags for tef [Eragrostis tef (Zucc) Trotter]

Ju-Kyung Yu1, Mauricio La Rota1, Hugh Edwards1, Hailu Tefera2 and Mark E. Sorrells1

1 Department of Plant Breeding, Cornell University, Ithaca, NY 14853, USA
2 Debre Zeit Agricultural Research Center, P.O. Box 32, Debre Zeit, Ethiopia


Tef [Eragrostis tef (Zucc) Trotter] is the major cereal crop in Ethiopia, constituting about two-thirds of that nation’s diet. Tef is an allotetraploid (2n = 4x = 40) with a genome size of 730 Mbp and belongs to the family Poaceae. A total of 3,230 ESTs (expressed sequence tags) were generated from tef cDNA libraries as a first step towards a sequence database for this organism. Sequences were generated from four cDNA libraries: seedling leaf, seedling root and inflorescence of Eragrostis tef, and seedling leaf of Eragrostis pilosa, a wild relative of Eragrostis tef. Clustering of the sequences among libraries resulted in 535 clusters (comprising 42% of the ESTs) and 1,873 singletons. Of the 2,408 assembled sequences, 25% did not match any existing sequences in public databases. Annotation of the assembled sequences associated 57% of the putative identified tef genes with six major biological roles. Investigation of the translated assembled sequences for conserved protein domains revealed 175 Pfam domains. A total of 170 ESTs (5.3%) containing simple sequence repeats were identified. A collection of 1,425 ESTs from two libraries, seedling leaves of Eragrostis tef and Eragrostis pilosa, was assembled to identify single nucleotide polymorphisms (SNPs), and 37 contigs were found to contain one or more possible SNPs. The EST data generated in this study will be a valuable resource i) to identify native transcribed tef sequences, ii) to develop PCR-based markers and iii) to compare transcribed sequences with those from related cereal crops such as wheat, rice and barley.

Media summary

The first large-scale tef EST database developed by US and Ethiopian scientists will facilitate modern agricultural research toward tef genetic improvement in Ethiopia.

Key Words

tef, Ethiopian cereal crop, expressed sequence tags (EST), molecular marker


Tef [Eragrostis tef (Zucc.) Trotter] belongs to the grass family Poaceae and the genus Eragrostis, which contains about 350 species; its closest relative is E. pilosa. Ethiopia is the center of origin and diversity of this species (Vavilov 1951). Tef is an allotetraploid with a base chromosome number of 10 (2n = 4x = 40) and an estimated genome size of 730 Mbp, with the smallest chromosomes reported among members of the Poaceae (Ayele et al. 1996). It is one of the major staple cereal crops of Ethiopia, contributing two-thirds of the nation’s diet, and is cultivated on about 2 Mha/yr with an average production of less than 1 t/ha. Despite its importance as a cereal crop, very little genetic information is available because its cultivation is confined to Ethiopia.

cDNA sequences known as expressed sequence tags (ESTs) have become the method of choice for the rapid and cost-effective generation of data on the coding regions of genomes in a wide range of organisms. In plants, this method was initially used for the model species Arabidopsis thaliana (Höfte et al. 1993) and rice (Yamamoto and Sasaki 1997). Since then, many more plant EST sequences from a large variety of species have been deposited in dbEST. These have proved useful in a number of ways: i) molecular marker development (Gupta et al. 2003; Somers et al. 2003; Thiel et al. 2003), ii) the construction of genetic and physical maps (Qi et al. 2003; Wu et al. 2002; Yu et al. 2004a), iii) comparative mapping (Sorrells et al. 2003; Yu et al. 2004b), and iv) gene discovery (Wang et al. 2001). The primary goals of this research were i) to develop a database of ESTs, ii) to annotate the functions of the ESTs and iii) to develop PCR-based molecular markers from EST sequences.


Plant material

Tissues were obtained from seedling leaf, seedling root and inflorescence of Eragrostis tef (variety Kaye Murri) and seedling leaf of Eragrostis pilosa (Accession 30-5). Plants were grown in the greenhouse (80 °F) at Cornell University, Ithaca, NY, USA. Leaf and root tissues were harvested from 4-week-old seedlings. Inflorescence tissues (whole spike) were harvested 5 to 15 days after pollination. cDNA libraries were constructed by BIOS&T Inc. (Montreal, Quebec, Canada) from the four tissues provided, and individual clones were sequenced on an ABI 377 sequencer.

Sequence analysis

Edited ESTs were assembled into clusters using the PHRAP software. Assembled sequences were assigned a putative function using BLASTX against the GenBank non-redundant database, and biological roles were annotated using the GO program against the protein sequence databases of Arabidopsis and rice. Protein domain analysis was performed with Pfam. SSRs and SNPs in the EST database were mined with the MISA and SNPF programs, respectively.


From the four libraries, a total of 3,673 5’-end sequences were generated (Table 1). After trimming low-quality sequence (PHRED quality threshold of 10) and vector sequence and removing contaminant sequences, the resulting data set contained 3,230 high-quality ESTs, non-redundant within each library, with a minimum length of 100 bases and an average length of 554 bp. The 3,230 ESTs were assembled into 535 contigs and 1,873 singletons using the PHRAP program (min_match = 50) (Table 1). Of the four libraries, the seedling root library had the greatest percentage (69%) of singletons. The 2,408 assembled sequences were compared to the GenBank non-redundant database using BLASTX (E-value cut-off of ≤ 1E-5) to assign putative function. Of all assembled sequences, 59% could be assigned a putative identity, 16% matched putative proteins, and 25% had no match with existing sequences at an E-value of 1E-5 (Table 1).

Table 1. Overview of tef EST analysis

Category                                                               Number
Total sequences                                                        3,673
Total high-quality sequences                                           3,230
Number of contigs                                                      535
Number of singletons                                                   1,873
Number of assembled sequences                                          2,408
Number of assembled sequences matching ‘putative functional genes’     1,409 (59%)
Number of assembled sequences matching ‘putative proteins’             395 (16%)
Number of assembled sequences matching ‘unclassified / no hits’        604 (25%)
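
The counts reported above are internally consistent, and the check can be sketched in a few lines of Python. All numbers are taken from the text and Table 1; the variable names and the `pct` helper are illustrative only and are not part of the original analysis.

```python
# Counts reported in the text and Table 1.
total_ests = 3230
contigs = 535
singletons = 1873

# Assembled sequences = contigs + singletons (2,408 putative transcripts).
assembled = contigs + singletons

# ESTs that were merged into contigs rather than left as singletons.
ests_in_contigs = total_ests - singletons


def pct(part, whole):
    """Percentage rounded to the nearest whole number (illustrative helper)."""
    return round(100 * part / whole)


# BLASTX outcome classes from Table 1.
blast_classes = {
    "putative functional genes": 1409,
    "putative proteins": 395,
    "unclassified / no hits": 604,
}

summary = {name: pct(n, assembled) for name, n in blast_classes.items()}

print(assembled, pct(ests_in_contigs, total_ests), summary)
```

The three BLASTX classes sum to the number of assembled sequences, and the share of ESTs falling into contigs reproduces the 42% quoted in the abstract.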

Assembled sequences were categorized with respect to functionally annotated genes in Arabidopsis and rice, and grouped into six broad categories of biological roles using the GO program (cut-off of ≤ 1E-5). Almost 57% of the sequences were annotated. The largest fraction of the transcripts with a putative identity coded for proteins involved in protein and amino acid metabolism (23.6%), whereas the functions of 27.5% of the transcripts were not classified (Figure 1). Figure 2 compares the distribution of functional categories among the classified genes from the Arabidopsis genome, the rice genome and the tef assembled sequences. Genes with ‘housekeeping’ roles, such as protein and amino acid metabolism, were over-represented, as would be expected given their presence in more than one of the tissues examined.

Figure 1. The distribution of the tef assembled EST sequences in functional gene categories.

Figure 2. Comparison of the distribution of functional gene categories in the tef assembled EST sequences, A. thaliana genome and rice genome.

The 2,408 assembled sequences were translated using TranSeq, and the amino acid sequences were submitted for a domain search in the Pfam database. A total of 276 protein sequences contained at least one domain, giving 295 domain occurrences that represented 175 different domains (Figure 3). A high percentage (88%) of the sequences did not show significant similarity to any entry in the current public database.

Figure 3. The number of occurrences of the 10 most common Pfam domains in proteins translated from the tef assembled EST sequences.
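
The translation step that precedes a protein domain search can be illustrated with a minimal Python sketch. This is not TranSeq itself: it assumes the standard genetic code and translates all six reading frames, whereas the text does not state which frames were used. The names `translate` and `six_frames` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate DNA codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations of the three forward and three reverse frames."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


print(six_frames("ATGGCCTAA"))
```

Each translated frame would then be searched against the domain database; a stop codon appears as `*` in the output.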

One of the most valuable uses of an EST database is the ability to search for sequence polymorphisms and then design primers for molecular markers. These polymorphisms are typically single nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs). A collection of 1,425 ESTs from seedling leaves of Eragrostis tef (cv. Kaye Murri) and Eragrostis pilosa (the tef mapping parents) was assembled into contigs and aligned to identify SNPs and INDELs. A total of 31 SNPs and 6 INDELs were identified. Assembled sequences were also screened for SSRs; a total of 170 ESTs (5.3%) contained SSRs, with 144 EST-SSRs originating from singletons. Trinucleotide SSRs (49%) were the most abundant, followed by dinucleotide SSRs (42%). The markers identified in the EST sequences were used to design primer sets that can be used to map functional, expressed genes directly.
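
The SSR screening step can be approximated with a short regular-expression search in Python. This is a simplified sketch and not the MISA program used in the study; the minimum repeat counts in `MIN_REPEATS` are assumed values, since the thresholds are not given in the text, and `find_ssrs` is a hypothetical name.

```python
import re

# Assumed minimum number of repeat units per motif length (not from the paper).
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least min_rep - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs built from a shorter unit, e.g. AAA or AGAG,
            # so that each repeat is reported only under its basic motif.
            if any(
                motif == motif[:k] * (size // k)
                for k in range(1, size)
                if size % k == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)


example = "TTGACC" + "AG" * 7 + "CCGTA" + "CTT" * 5 + "GGA"
print(find_ssrs(example))
```

Sequence flanking each hit would then be used for primer design. Compound and imperfect repeats are ignored in this sketch.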


The first large-scale tef EST database has been developed and analyzed: i) a total of 3,230 high-quality tef EST sequences were generated from four cDNA libraries and assembled into 2,408 putative transcripts, ii) of the 2,408 assembled sequences, 59% could be assigned a putative functional identity in public databases, iii) annotation of the assembled sequences associated 57% of the putative identified tef genes with protein/amino acid metabolism, nucleotide metabolism, signal transduction, lipid metabolism, energy metabolism and cell cycle, iv) inspection of the translated assembled sequences for conserved protein domains revealed 276 amino acid sequences with 175 Pfam domains and v) a collection of ESTs for developing PCR-based molecular markers (SSR, SNP and INDEL) was generated. The EST data set developed in this work will provide a fundamental resource for understanding tef genetics and for crop improvement.


Ayele M, Dolezel J, Van Duren M, Brunner H and Zapata-Arias FJ (1996) Flow cytometric analysis of nuclear genome of the Ethiopian cereal tef [Eragrostis tef (Zucc.) Trotter]. Genetica 98, 211-215.

Gupta PK, Rustgi S, Sharma S, Singh R, Kumar N and Balyan HS (2003) Transferable EST-SSR markers for the study of polymorphism and genetic diversity in bread wheat. Mol Genet Genomics 270, 315-323.

Höfte H, Desprez T, Amselem J, Chiapello H, Caboche M, Moisan A, Jourjon M-F, Charpenteau J-L, Berthomieu P, Guerrier D, Giraudat J, Quigley F, Thomas F, Yu D-Y, Mache R, Raynal M, Cooke R, Grellet F, Delseny M, Parmentier Y, Marcillac G, Gigot C, Fleck J, Philipps G, Axelos M, Bardet C, Tremousaygue D and Lescure B (1993) An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNAs from Arabidopsis thaliana. Plant J. 4, 1051-1061.

Qi LL, Echalier B, Friebe B and Gill BS (2003) Molecular characterization of a set of wheat deletion stocks for use in chromosome bin mapping of ESTs. Funct. Integr. Genomics 3, 39-55.

Somers DJ, Kirkpatrick R, Moniwa M and Walsh A (2003) Mining single-nucleotide polymorphisms from hexaploid wheat ESTs. Genome 46, 431-437.

Sorrells ME, La Rota M, Bermudez-Kandianis CE, Greene RA, Kantety R, Munkvold JD, Miftahudin, Mahmoud A, Ma X, Gustafson PJ, Qi LL, Echalier B, Gill BS, Matthews DE, Lazo GR, Chao S, Anderson OD, Edwards H, Linkiewicz AM, Dubcovsky J, Akhunov ED, Dvorak J, Zhang D, Nguyen HT, Peng J, Lapitan NL, Gonzalez-Hernandez JL, Anderson JA, Hossain K, Kalavacharla V, Kianian SF, Choi DW, Close TJ, Dilbirligi M, Gill KS, Steber C, Walker-Simmons MK, McGuire PE and Qualset CO (2003) Comparative DNA sequence analysis of wheat and rice genomes. Genome Res, 13, 1817-1827.

Thiel T, Michalek W, Varshney RK and Graner A (2003) Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor. Appl. Genet. 106, 411-422.

Vavilov NI (1951) The origin, variation, immunity and breeding of cultivated plants, pp. 37-38. Ronald Press, New York.

Wang Z, Taramino G, Yang D, Liu G, Tingey SV, Miao GH and Wang GL (2001) Rice ESTs with disease-resistance gene- or defense-response gene-like sequences mapped to regions containing major resistance genes or QTLs. Mol Genet Genomics 265, 301-310.

Wu J, Maehara T, Shimokawa T, Yamamoto S, Harada C, Takazaki Y, Ono N, Mukai Y, Koike K, Yazaki J, Fujii F, Shomura A, Ando T, Kono I, Waki K, Yamamoto K, Yano M, Matsumoto T and Sasaki T (2002) A comprehensive rice transcript map containing 6591 expressed sequence tag sites. The Plant Cell 14, 525-535.

Yamamoto K and Sasaki T (1997) Large-scale EST sequencing in rice. Plant Mol. Biol. 35, 135-144.

Yu J-K, Dake TM, Singh S, Benscher D, Li W, Gill B and Sorrells ME (2004a) Development and Mapping of EST-Derived Simple Sequence Repeat (SSR) Markers for Hexaploid Wheat. Genome (Accepted).

Yu J-K, La Rota M, Kantety RK and Sorrells ME (2004b) EST Derived SSR Markers for Comparative Mapping in Wheat and Rice. Mol Genet Genomics (Submitted).
