Supplemental data for Targeted discovery of novel human exons by comparative genomics
Adam Siepel, Mark Diekhans, Brona Brejova, Laura Langton, Michael Stevens, Charles L.G. Comstock, Colleen Davis, Brent Ewing, Shelly Oommen, Christopher Lau, Hung-Chun Yu, Jianfeng Li, Bruce A. Roe, Phil Green, Daniela S. Gerhard, Gary Temple, David Haussler, and Michael R. Brent. Targeted discovery of novel human exons by comparative genomics. Genome Research, 2007.
NGFs
Novel gene fragments (NGFs) were created by merging overlapping RT-PCR sequences. Each has at least one novel exon.
- NGF coordinates for the hg17 human genome assembly, in UCSC BED format
- Nucleotide sequences in FASTA format
- Predicted peptide sequences in FASTA format
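As an illustration of the merging step described above, the following is a minimal sketch that merges overlapping intervals at the coordinate level. It is not the pipeline used in the paper: the function name `merge_overlapping` and the input layout are assumptions made for this example, and it assumes BED-style zero-based, half-open coordinates and ignores strand and exon structure.

```python
from collections import defaultdict


def merge_overlapping(intervals):
    """Merge overlapping (chrom, start, end) intervals.

    Coordinates are assumed to be BED-style (zero-based, half-open), so two
    intervals that merely touch (end == next start) are NOT merged.
    """
    # Group intervals by chromosome so that only same-chromosome
    # intervals are ever compared.
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        # Sorting by start lets us merge in a single left-to-right pass.
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start < current_end:
                # Overlap: extend the interval being built.
                current_end = max(current_end, end)
            else:
                # Gap: emit the finished interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical coordinates, for illustration only.
example = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
fragments = merge_overlapping(example)
```

In this sketch the first two chr1 intervals collapse into one fragment, while the remaining intervals are kept separate.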
Novel NGFCs
Using overlaps between RT-PCR sequences, known genes, gene predictions, and cDNAs, we clustered the NGFs into 563 clusters (NGFCs), each roughly corresponding to an individual gene (see supplemental data at Genome Research). Of these, 327 clusters are completely novel: they do not overlap the latest curated, cDNA-supported gene sets (RefSeq and Vega). The list of novel clusters is available here in various formats.
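The novelty criterion above can be sketched as a simple overlap test. This is a hedged illustration rather than the procedure used in the paper: the names `overlaps` and `novel_clusters`, the cluster identifiers, and the data layout are all assumptions, and genomic overlap is reduced here to overlap of (chrom, start, end) spans in half-open coordinates.

```python
def overlaps(a, b):
    """Return True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def novel_clusters(clusters, annotated):
    """Return names of clusters with no member overlapping any annotated gene.

    clusters:  dict mapping a cluster name to a list of (chrom, start, end)
    annotated: list of (chrom, start, end) spans from the curated gene sets
    """
    return [
        name
        for name, members in clusters.items()
        if not any(overlaps(m, g) for m in members for g in annotated)
    ]


# Hypothetical clusters and annotation, for illustration only.
clusters = {
    "NGFC1": [("chr1", 100, 200)],
    "NGFC2": [("chr1", 500, 600), ("chr1", 700, 800)],
}
annotated = [("chr1", 150, 250)]
novel = novel_clusters(clusters, annotated)
```

The all-pairs comparison is quadratic, which is adequate for a few hundred clusters; a larger data set would call for sorted intervals or an interval tree.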
GO categories
We have assigned putative Gene Ontology (GO) categories to NGFs based on sequence homology. Peptide sequences generated from the NGFs were searched against a database of vertebrate amino acid sequences using BLASTP, with an E-value threshold of 10^-5. The amino acid database was constructed from the translations of all RefSeq protein-coding genes from the human, mouse, rat, cow, dog, and chicken genomes, as downloaded from the UCSC Genome Browser on November 18, 2006. GO categories were assigned to the RefSeq genes in the amino acid database using Entrez Gene (Maglott et al., 2007). Each NGF was then assigned the GO categories of its highest-scoring BLASTP match that had at least one GO category, and of all other matches scoring within 5% of the best match.
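The assignment rule in the last sentence can be sketched as follows. This is one reading of the rule, not the authors' code: it assumes that "best match" means the highest-scoring match that has at least one GO category, that "within 5%" means a score of at least 95% of that best score, and that hits have already passed the E-value threshold. The function name `assign_go` and all identifiers in the example are hypothetical.

```python
def assign_go(hits, go_annotations, tolerance=0.05):
    """Collect GO categories for one NGF from its BLASTP hits.

    hits:           list of (subject_id, score) pairs, already E-value filtered
    go_annotations: dict mapping subject_id to a set of GO identifiers
    tolerance:      fraction of the best score within which a hit still counts
    """
    # Only hits with at least one GO category can contribute.
    annotated = [(sid, score) for sid, score in hits if go_annotations.get(sid)]
    if not annotated:
        return set()

    # Assumption: the reference score is the best GO-annotated hit.
    best = max(score for _, score in annotated)
    cutoff = best * (1 - tolerance)

    terms = set()
    for sid, score in annotated:
        if score >= cutoff:
            terms |= go_annotations[sid]
    return terms


# Hypothetical hits and annotations, for illustration only.
hits = [("NP_A", 200.0), ("NP_B", 195.0), ("NP_C", 192.0), ("NP_D", 150.0)]
go_annotations = {
    "NP_B": {"GO:0005515"},
    "NP_C": {"GO:0003677"},
    "NP_D": {"GO:0008150"},
}
terms = assign_go(hits, go_annotations)
```

Here NP_A scores highest but has no GO category, so NP_B sets the reference score; NP_C falls within 5% of it and contributes its category, while NP_D does not.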
Domains
Domain matches were identified for the NGF peptides by searching with Reverse PSI-BLAST (RPSBLAST) against the Conserved Domain Database (CDD) and the Pfam database (Marchler-Bauer et al., 2005), with an E-value threshold of 0.01.
Other related information
- More supplemental data at Genome Research
- Press coverage: Cornell Chronicle, ScienceDaily
Contact: Adam Siepel, Siepel Lab, Cold Spring Harbor Laboratory,
Cold Spring Harbor, NY 11724
Last update: 02/13/2008