|What are pseudogenes?
Pseudogenes are genomic DNA sequences similar to coding genes but without coding potential. They are regarded as defunct relatives of functional genes.
|What causes pseudogenes to arise?
There are two accepted processes by which pseudogenes may arise:
- duplication - modifications (mutations, insertions, deletions,
frame shifts) to the DNA sequence of a gene can occur during duplication.
These disablements can result in loss of gene function at the transcription
or translation level (or both), since the sequence no longer results in the
production of a protein. Copies of genes that are disabled in such a manner
are termed non-processed or duplicated pseudogenes.
- retrotransposition - reverse transcription of an mRNA transcript with
subsequent re-integration of the cDNA into the genome. Such copies of genes
are termed processed pseudogenes. These pseudogenes can also accumulate
disablements over the course of evolution.
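The kinds of disablement described above can be illustrated with a small Python sketch. It is not the pipeline used for this resource; it only scans a candidate sequence, assumed to start in the reading frame of its parent gene, for premature stop codons and for a length that is not a multiple of three (a crude hint of a frame shift). The function name `find_disablements` is hypothetical.

```python
# Minimal sketch: flag simple disablements in a candidate coding sequence.
# Assumption: the sequence starts in frame with the parent gene's coding region.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(seq):
    """Return a small report of premature stops and a possible frame disruption."""
    seq = seq.upper()
    n_codons = len(seq) // 3
    premature_stops = []
    for i in range(n_codons):
        codon = seq[3 * i:3 * i + 3]
        # A stop codon anywhere except the final codon is treated as premature.
        if codon in STOP_CODONS and i < n_codons - 1:
            premature_stops.append(i)
    return {
        # A length that is not a multiple of 3 hints at an indel-induced frame shift.
        "length_not_multiple_of_3": len(seq) % 3 != 0,
        "premature_stops": premature_stops,
    }


if __name__ == "__main__":
    print(find_disablements("ATGTAAGGGTGA"))
```

A real analysis would compare the candidate against an alignment to its parent gene rather than rely on the reading frame of the raw sequence.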
|Why are pseudogenes interesting?
In any study of molecular evolution, it is necessary to compare and contrast genes from a variety of organisms to gauge how the organisms have adapted to ensure their survival. Pseudogenes are vitally important since they provide a record of how the genomic DNA has been changed without such evolutionary pressure and can be used as a model for determining the underlying rates of nucleotide substitution, insertion and deletion in the greater genome.
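As a rough illustration of how substitution, insertion and deletion counts might be gathered, the Python sketch below tallies differences column by column in a gapped pairwise alignment of a gene and its pseudogene. The function name `count_differences`, the use of `-` as the gap character, and the example sequences are assumptions made for this sketch; turning the counts into rates would additionally need divergence times and a substitution model.

```python
# Minimal sketch: tally differences between a gene and its pseudogene, given
# two rows of a gapped pairwise alignment of equal length ("-" marks a gap).
def count_differences(gene_aln, pseudo_aln):
    if len(gene_aln) != len(pseudo_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for g, p in zip(gene_aln.upper(), pseudo_aln.upper()):
        if g == "-" and p == "-":
            continue                      # uninformative column
        elif g == "-":
            counts["insertion"] += 1      # base present only in the pseudogene
        elif p == "-":
            counts["deletion"] += 1       # base lost from the pseudogene
        elif g == p:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    aligned = counts["match"] + counts["substitution"]
    # Proportion of aligned (ungapped) sites that differ.
    counts["p_distance"] = counts["substitution"] / aligned if aligned else None
    return counts


if __name__ == "__main__":
    print(count_differences("ATG-CCGTA", "ATGACTG-A"))
```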
In addition, due to their high sequence identity to their functional paralogs, pseudogenes introduce artifacts into multiple steps of next-generation sequencing analyses, such as PCR/target enrichment and read mapping. It is hence necessary to differentiate the signals from each origin to derive the true results. Also, it has recently been found that some pseudogenes are actually functional through their RNA products, which naturally raises the question of whether these findings are sporadic exceptions or reflect a more general mechanism in organisms.
|How can a pseudogene be identified computationally?
Once gene sequences have been identified in the genome, it is possible to
use sequence alignment programs (such as FASTA or BLAST) to detect matching
regions in the nucleotide sequence. These matching regions are potential gene
homologs and are termed pseudogenes if there is some evidence that they arose
through either of the processes described above.
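As a sketch of what handling such matches might look like, the Python snippet below parses BLAST tabular output (the standard 12-column `-outfmt 6` layout) into records and keeps the stronger matches. The function name `parse_blast6` and the E-value and identity cutoffs are illustrative assumptions, not the thresholds used for this resource.

```python
# Minimal sketch: read BLAST tabular hits (-outfmt 6, default 12 columns)
# and keep those passing illustrative, assumed thresholds.
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(lines, max_evalue=1e-10, min_identity=40.0):
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        values = line.split("\t")
        if len(values) < len(BLAST6_FIELDS):
            continue                      # skip malformed rows
        rec = dict(zip(BLAST6_FIELDS, values))
        for key in ("pident", "evalue", "bitscore"):
            rec[key] = float(rec[key])
        for key in ("length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send"):
            rec[key] = int(rec[key])
        # On a nucleotide subject, reversed coordinates indicate the minus strand.
        rec["strand"] = "+" if rec["sstart"] <= rec["send"] else "-"
        if rec["evalue"] <= max_evalue and rec["pident"] >= min_identity:
            hits.append(rec)
    return hits


if __name__ == "__main__":
    example = ["geneA\tchr1\t85.5\t120\t10\t2\t1\t120\t5000\t5359\t1e-40\t150"]
    for hit in parse_blast6(example):
        print(hit["qseqid"], hit["sseqid"], hit["strand"], hit["evalue"])
```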
In these analyses, genes from annotated genomes and protein databases have
first been clustered into paralog families and then used to survey whole
genomes for copies or homologs. For each potential pseudogene (or fragment)
match, a number of steps have been taken to assess its validity as a
pseudogene. These steps include checking for overcounting and repeat elements,
overlap on the genomic DNA with other homologs and cross-referencing with exon
assignments from genome annotations. The resulting pseudogenes or pseudogenic
fragments have then been assigned to the paralog family of the most homologous
gene (or assigned to a singleton gene if the probe gene has no obvious paralog).
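The validation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Hit` record, the `filter_and_assign` function, the half-open coordinate convention and the "keep the best-scoring hit among overlapping ones" rule are all hypothetical choices, and repeat-element masking is omitted.

```python
# Minimal sketch: drop candidate matches that overlap annotated exons, resolve
# matches that overlap each other on the genome, and assign each survivor to
# the paralog family of its best-matching gene.
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str
    start: int    # 0-based, inclusive (assumed convention)
    end: int      # exclusive
    gene: str     # probe gene that produced the match
    score: float


def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end


def filter_and_assign(hits, exons, gene_to_family):
    """exons: dict mapping chromosome -> list of (start, end) intervals."""
    kept = []
    # Visit the best-scoring hits first so they win any overlap.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if any(overlaps(hit.start, hit.end, s, e)
               for s, e in exons.get(hit.chrom, [])):
            continue                      # coincides with an annotated exon
        if any(k.chrom == hit.chrom and
               overlaps(hit.start, hit.end, k.start, k.end) for k in kept):
            continue                      # overcounted: a better hit covers it
        kept.append(hit)
    # A gene without a paralog family is treated as a singleton (its own label).
    return [(h, gene_to_family.get(h.gene, h.gene)) for h in kept]


if __name__ == "__main__":
    hits = [Hit("chr1", 100, 200, "geneA", 50.0),
            Hit("chr1", 150, 250, "geneB", 80.0)]
    print(filter_and_assign(hits, {}, {"geneA": "fam1", "geneB": "fam1"}))
```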
|Relating pseudogenes to known protein structures
In a number of cases, more distant evolutionary and functional relationships
between proteins can only be elucidated through the analysis of the folds that
their structures adopt. While it must not be forgotten that the assignment of
function to a gene is often implied from that of a gene with a homologous
sequence, the added information that protein structures can provide is very
desirable in genome annotation.
In the case of pseudogenes, structural information can give extra
evolutionary clues and facilitate analysis of the scope of folds in the
pseudogene population ("pseudo"-folds) in contrast to those observed for the
genes themselves. Where possible, i.e. where a gene can be matched to a SCOP
domain, assignment of fold to a pseudogene or pseudogenic fragment is based
upon the assignment of the most homologous gene.
Our initial goal was to survey some eukaryotic genomes for pseudogene
sequences and fragments of pseudogene sequences. In addition to this, we have
also quantified "pseudo-fold" usage, amino-acid composition, and
single-nucleotide polymorphisms (SNPs) to help elucidate the relationships
between pseudogene families across these organisms.
More recently, as part of the ENCODE and modENCODE consortia, we have focused on comparative analysis of pseudogene annotation and activity across human and other model organisms such as mouse, worm and fly.
We also study the pseudogenization events in individual genomes.
PI: Dr. Mark Gerstein