Evolution of protein domain repeats
- This is a PLOS Topic Page draft
Protein domain repeats are evolutionarily related units that occur in tandem within a protein. Many proteins are composed of multiple protein domains, functional units of common origin. One variety of domain combination that is particularly frequent among Metazoa is the tandem domain repeat: a stretch of domains from the same family, situated next to each other in a protein. Certain properties characterize these domains. First, they are often quite short, frequently fewer than 50 residues; second, they tend to be highly variable, with only a few residues that are crucial for the functionality of the domain. Structurally, repeat domains are diverse and may form modular structures on their own or form larger filaments where each repeat is dependent on other repeats for its functionality. Their sequences are malleable, both with regard to the repeating unit and in the number of repeats, and they therefore provide flexible binding to many partners.
Many repeat proteins expand through duplications of neighbouring domains, but for some repeat domains low similarity between adjacent domains has been observed. Proposed causes for this pattern include aggregation prevention, as adjacent highly similar domains may aggregate, and constraints imposed on the repeat protein by its binding partners. It is possible, however, that it is simply a propagating pattern initially derived through random drift. Further, while multi-domain proteins evolve by single domain insertions/deletions at the N- or C-termini, repeat domains tend to expand through internal tandem duplications, where several units at a time are duplicated. Additionally, repeat proteins often expand to more than twenty repeating units within one protein, possibly through homologous recombination.
Protein domains are structural, functional and evolutionary building blocks that, within one protein, can form various architectures composed of one or several domains. One subset of such domain architectures is domain repeats, i.e. strings of the same class of domains repeated after one another (tandem repeats), as for instance ankyrin repeats (see figure 1). These domains are often short, such as the ankyrin repeat of ~33 residues, and their sequences are highly varied, with typically only a motif retained. The latter property confounds the automated assignment of repeat domains, since their signatures are often only a small part of the sequence, such as in the case of the common HEAT repeat, which has on average around 13% sequence identity. As more sequences have become available, and since the fraction of protein domain repeats is substantial, several methods to identify repeats have been devised. One commonly used approach is to run HMMER with relaxed E-values for repeating regions. However, de novo methods also exist, such as HHrepID, a method based on HMM-HMM comparisons.
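The definition of a tandem domain repeat given above can be sketched in a few lines of code. The following is a minimal illustration, assuming that a protein's domain architecture is already available as an ordered list of family identifiers; the function name `tandem_runs`, the `min_repeats` threshold and the toy architecture are hypothetical and are not taken from any published tool.

```python
from itertools import groupby

def tandem_runs(architecture, min_repeats=2):
    """Return (family, start_index, count) for each run of consecutive
    domains from the same family that is at least min_repeats long."""
    runs = []
    position = 0
    for family, group in groupby(architecture):
        count = len(list(group))
        if count >= min_repeats:
            runs.append((family, position, count))
        position += count
    return runs

# Hypothetical architecture: one SH3 domain, four ankyrin repeats, one kinase domain.
example = ["SH3", "Ank", "Ank", "Ank", "Ank", "Pkinase"]
print(tandem_runs(example))  # [('Ank', 1, 4)]
```

In practice the input is messier than this toy list, because highly diverged repeat units may be missed by the domain assignment, so a real analysis would probably need to tolerate gaps within a run.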
The study of repeat proteins started with Margaret Dayhoff's observations of internal gene duplications in 1978 and the structurally confirmed observation of tandem domains in acid proteases. Many of these duplications take place within repeat proteins, generating mutations that are often associated with disease. Perhaps the best known case is huntingtin, a protein that contains many HEAT domain repeats preceded by a trinucleotide repeat, the expansion of which causes Huntington's disease. However, while the disease-causing trinucleotide repeats of huntingtin are short, many proteins contain repeats encompassing entire protein domains (protein domain repeats). Such domain repeats were found to be important as structural components of virus proteins, such as the shaft repeat of the adenovirus fiber protein, and an accumulation of ankyrin repeats in poxviruses has been observed. Indeed, structural roles are common for repeat proteins, and some of the best studied specimens play important roles as cytoskeleton crosslinkers, e.g. spectrin.
The tandem repeats of several proteins have been linked to complex diseases such as cancer, for instance the polyglycine repeats of the androgen receptor. Further, leucine-rich repeats (LRRs) may play a role in Parkinson's disease. Clearly, many repeat domains are of medical importance, as exemplified by the immunoglobulin domain, the main structural component of antibodies. In recent years, it has become evident that other protein domain repeats may also be used as protein scaffolds capable of specific protein binding. Indeed, the LRR is the main component of the adaptive immune system of jawless vertebrates and enables plants to adjust to new pathogens. Using alternative repeat domains for specific protein binding may allow optimization of the biophysical properties of protein scaffolds.
The many cellular roles of repeat proteins
There are several different classes of protein domain repeats, ranging from those that fold independently to those that aggregate. Other repeats, forming elongated structures, can only fold in the context of similar repeats, sometimes forming superhelical structures that are stabilized by adjacent repeat domains. A further distinctive characteristic of repeat domains is that, unlike most other domains, they tend to have little direct interaction between sequentially distant residues.
Repeat proteins have diverse functions, but are particularly common in cell cycle regulation, transcriptional regulation, protein transport, protein folding or in structural roles. Although fairly rare in prokaryotes, repeat proteins in these organisms tend to be located on the cell surface, often playing an important role in pathogenesis. In eukaryotes, many repeat proteins help shape cellular structure, such as in the cases of filamin, spectrin and titin. Other domain repeats are important in complex formation through flexible binding surfaces. Ankyrin repeats, for example, mediate many different protein-protein interactions and form a scaffold that evolves continuously toward the binding of new ligands. Indeed, repeat proteins are common among the hubs in protein-protein interaction networks.
Repeat proteins tend to be longer than other proteins and, indeed, constitute some of the proteins that shape the cytoskeleton. One example of this is titin, one of the largest known proteins, with hundreds of immunoglobulin repeats in tandem. Many of the repeat proteins that are important for the function of the cytoskeleton have cellular roles that were most likely cemented at the dawn of the vertebrate lineage, such as filamin, a mechanotransducer important for signalling that exists in three closely related paralogs in all sequenced vertebrates.
|β-propeller||3,521||Diverse||40||Four stranded β-sheet||32||CL0186|
|Helix-turn-helix||2,376||DNA-binding||57||DNA/RNA-binding 3 helical bundle||12||CL0123|
|C2H2 zinc finger||1,652||DNA-binding||23||β-β-α zinc-finger||32||CL0361|
|PAS domain clan||1,451||Signalling||102||Profilin-like||16||CL0183|
|Ankyrin repeat||1,298||Protein-protein interaction||33||α-α||52||PF00023|
|Ig-like fold||1,190||Diverse||83||Immunoglobulin-like β-sandwich||64||CL0159|
|OB fold||938||Oligonucleotide/oligosaccharide binding||71||OB-fold (Barrel)||13||CL0021|
|DNA gyrase C-terminal||928||DNA-binding||50||β-propeller||4||PF03989|
|DNA clamp||894||DNA-binding||120||DNA clamp||4||CL0060|
|S-layer homology domain||605||Cell surface||44||-||43||PF00395|
|Mitochondrial carrier||601||Substrate carriers||95||Mitochondrial carrier||4||PF00153|
|POTRA domain||584||Hypothetical chaperone-like||78||-||8||CL0191|
|Immunoglobulin superfamily||456||Diverse||90||Immunoglobulin-like β-sandwich||69||CL0011|
|EF-hand like superfamily||434||Calcium-binding||30||EF-hand like||10||CL0220|
|PASTA domain||360||Beta-lactam binding||62||Penicillin-binding protein||5||PF03793|
Proteins evolve both through mutations involving one or a few residues and by domain rearrangements. The latter are comparatively well tolerated since, in many cases, protein domains perform modular functions. Repeat proteins have high variability with regard to the number of repeats in the protein. They differ from other proteins in the sense that they tend to expand through internal duplications rather than domain shuffling. A likely scenario is that repeat proteins expand rapidly until a physical/structural limit has been reached and subsequently diverge rapidly, since repeat domains tend to have only weak sequence similarity. One possible explanation for their propensity to expand is that their structures allow expansion and, additionally, may provide novel ligand binding.
Higher eukaryotes in particular are prone to rapid repeat expansion, as is immediately obvious from the abundance of repeats in eukaryotic genomes compared to prokaryotic genomes. The three muscle proteins titin, twitchin and nebulin provide a few extreme examples of repeat expansion. For nebulin, the vertebrate lineage has seen rapid expansion of four proteins (figure 2), where the largest protein is composed of more than two hundred domains.
The immunoglobulin and filamin repeats, which share the β-sandwich fold, exhibit a pattern where roughly every other domain shows high sequence similarity. This pattern is probably the result of a divergence of adjacent domains and subsequent duplications of the resulting pairs. Although such patterns may evolve by chance as the duplication of the diverged domain pair propagates, functional explanations for these dissimilar adjacent domains have been suggested. For instance, a lack of similarity between adjacent domains may serve to prevent aggregation, as suggested by Wright and coworkers. Alternatively, functional constraints set by other proteins in the interaction where the repeat domain is involved may cause this pattern, as in the case of filamin.
Although different domains evolve through internal duplications of different sized cassettes, the most common cassette size is two and the most common location of duplication is in the middle (see figure 3). Intriguingly, the abundant EGF and immunoglobulin domains have enriched exon junctions in domain boundaries, thus suggesting a role for exon shuffling in repeat expansion for these domains. Other domains, such as the eukaryote-specific C2H2 zinc finger domain, show a very different pattern where the numerous repeating domains are contained within one giant exon.
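An internal tandem duplication of a cassette can be pictured as a simple list operation. The sketch below is only an illustration of the concept, assuming that a repeat region is represented as an ordered list of unit labels; the function name `duplicate_cassette`, its parameters and the labels `r1` to `r6` are hypothetical.

```python
def duplicate_cassette(domains, start, size):
    """Return a copy of `domains` in which the cassette
    domains[start:start + size] is duplicated in tandem,
    directly after the original cassette."""
    if size < 1 or start < 0 or start + size > len(domains):
        raise ValueError("cassette must lie within the domain list")
    end = start + size
    cassette = domains[start:end]
    return domains[:end] + cassette + domains[end:]

# Toy repeat region of six units; duplicate a two-unit cassette in the middle.
repeats = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(duplicate_cassette(repeats, start=2, size=2))
# ['r1', 'r2', 'r3', 'r4', 'r3', 'r4', 'r5', 'r6']
```

Immediately after such an event, units that lie one cassette length apart are identical copies, whereas adjacent units are not, which is consistent with the observation that duplicating a diverged pair can produce a pattern in which every other domain is most similar.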
Clearly, internal repeat duplications most frequently involve more than one domain (figure 4). In some cases, repeat proteins contain another level of repeats, namely super-repeats: units that, in an already repetitive sequence, show internal similarity in larger blocks. Expansions such as these are easily detected through inspection of domain similarity matrices, where parallel lines of very similar domains, all seven domains apart, are clearly visible (see figure 5 and figure 4). As figure 6 shows, in nebulin the duplications consistently take place in cassettes of seven units. A similar super-repeat is found in the skeletal muscle isoform of titin.
Although it is not the only mechanism behind protein domain repeat expansions, homologous recombination is likely to be important since any region of homology between two sequences may cause homologous recombination and subsequent tandem duplication. Indeed, larger repeating regions tend to be duplicated. Further, duplications in more malleable regions, such as those that contain domains that have a repeat-forming characteristic, are less likely to have a detrimental effect on protein function. In fact, an increase in the number of repeated domains might not, to any great extent, alter the protein structure and can promote protein stability.
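The visual inspection of a domain similarity matrix described above can be approximated numerically by averaging the similarity of all domain pairs that lie a fixed number of units apart. The following is a toy sketch under stated assumptions, not the procedure used in the cited studies: the matrix is synthetic, similarities are expressed as integer percent identities, and the names `mean_similarity_by_offset` and `toy_matrix` are hypothetical.

```python
def mean_similarity_by_offset(matrix):
    """Average similarity of all domain pairs (i, i + k) for each offset k >= 1."""
    n = len(matrix)
    means = {}
    for k in range(1, n):
        values = [matrix[i][i + k] for i in range(n - k)]
        means[k] = sum(values) / len(values)
    return means

def toy_matrix(n, period, high=80, low=20):
    """Synthetic percent-identity matrix in which domains that are a
    multiple of `period` apart are highly similar (assumed values)."""
    return [[100 if i == j else (high if (i - j) % period == 0 else low)
             for j in range(n)]
            for i in range(n)]

matrix = toy_matrix(n=28, period=7)
means = mean_similarity_by_offset(matrix)

# Report the smallest offset that reaches the highest mean similarity.
top = max(means.values())
best = min(k for k, v in means.items() if v == top)
print(best)  # 7
```

Multiples of the period also score highly in this toy case, which is why the smallest top-scoring offset is reported; with real, noisy similarity data a more careful criterion would likely be needed.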
Repeat proteins and complexity
Higher complexity is associated with larger genomes and an abundance of sequence repeats. Indeed, protein domain repeats are more common in complex organisms. For coding repeats there are likely to be additional constraints affecting their abundance aside from constraints on genome size. In particular, protein domain repeats are generally uncommon in prokaryotes. This may be explained by the lack, in prokaryotes, of the sophisticated protein synthesis machinery, including the endoplasmic reticulum and Golgi, that allows eukaryotes to handle the multi-domain, non-globular folds that characterize repeat proteins.
While nebulin and titin are extreme examples of protein domain repeats, forming enormous proteins that are essential for cellular structure, most vertebrates contain a large number of repeat proteins that have between three and twenty consecutive domains. An estimated 17% of the human proteome consists of protein domain repeats, while the corresponding number is around 5% for prokaryotes, and indeed for unicellular organisms in general. Although the reason for the predominance of protein domain repeats in Metazoa is not clear, it is possible that repeats provide another source of variability in compensation for low generation rates. Further, studies show that protein domain repeats are a comparatively recent development in genome evolution, since ancient proteins tend to have few repeats. It should also be noted that certain Metazoa, such as Drosophila melanogaster, are comparable to unicellular organisms with regard to repeat proteins.
- Marcotte (1999), for an important review of repeat proteins.
- Boersma (2011), for a review of the biotechnological applications of protein repeat engineering.
- Kajava (2012), for a recent review of the structural classes of repeat proteins.
This work was supported by grants to AE from the Swedish Research Council and SSF (the Foundation for Strategic Research). SL was financed by Bioinformatics Infrastructure for Life Sciences. The authors gratefully acknowledge Dr. Åsa K. Björklund for assistance with figure preparation and, further, the Journal of Molecular Biology for permission to reuse images previously published.
- ^ Michaely P, Tomchick DR, Machius M & Anderson RG (2002) Crystal structure of a 12 ANK repeat stack from human ankyrinR EMBO J. 21:6387-96 [PMID: 12456646]
- ^ Kobe B & Deisenhofer J (1995) A structural basis of the interactions between leucine-rich repeats and protein ligands Nature 374:183-6 [PMID: 7877692][DOI]
- ^ Rossmann MG, Moras D & Olsen KW (1974) Chemical and biological evolution of nucleotide-binding protein Nature 250:194-9 [PMID: 4368490]
- ^ Murzin AG, Brenner SE, Hubbard T & Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures J. Mol. Biol. 247:536-40 [PMID: 7723011][DOI]
- ^ Andrade MA, Petosa C, O'Donoghue SI, Müller CW & Bork P (2001) Comparison of ARM and HEAT protein repeats J. Mol. Biol. 309:1-18 [PMID: 11491282][DOI]
- ^ Eddy SR (1998) Profile hidden Markov models Bioinformatics 14:755-63 [PMID: 9918945]
- ^ a b c d e f g h i Björklund AK, Ekman D & Elofsson A (2006) Expansion of protein domain repeats PLoS Comput. Biol. 2:e114 [PMID: 16933986][DOI]
- ^ Biegert A & Söding J (2008) De novo identification of highly diverged protein repeats by probabilistic consistency Bioinformatics 24:807-14 [PMID: 18245125][DOI]
- ^ Barker WC, Ketcham LK & Dayhoff MO (1978) A comprehensive examination of protein sequences for evidence of internal gene duplication J. Mol. Evol. 10:265-81 [PMID: 633380]
- ^ Tang J, James MN, Hsu IN, Jenkins JA & Blundell TL (1978) Structural evidence for gene duplication in the evolution of the acid proteases Nature 271:618-21 [PMID: 24179]
- ^ Andrade MA & Bork P (1995) HEAT repeats in the Huntington's disease protein Nat. Genet. 11:115-6 [PMID: 7550332][DOI]
- ^ (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group Cell 72:971-83 [PMID: 8458085]
- ^ Green NM, Wrigley NG, Russell WC, Martin SR & McLachlan AD (1983) Evidence for a repeating cross-beta sheet structure in the adenovirus fibre EMBO J. 2:1357-65 [PMID: 10872331]
- ^ Bork P (1993) Hundreds of ankyrin-like repeats in functionally diverse proteins: mobile modules that cross phyla horizontally? Proteins 17:363-74 [PMID: 8108379][DOI]
- ^ a b c Apic G, Gough J & Teichmann SA (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes J. Mol. Biol. 310:311-25 [PMID: 11428892][DOI]
- ^ Speicher DW & Marchesi VT (1984) Erythrocyte spectrin is comprised of many homologous triple helical segments Nature 311:177-80 [PMID: 6472478]
- ^ Warfel NA & Newton AC (2012) Pleckstrin homology domain leucine-rich repeat protein phosphatase (PHLPP): a new player in cell signaling J. Biol. Chem. 287:3610-6 [PMID: 22144674][DOI]
- ^ McEwan IJ (2001) Structural and functional alterations in the androgen receptor in spinal bulbar muscular atrophy Biochem. Soc. Trans. 29:222-7 [PMID: 11356158]
- ^ Paisán-Ruíz C, Jain S, Evans EW, Gilks WP, Simón J, van der Brug M, López de Munain A, Aparicio S, Gil AM, Khan N, Johnson J, Martinez JR, Nicholl D, Carrera IM, Pena AS, de Silva R, Lees A, Martí-Massó JF, Pérez-Tur J, Wood NW & Singleton AB (2004) Cloning of the gene containing mutations that cause PARK8-linked Parkinson's disease Neuron 44:595-600 [PMID: 15541308][DOI]
- ^ Zimprich A, Biskup S, Leitner P, Lichtner P, Farrer M, Lincoln S, Kachergus J, Hulihan M, Uitti RJ, Calne DB, Stoessl AJ, Pfeiffer RF, Patenge N, Carbajal IC, Vieregge P, Asmus F, Müller-Myhsok B, Dickson DW, Meitinger T, Strom TM, Wszolek ZK & Gasser T (2004) Mutations in LRRK2 cause autosomal-dominant parkinsonism with pleomorphic pathology Neuron 44:601-7 [PMID: 15541309][DOI]
- ^ a b Boersma YL & Plückthun A (2011) DARPins and other repeat protein scaffolds: advances in engineering and applications Curr. Opin. Biotechnol. 22:849-57 [PMID: 21715155][DOI]
- ^ Yadid I & Tawfik DS (2011) Functional β-propeller lectins by tandem duplications of repetitive units Protein Eng. Des. Sel. 24:185-95 [PMID: 20713410][DOI]
- ^ Pancer Z & Cooper MD (2006) The evolution of adaptive immunity Annu. Rev. Immunol. 24:497-518 [PMID: 16551257][DOI]
- ^ Ellis J, Dodds P & Pryor T (2000) Structure, function and evolution of plant disease resistance genes Curr. Opin. Plant Biol. 3:278-84 [PMID: 10873844]
- ^ a b Kajava AV (2012) Tandem repeats in proteins: from sequence to structure J. Struct. Biol. 179:279-88 [PMID: 21884799][DOI]
- ^ a b c d e f g h Marcotte EM, Pellegrini M, Yeates TO & Eisenberg D (1999) A census of protein repeats J. Mol. Biol. 293:151-60 [PMID: 10512723][DOI]
- ^ Ferreiro DU & Komives EA (2007) The plastic landscape of repeat proteins Proc. Natl. Acad. Sci. U.S.A. 104:7735-6 [PMID: 17483477][DOI]
- ^ Main ER, Jackson SE & Regan L (2003) The folding and design of repeat proteins: reaching a consensus Curr. Opin. Struct. Biol. 13:482-9 [PMID: 12948778]
- ^ D'Andrea LD & Regan L (2003) TPR proteins: the versatile helix Trends Biochem. Sci. 28:655-62 [PMID: 14659697][DOI]
- ^ Deivanayagam CC, Rich RL, Carson M, Owens RT, Danthuluri S, Bice T, Höök M & Narayana SV (2000) Novel fold and assembly of the repetitive B region of the Staphylococcus aureus collagen-binding surface protein Structure 8:67-78 [PMID: 10673425]
- ^ Yeats C, Finn RD & Bateman A (2002) The PASTA domain: a beta-lactam-binding domain Trends Biochem. Sci. 27:438 [PMID: 12217513]
- ^ a b Kohl A, Binz HK, Forrer P, Stumpp MT, Plückthun A & Grütter MG (2003) Designed to be stable: crystal structure of a consensus ankyrin repeat protein Proc. Natl. Acad. Sci. U.S.A. 100:1700-5 [PMID: 12566564][DOI]
- ^ Ekman D, Light S, Björklund AK & Elofsson A (2006) What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol. 7:R45 [PMID: 16780599][DOI]
- ^ Dosztányi Z, Chen J, Dunker AK, Simon I & Tompa P (2006) Disorder and sequence repeats in hub proteins and their implications for network evolution J. Proteome Res. 5:2985-95 [PMID: 17081050][DOI]
- ^ Labeit S, Barlow DP, Gautel M, Gibson T, Holt J, Hsieh CL, Francke U, Leonard K, Wardale J & Whiting A (1990) A regular pattern of two types of 100-residue motif in the sequence of titin Nature 345:273-6 [PMID: 2129545][DOI]
- ^ Ehrlicher AJ, Nakamura F, Hartwig JH, Weitz DA & Stossel TP (2011) Mechanical strain in actin networks regulates FilGAP and integrin binding to filamin A Nature 478:260-3 [PMID: 21926999][DOI]
- ^ a b c Light S, Sagit R, Ithychanda SS, Qin J & Elofsson A (2012) The evolution of filamin-a protein domain repeat perspective J. Struct. Biol. 179:289-98 [PMID: 22414427][DOI]
- ^ a b c Andrade MA, Petosa C, O'Donoghue SI, Müller CW & Bork P (2001) Comparison of ARM and HEAT protein repeats J. Mol. Biol. 309:1-18 [PMID: 11491282][DOI]
- ^ Looman C, Abrink M, Mark C & Hellman L (2002) KRAB zinc finger proteins: an analysis of the molecular mechanisms governing their increase in numbers and complexity during evolution Mol. Biol. Evol. 19:2118-30 [PMID: 12446804]
- ^ a b Kenny PA, Liston EM & Higgins DG (1999) Molecular evolution of immunoglobulin and fibronectin domains in titin and related muscle proteins Gene 232:11-23 [PMID: 10333517]
- ^ Higgins DG, Labeit S, Gautel M & Gibson TJ (1994) The evolution of titin and related giant muscle proteins J. Mol. Evol. 38:395-404 [PMID: 8007007]
- ^ McElhinny AS, Kazmierski ST, Labeit S & Gregorio CC (2003) Nebulin: the nebulous, multifunctional giant of striated muscle Trends Cardiovasc. Med. 13:195-201 [PMID: 12837582]
- ^ Björklund AK, Light S, Sagit R & Elofsson A (2010) Nebulin: a study of protein repeat evolution J. Mol. Biol. 402:38-51 [PMID: 20643138][DOI]
- ^ a b Wright CF, Teichmann SA, Clarke J & Dobson CM (2005) The importance of sequence diversity in the aggregation and evolution of proteins Nature 438:878-81 [PMID: 16341018][DOI]
- ^ Patthy L (1999) Genome evolution and the evolution of exon-shuffling--a review Gene 238:103-14 [PMID: 10570989]
- ^ Koszul R & Fischer G (2009) A prominent role for segmental duplications in modeling eukaryotic genomes C. R. Biol. 332:254-66 [PMID: 19281956][DOI]
- ^ Tripp KW & Barrick D (2004) The tolerance of a modular protein to duplication and deletion of internal repeats J. Mol. Biol. 344:169-78 [PMID: 15504409][DOI]
- ^ Lavorgna G, Patthy L & Boncinelli E (2001) Were protein internal repeats formed by "bricolage"? Trends Genet. 17:120-3 [PMID: 11226587]