(2022) Current Opinion in Genetics & Development. 76, 101966. Abstract
Oligomeric proteins are central to cellular life and the duplication and divergence of their genes is a key driver of evolutionary innovations. The duplication of a gene coding for an oligomeric protein has numerous possible outcomes, which motivates questions on the relationship between structural and functional divergence. How do protein oligomeric states diversify after gene duplication? In the simple case of duplication of a homo-oligomeric protein gene, what properties can influence the fate of descendant paralogs toward forming independent homomers or maintaining their interaction as a complex? Furthermore, how are functional innovations associated with the diversification of oligomeric states? Here, we review recent literature and present specific examples in an attempt to illustrate and answer these questions.
A unified statistical potential reveals that amino acid stickiness governs nonspecific recruitment of client proteins into condensates (2022) Protein Science. 31, 7, e4361. Abstract
Membraneless organelles are cellular compartments that form by liquid–liquid phase separation of one or more components. Other molecules, such as proteins and nucleic acids, will distribute between the cytoplasm and the liquid compartment in accordance with the thermodynamic drive to lower the free energy of the system. The resulting distribution colocalizes molecular species to carry out a diversity of functions. Two factors could drive this partitioning: the difference in solvation between the dilute versus dense phase and intermolecular interactions between the client and scaffold proteins. Here, we develop a set of knowledge‐based potentials that allow for the direct comparison between stickiness, which is dominated by desolvation energy, and pairwise residue contact propensity terms. We use these scales to examine experimental data from two systems: protein cargo dissolving within phase‐separated droplets made from FG repeat proteins of the nuclear pore complex and client proteins dissolving within phase‐separated FUS droplets. These analyses reveal a close agreement between the stickiness of the client proteins and the experimentally determined values of the partition coefficients (R > 0.9), while pairwise residue contact propensities between client and scaffold show weaker correlations. Hence, the stickiness of client proteins is sufficient to explain their differential partitioning within these two phase‐separated systems without taking into account the composition of the condensate. This result implies that selective trafficking of client proteins to distinct membraneless organelles requires recognition elements beyond the client sequence composition.
Empirical potentials for amino acid stickiness and pairwise residue contact propensities are derived. These scales are unique in that they enable direct comparison of desolvation versus contact terms. We find that partitioning of a client protein to a condensate is best explained by amino acid stickiness.
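The comparison described in this abstract, a composition-based stickiness score per client protein correlated against partition coefficients, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `TOY_SCALE` values are invented placeholders rather than the published potentials, and the function names `mean_stickiness` and `pearson_r` are hypothetical.

```python
import math

# Toy per-residue scores, invented for illustration only.
# They are NOT the values of the published stickiness scale.
TOY_SCALE = {"F": 1.0, "L": 0.8, "W": 1.1, "S": -0.2, "G": 0.0,
             "D": -0.8, "E": -0.9, "K": -1.0}


def mean_stickiness(sequence, scale=TOY_SCALE):
    """Average the per-residue scores over residues present in the scale."""
    scores = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not scores:
        raise ValueError("no residue of the sequence is covered by the scale")
    return sum(scores) / len(scores)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

In practice one would pass the list of `mean_stickiness` values for the client proteins and the list of measured partition coefficients to `pearson_r`. Whether to transform the coefficients first (for example, taking logarithms) is a choice this sketch leaves open.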
The modular cell gets connected: Integrative molecular cell biology can be used to interpret networks beyond modules (2022) Science (American Association for the Advancement of Science). 375, 6585, p. 1093-1094 Abstract
To understand living cells and the transfers of mass, energy, and information underlying living processes, spatiotemporal relationships among networks of genes, their product RNA, and protein molecules need to be defined. It is these spatiotemporal relationships that will allow us to grasp how variations in the genome manifest cellular characteristics and how cells interact with their environments. There have been enormous efforts to bridge the abstract structures of biomolecular networks and the spatiotemporal relationships of their component molecules. On page 1143 of this issue, Cho et al. (1) describe “OpenCell,” a data resource and analysis roadmap that takes us closer to this aim. They provide a glimpse of the complex and surprising spatial organization of a living human cell, including the existence of a cellular space in which new functions evolve.
Mutant libraries reveal negative design shielding proteins from supramolecular self-assembly and relocalization in cells (2022) Proceedings of the National Academy of Sciences - PNAS. 119, 5, e210111711. Abstract
Understanding the molecular consequences of mutations in proteins is essential to map genotypes to phenotypes and interpret the increasing wealth of genomic data. While mutations are known to disrupt protein structure and function, their potential to create new structures and localization phenotypes has not yet been mapped to a sequence space. To map this relationship, we employed two homo-oligomeric protein complexes in which the internal symmetry exacerbates the impact of mutations. We mutagenized three surface residues of each complex and monitored the mutations' effect on localization and assembly phenotypes in yeast cells. While surface mutations are classically viewed as benign, our analysis of several hundred mutants revealed they often trigger three main phenotypes in these proteins: nuclear localization, the formation of puncta, and fibers. Strikingly, more than 50% of random mutants induced one of these phenotypes in both complexes. Analyzing the mutants' sequences showed that surface stickiness and net charge are two key physicochemical properties associated with these changes. In one complex, more than 60% of mutants self-assembled into fibers. Such a high frequency is explained by negative design: charged residues shield the complex from self-interacting with copies of itself, and the sole removal of the charges induces its supramolecular self-assembly. A subsequent analysis of several other complexes targeted with alanine mutations suggested that such negative design is common. These results highlight that minimal perturbations in protein surfaces' physicochemical properties can frequently drive assembly and localization changes in a cellular context.
(2022) Frontiers in Molecular Biosciences. 8, 787510. Abstract
The identification of physiologically relevant quaternary structures (QSs) in crystal lattices is challenging. To predict the physiological relevance of a particular QS, QSalign searches for homologous structures in which subunits interact in the same geometry. This approach proved accurate but was limited to structures already present in the Protein Data Bank (PDB). Here, we introduce a webserver (www.QSalign.org) allowing users to submit homo-oligomeric structures of their choice to the QSalign pipeline. Given a user-uploaded structure, the sequence is extracted and used to search homologs based on sequence similarity and PFAM domain architecture. If structural conservation is detected between a homolog and the user-uploaded QS, physiological relevance is inferred. The web server also generates alternative QSs with PISA and processes them the same way as the query submitted to widen the predictions. The result page also shows representative QSs in the protein family of the query, which is informative if no QS conservation was detected or if the protein appears monomeric. These representative QSs can also serve as a starting point for homology modeling.
(2022) Nucleic Acids Research. 50, D1, p. D534-D542 Abstract
The Protein Data Bank in Europe - Knowledge Base (PDBe-KB, https://pdbe-kb.org) is an open collaboration between world-leading specialist data resources contributing functional and biophysical annotations derived from or relevant to the Protein Data Bank (PDB). The goal of PDBe-KB is to place macromolecular structure data in their biological context by developing standardised data exchange formats and integrating functional annotations from the contributing partner resources into a knowledge graph that can provide valuable biological insights. Since we described PDBe-KB in 2019, there have been significant improvements in the variety of available annotation data sets and user functionality. Here, we provide an overview of the consortium, highlighting the addition of annotations such as predicted covalent binders, phosphorylation sites, effects of mutations on the protein structure and energetic local frustration. In addition, we describe a library of reusable web-based visualisation components and introduce new features such as a bulk download data service and a novel superposition service that generates clusters of superposed protein chains weekly for the whole PDB archive.
(2022) microPublication biology. 2022, 6, Abstract
Yeast divides asymmetrically, with an aging mother cell and a 'rejuvenated' daughter cell, and serves as a model organism for studying aging. At the same time, determining the age of yeast cells is technically challenging, requiring complex experimental setups or genetic strategies. We developed a synthetic system composed of two interacting oligomers, which forms condensates in living yeast cells. Here, we report that these synthetic condensates' size correlates with yeast replicative age, making these condensates age reporters for this model organism.
Altered Protein Abundance and Localization Inferred from Sites of Alternative Modification by Ubiquitin and SUMO (2021) Journal of Molecular Biology. 433, 21, 167219. Abstract
Protein modification by ubiquitin or SUMO can alter the function, stability or activity of target proteins. Previous studies have identified thousands of substrates that were modified by ubiquitin or SUMO on the same lysine residue. However, it remains unclear whether such overlap could result from a mere higher solvent accessibility, whether proteins containing those sites are associated with specific functional traits, and whether selectively perturbing their modification by ubiquitin or SUMO could result in different phenotypic outcomes. Here, we mapped reported lysine modification sites across the human proteome and found an enrichment of sites reported to be modified by both ubiquitin and SUMO. Our analysis uncovered thousands of proteins containing such sites, which we term Sites of Alternative Modification (SAMs). Among more than 36,000 sites reported to be modified by SUMO, 51.8% have also been reported to be modified by ubiquitin. SAM-containing proteins are associated with diverse biological functions including cell cycle, DNA damage, and transcriptional regulation. As such, our analysis highlights numerous proteins and pathways as putative targets for further elucidating the crosstalk between ubiquitin and SUMO. Comparing the biological and biochemical properties of SAMs versus other non-overlapping modification sites revealed that these sites were associated with altered cellular localization or abundance of their host proteins. Lastly, using S. cerevisiae as model, we show that mutating the SAM motif in a protein can influence its ubiquitination as well as its localization and abundance.
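The overlap statistic quoted in this abstract (the share of SUMO-modified sites also reported as ubiquitin-modified) amounts to a set intersection over site identifiers. Below is a minimal sketch, assuming each site is represented as a `(protein_accession, lysine_position)` tuple. That representation and the function name `alternative_modification_sites` are assumptions made for illustration, not the authors' pipeline.

```python
def alternative_modification_sites(sumo_sites, ubiquitin_sites):
    """Return the sites reported for both modifications and the fraction
    of SUMO sites that are also ubiquitin sites.

    Each site is a hashable identifier, here assumed to be a
    (protein_accession, lysine_position) tuple.
    """
    sumo = set(sumo_sites)
    ubiquitin = set(ubiquitin_sites)
    shared = sumo & ubiquitin
    # Guard against an empty SUMO set to avoid division by zero.
    fraction = len(shared) / len(sumo) if sumo else 0.0
    return shared, fraction


# Hypothetical toy input; accessions and positions are made up.
sumo_example = [("P1", 10), ("P1", 20), ("P2", 5), ("P3", 7)]
ubiquitin_example = [("P1", 10), ("P2", 5), ("P4", 1)]
shared_sites, shared_fraction = alternative_modification_sites(
    sumo_example, ubiquitin_example
)
```

Using sets deduplicates sites that are reported by several studies, so each lysine is counted once regardless of how many datasets list it.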
PDB-wide identification of physiological hetero-oligomeric assemblies based on conserved quaternary structure geometry (2021) Structure (London). Abstract
An accurate understanding of biomolecular mechanisms and diseases requires information on protein quaternary structure (QS). A critical challenge in inferring QS information from crystallography data is distinguishing biological interfaces from fortuitous crystal-packing contacts. Here, we employ QS conservation across homologs to infer the biological relevance of hetero-oligomers. We compare the structures and compositions of hetero-oligomers, which allow us to annotate 7,810 complexes as physiologically relevant, 1,060 as likely errors, and 1,432 with comparative information on subunit stoichiometry and composition. Excluding immunoglobulins, these annotations encompass over 51% of hetero-oligomers in the PDB. We curate a dataset of 577 hetero-oligomeric complexes to benchmark these annotations, which reveals an accuracy >94%. When homology information is not available, we compare QS across repositories (PDB, PISA, and EPPIC) to derive confidence estimates. This work provides high-quality annotations along with a large benchmark dataset of hetero-assemblies.
Abundance Imparts Evolutionary Constraints of Similar Magnitude on the Buried, Surface, and Disordered Regions of Proteins (2021) Frontiers in Molecular Biosciences. 8, 626729. Abstract
An understanding of the forces shaping protein conservation is key, both for the fundamental knowledge it represents and to allow for optimal use of evolutionary information in practical applications. Sequence conservation is typically examined at one of two levels. The first is a residue-level, where intra-protein differences are analyzed and the second is a protein-level, where inter-protein differences are studied. At a residue level, we know that solvent-accessibility is a prime determinant of conservation. By inverting this logic, we inferred that disordered regions are slightly more solvent-accessible on average than the most exposed surface residues in domains. By integrating abundance information with evolutionary data within and across proteins, we confirmed a previously reported strong surface-core association in the evolution of structured regions, but we found a comparatively weak association between disordered and structured regions. The facts that disordered and structured regions experience different structural constraints and evolve independently provide a unique setup to examine an outstanding question: why is a protein’s abundance the main determinant of its sequence conservation? Indeed, any structural or biophysical property linked to the abundance-conservation relationship should increase the relative conservation of regions concerned with that property (e.g., disordered residues with mis-interactions, domain residues with misfolding). Surprisingly, however, we found the conservation of disordered and structured regions to increase in equal proportion with abundance. This observation implies that either abundance-related constraints are structure-independent, or multiple constraints apply to different regions and perfectly balance each other.
(2021) Cell. 184, 2, p. 301-303 Abstract
Large-scale mapping of protein structures and their different states is crucial for gaining a mechanistic understanding of proteome function and regulation. In this issue of Cell, Cappelletti et al. achieve such a feat and identify hundreds of protein structural changes in response to outside stressors, providing a rich “structuromics” resource characterizing cellular adaptation.
(2020) Cell. 183, 6, p. 1462-1463 Abstract
Defining the principles underlying the organization of biomolecules within cells is a key challenge of current cell biology research. Persson et al. now identify a powerful layer of regulation that allows cells to decouple diffusion from temperature by modulating their intracellular viscosity. This so-called viscoadaptation is mediated through trehalose and glycogen activities, which alter diffusion dynamics and self-assembly propensity inside the cell globally.
(2020) Nature Chemical Biology. 16, 9, p. 939-945 Abstract
Protein self-organization is a hallmark of biological systems. Although the physicochemical principles governing protein-protein interactions have long been known, the principles by which such nanoscale interactions generate diverse phenotypes of mesoscale assemblies, including phase-separated compartments, remain challenging to characterize. To illuminate such principles, we create a system of two proteins designed to interact and form mesh-like assemblies. We devise a new strategy to map high-resolution phase diagrams in living cells, which provide self-assembly signatures of this system. The structural modularity of the two protein components allows straightforward modification of their molecular properties, enabling us to characterize how interaction affinity impacts the phase diagram and material state of the assemblies in vivo. The phase diagrams and their dependence on interaction affinity were captured by theory and simulations, including out-of-equilibrium effects seen in growing cells. Finally, we find that cotranslational protein binding suffices to recruit a messenger RNA to the designed micron-scale structures.
Proteomic analysis reveals the direct recruitment of intrinsically disordered regions to stress granules in S. cerevisiae (2020) Journal of Cell Science. 133, 13, 244657. Abstract
Stress granules (SGs) are stress-induced membraneless condensates that store non-translating mRNA and stalled translation initiation complexes. Although metazoan SGs are dynamic compartments where proteins can rapidly exchange with their surroundings, yeast SGs seem largely static. To gain a better understanding of yeast SGs, we identified proteins that sediment after heat shock using mass spectrometry. Proteins that sediment upon heat shock are biased toward a subset of abundant proteins that are significantly enriched in intrinsically disordered regions (IDRs). Heat-induced SG localization of over 80 proteins was confirmed using microscopy, including 32 proteins not previously known to localize to SGs. We found that several IDRs were sufficient to mediate SG recruitment. Moreover, the dynamic exchange of IDRs can be observed using fluorescence recovery after photobleaching, whereas other components remain immobile. Lastly, we showed that the IDR of the Ubp3 deubiquitinase was critical for yeast SG formation. This work shows that IDRs can be sufficient for SG incorporation, can remain dynamic in vitrified SGs, and can play an important role in cellular compartmentalization upon stress. This article has an associated First Person interview with the first author of the paper.
(2020) Nucleic Acids Research. 48, D1, p. D344-D353 Abstract
The Protein Data Bank in Europe-Knowledge Base (PDBe-KB, https://pdbe-kb.org) is a community-driven, collaborative resource for literature-derived, manually curated and computationally predicted structural and functional annotations of macromolecular structure data, contained in the Protein Data Bank (PDB). The goal of PDBe-KB is two-fold: (i) to increase the visibility and reduce the fragmentation of annotations contributed by specialist data resources, and to make these data more findable, accessible, interoperable and reusable (FAIR) and (ii) to place macromolecular structure data in their biological context, thus facilitating their use by the broader scientific community in fundamental and applied research. Here, we describe the guidelines of this collaborative effort, the current status of contributed data, and the PDBe-KB infrastructure, which includes the data exchange format, the deposition system for added value annotations, the distributable database containing the assembled data, and programmatic access endpoints. We also describe a series of novel web pages, the PDBe-KB aggregated views of structure data, which combine information on macromolecular structures from many PDB entries. We have recently released the first set of pages in this series, which provide an overview of available structural and functional information for a protein of interest, referenced by a UniProtKB accession.
Protein Abundance Biases the Amino Acid Composition of Disordered Regions to Minimize Non-functional Interactions (2019) Journal of Molecular Biology. 431, 24, p. 4978-4992 Abstract
In eukaryotes, disordered regions cover up to 50% of proteomes and mediate fundamental cellular processes. In contrast to globular domains, where about half of the amino acids are buried in the protein interior, disordered regions show higher solvent accessibility, which makes them prone to engage in non-functional interactions. Such interactions are exacerbated by the law of mass action, prompting the question of how they are minimized in abundant proteins. We find that interaction propensity or “stickiness” of disordered regions negatively correlates with their cellular abundance, both in yeast and human. Strikingly, considering yeast proteins where a large fraction of the sequence is disordered, the correlation between stickiness and abundance reaches R = −0.55. Beyond this global amino-acid composition bias, we identify three rules by which amino-acid composition of disordered regions adjusts with high abundance. First, lysines are preferred over arginines, consistent with the latter amino acid being stickier than the former. Second, compensatory effects exist, whereby a sticky region can be tolerated if it is compensated by a distal non-sticky region. Third, such compensation requires a lower average stickiness at the same abundance when compared to a scenario where stickiness is homogeneous throughout the sequence. We validate these rules experimentally, employing them as different strategies to rescue an otherwise sticky protein fragment from aggregation. Our results highlight that non-functional interactions represent a significant constraint in cellular systems and reveal simple rules by which protein sequences adapt to that constraint.
(2019) Nature Communications. 10, 2960. Abstract
Clone collections of modified strains ("libraries") are a major resource for systematic studies with the yeast Saccharomyces cerevisiae. Construction of such libraries is time-consuming, costly and confined to the genetic background of a specific yeast strain. To overcome these limitations, we present CRISPR-Cas12a (Cpf1)-assisted tag library engineering (CASTLING) for multiplexed strain construction. CASTLING uses microarray-synthesized oligonucleotide pools and in vitro recombineering to program the genomic insertion of long DNA constructs via homologous recombination. One simple transformation yields pooled libraries with >90% of correctly tagged clones. Up to several hundred genes can be tagged in a single step and, on a genomic scale, approximately half of all genes are tagged with only ~10-fold oversampling. We report several parameters that affect tagging success and provide a quantitative targeted next-generation sequencing method to analyze such pooled collections. Thus, CASTLING unlocks avenues for increasing throughput in functional genomics and cell biology research.
(2019) Scientific Data. 6, 64. Abstract
Proteins can self-associate with copies of themselves to form symmetric complexes called homomers. Homomers are widespread in all kingdoms of life and allow for unique geometric and functional properties, as reflected in viral capsids or allostery. Once a protein forms a homomer, however, its internal symmetry can compound the effect of point mutations and trigger uncontrolled self-assembly into high-order structures. We identified mutation hot spots for supramolecular assembly, which are predictable by geometry. Here, we present a dataset of descriptors that characterize these hot spot positions both geometrically and chemically, as well as computer scripts allowing the calculation and visualization of these properties for homomers of choice. Since the biological relevance of homomers is not readily available from their X-ray crystallographic structure, we also provide reliability estimates obtained by methods we recently developed. These data have implications in the study of disease-causing mutations, protein evolution and can be exploited in the design of biomaterials.
(2019) Angewandte Chemie - International Edition. 58, 17, p. 5514-5531 Abstract
Mutations and changes in a protein's environment are well known for their potential to induce misfolding and aggregation, including amyloid formation. Alternatively, such perturbations can trigger new interactions that lead to the polymerization of folded proteins. In contrast to aggregation, this process does not require misfolding and, to highlight this difference, we refer to it as agglomeration. This term encompasses the amorphous assembly of folded proteins as well as the polymerization in one, two, or three dimensions. We stress the remarkable potential of symmetric homo-oligomers to agglomerate even by single surface point mutations, and we review the double-edged nature of this potential: how aberrant assemblies resulting from agglomeration can lead to disease, but also how agglomeration can serve in cellular adaptation and be exploited for the rational design of novel biomaterials.
(2019) PLoS Biology. 17, 3, e3000182. Abstract
In experimental evolution, scientists evolve organisms in the lab, typically by challenging them to new environmental conditions. How best to evolve a desired trait? Should the challenge be applied abruptly, gradually, periodically, sporadically? Should one apply chemical mutagenesis, and do strains with high innate mutation rate evolve faster? What are ideal population sizes of evolving populations? There are endless strategies, beyond those that can be exposed by individual labs. We therefore arranged a community challenge, Evolthon, in which students and scientists from different labs were asked to evolve Escherichia coli or Saccharomyces cerevisiae for an abiotic stress, low temperature. About 30 participants from around the world explored diverse environmental and genetic regimes of evolution. After a period of evolution in each lab, all strains of each species were competed with one another. In yeast, the most successful strategies were those that used mating, underscoring the importance of sex in evolution. In bacteria, the fittest strain used a strategy based on exploration of different mutation rates. Different strategies displayed variable levels of performance and stability across additional challenges and conditions. This study therefore uncovers principles of effective experimental evolutionary regimens and might prove useful also for biotechnological developments of new strains and for understanding natural strategies in evolutionary arms races between species. Evolthon constitutes a model for community-based scientific exploration that encourages creativity and cooperation.
(2018) Nucleic Acids Research. 47, D1, p. D1245-D1249 Abstract
The ability to measure the abundance and visualize the localization of proteins across the yeast proteome has stimulated hypotheses on gene function and fueled discoveries. While the classic C' tagged GFP yeast library has been the only resource for over a decade, the recent development of the SWAT technology has led to the creation of multiple novel yeast libraries where new-generation fluorescent reporters are fused at the N' and C' of open reading frames. Efficient access to these data requires a user interface to visualize and compare protein abundance, localization and co-localization across cells, strains, and libraries. YeastRGB (www.yeastRGB.org) was designed to address such a need, through a user-friendly interface that maximizes informative content. It employs a compact display where cells are cropped and tiled together into a 'cell-grid.' This representation enables viewing dozens of cells for a particular strain within a display unit, and up to 30 display units can be arrayed on a standard high-definition screen. Additionally, the display unit allows users to control zoom-level and overlay of images acquired using different color channels. Thus, YeastRGB makes comparing abundance and localization efficient, across thousands of cells from different strains and libraries.
(2018) Nature Methods. 15, 8, p. 598-600 Abstract
Here we describe a C-SWAT library for high-throughput tagging of Saccharomyces cerevisiae open reading frames (ORFs). In 5,661 strains, we inserted an acceptor module after each ORF that can be efficiently replaced with tags or regulatory elements. We validated the library with targeted sequencing and tagged the proteome with bright fluorescent proteins to quantify the effect of heterologous transcription terminators on protein expression and to localize previously undetected proteins.
(2018) Nature Methods. 15, 8, p. 617-622 Abstract
Yeast libraries revolutionized the systematic study of cell biology. To extensively increase the number of such libraries, we used our previously devised SWAp-Tag (SWAT) approach to construct a genome-wide library of ~5,500 strains carrying the SWAT NOP1promoter-GFP module at the N terminus of proteins. In addition, we created six diverse libraries that restored the native regulation, created an overexpression library with a Cherry tag, or enabled protein complementation assays from two fragments of an enzyme or fluorophore. We developed methods utilizing these SWAT collections to systematically characterize the yeast proteome for protein abundance, localization, topology, and interactions.
(2018) Protein Complex Assembly. p. 357-375 (Methods in Molecular Biology). Abstract
A precise knowledge of the quaternary structure of proteins is essential to illuminate both their function and their evolution. The major part of our knowledge on quaternary structure is inferred from X-ray crystallography data, but this inference process is hard and error-prone. The difficulty lies in discriminating fortuitous protein contacts, which make up the lattice of protein crystals, from biological protein contacts that exist in the native cellular environment. Here, we review methods devised to discriminate between both types of contacts and describe resources for downloading protein quaternary structure information and identifying high-confidence quaternary structures. The use of high-confidence datasets of quaternary structures will be critical for the analysis of structural, functional, and evolutionary properties of proteins.
(2018) Nature Methods. 15, 1, p. 67-72 Abstract
Protein structures are key to understanding biomolecular mechanisms and diseases, yet their interpretation is hampered by limited knowledge of their biologically relevant quaternary structure (QS). A critical challenge in inferring QS information from crystallographic data is distinguishing biological interfaces from fortuitous crystal-packing contacts. Here, we tackled this problem by developing strategies for aligning and comparing QS states across both homologs and data repositories. QS conservation across homologs proved remarkably strong at predicting biological relevance and is implemented in two methods, QSalign and anti-QSalign, for annotating homo-oligomers and monomers, respectively. QS conservation across repositories is implemented in QSbio (http://www.QSbio.org), which approaches the accuracy of manual curation and allowed us to predict >100,000 QS states across the Protein Data Bank. Based on this high-quality data set, we analyzed pairs of structurally conserved interfaces, and this analysis revealed a striking plasticity whereby evolutionary distant interfaces maintain similar interaction geometries through widely divergent chemical properties.
(2017) Nature. 548, 7666, p. 244-247 Abstract
The self-association of proteins into symmetric complexes is ubiquitous in all kingdoms of life(1-6). Symmetric complexes possess unique geometric and functional properties, but their internal symmetry can pose a risk. In sickle-cell disease, the symmetry of haemoglobin exacerbates the effect of a mutation, triggering assembly into harmful fibrils(7). Here we examine the universality of this mechanism and its relation to protein structure geometry. We introduced point mutations solely designed to increase surface hydrophobicity among 12 distinct symmetric complexes from Escherichia coli. Notably, all responded by forming supramolecular assemblies in vitro, as well as in vivo upon heterologous expression in Saccharomyces cerevisiae. Remarkably, in four cases, micrometre-long fibrils formed in vivo in response to a single point mutation. Biophysical measurements and electron microscopy revealed that mutants self-assembled in their folded states and so were not amyloid-like. Structural examination of 73 mutants identified supramolecular assembly hot spots predictable by geometry. A subsequent structural analysis of 7,471 symmetric complexes showed that geometric hot spots were buffered chemically by hydrophilic residues, suggesting a mechanism preventing mis-assembly of these regions. Thus, point mutations can frequently trigger folded proteins to self-assemble into higher-order structures. This potential is counterbalanced by negative selection and can be exploited to design nanomaterials in living cells.
(2017) PLoS Computational Biology. 13, 4, e1005499. Abstract
High-throughput in vitro methods have been extensively applied to identify linear information that encodes peptide recognition. However, these methods are limited in number of peptides, sequence variation, and length of peptides that can be explored, and often produce solutions that are not found in the cell. Despite the large number of methods developed to address these issues, an exhaustive search of the linear information encoding protein-peptide recognition has so far been physically unfeasible. Here, we describe a strategy, called DALEL, for the exhaustive search of linear sequence information encoded in proteins that bind to a common partner. We applied DALEL to explore binding specificity of SH3 domains in the budding yeast Saccharomyces cerevisiae. Using only the polypeptide sequences of SH3 domain binding proteins, we succeeded in identifying the majority of known SH3 binding sites previously discovered either in vitro or in vivo. Moreover, we discovered a number of sites with both non-canonical sequences and distinct properties that may serve ancillary roles in peptide recognition. We compared DALEL to a variety of state-of-the-art algorithms in the blind identification of known binding sites of the human Grb2 SH3 domain. We also benchmarked DALEL on curated biological motifs derived from the ELM database to evaluate the effect of increasing/decreasing the enrichment of the motifs. Our strategy can be applied in conjunction with experimental data of proteins interacting with a common partner to identify binding sites among them. Yet, our strategy can also be applied to any group of proteins of interest to identify enriched linear motifs or to exhaustively explore the space of linear information encoded in a polypeptide sequence. Finally, we have developed a webserver located at http://michnick.bcm.umontreal.ca/dalel, offering a user-friendly interface and providing different scenarios for using DALEL.
Protein-fragment complementation assays for large-scale analysis, functional dissection, and spatiotemporal dynamic studies of protein-protein interactions in living cells(2016) Cold Spring Harbor Protocols. 2016, 11, p. 917-919 Abstract
Protein-fragment complementation assays (PCAs) comprise a family of assays that can be used to study protein-protein interactions (PPIs), conformation changes, and protein complex dimensions. We developed PCAs to provide simple and direct methods for the study of PPIs in any living cell, subcellular compartments or membranes, multicellular organisms, or in vitro. Because they are complete assays, requiring no cell-specific components other than reporter fragments, they can be applied in any context. PCAs provide a general strategy for the detection of proteins expressed at endogenous levels within appropriate subcellular compartments and with normal posttranslational modifications, in virtually any cell type or organism under any conditions. Here we introduce a number of applications of PCAs in budding yeast, Saccharomyces cerevisiae. These applications represent the full range of PPI characteristics that might be studied, from simple detection on a large scale to visualization of spatiotemporal dynamics.
The dihydrofolate reductase protein-fragment complementation assay: A survival-selection assay for large-scale analysis of protein-protein interactions(2016) Cold Spring Harbor Protocols. 2016, 11, p. 963-971 Abstract
Protein-fragment complementation assays (PCAs) can be used to study protein-protein interactions (PPIs) in any living cell, in vivo or in vitro, in any subcellular compartment or membranes. Here, we present a detailed protocol for performing and analyzing a high-throughput PCA screening to study PPIs in yeast, using dihydrofolate reductase (DHFR) as the reporter protein. The DHFR PCA is a simple survival-selection assay in which Saccharomyces cerevisiae DHFR (scDHFR) is inhibited by methotrexate, thus preventing nucleotide synthesis and causing arrest of cell division. Complementation of cells with a methotrexate-insensitive murine DHFR restores nucleotide synthesis, allowing cell proliferation. The methotrexate-resistant DHFR has two mutations (L22F and F31S) and is 10,000 times less sensitive to methotrexate than wild-type scDHFR, but retains full catalytic activity. The DHFR PCA is sensitive enough for PPIs to be detected for open reading frame (ORF)-PCA fragments expressed off of their endogenous promoters.
Evolution of domain-peptide interactions to coadapt specificity and affinity to functional diversity(2016) Proceedings of the National Academy of Sciences of the United States of America. 113, 27, p. E3862-E3871 Abstract
Evolution of complexity in eukaryotic proteomes has arisen, in part, through emergence of modular independently folded domains mediating protein interactions via binding to short linear peptides in proteins. Over 30 years, structural properties and sequence preferences of these peptides have been extensively characterized. Less successful, however, were efforts to establish relationships between physicochemical properties and functions of domain-peptide interactions. To our knowledge, we have devised the first strategy to exhaustively explore the binding specificity of protein domain-peptide interactions. We applied the strategy to SH3 domains to determine the properties of their binding peptides starting from various experimental data. The strategy identified the majority (~70%) of experimentally determined SH3 binding sites. We discovered mutual relationships among binding specificity, binding affinity, and structural properties and evolution of linear peptides. Remarkably, we found that these properties are also related to functional diversity, defined by depth of proteins within hierarchies of gene ontologies. Our results revealed that linear peptides evolved to coadapt specificity and affinity to functional diversity of domain-peptide interactions. Thus, domain-peptide interactions follow human-constructed gene ontologies, which suggests that our understanding of biological process hierarchies reflects the way chemical and thermodynamic properties of linear peptides and their interaction networks, in general, have evolved.
(2015) eLife. 4, 04241. Abstract
Brains organize behavior and physiology to optimize the response to threats or opportunities. We dissect how 21% O₂, an indicator of surface exposure, reprograms C. elegans' global state, inducing sustained locomotory arousal and altering expression of neuropeptides, metabolic enzymes, and other non-neural genes. The URX O₂-sensing neurons drive arousal at 21% O₂ by tonically activating the RMG interneurons. Stimulating RMG is sufficient to switch behavioral state. Ablating the ASH, ADL or ASK sensory neurons connected to RMG by gap junctions does not disrupt arousal. However, disrupting cation currents in these neurons curtails RMG neurosecretion and arousal. RMG signals high O₂ by peptidergic secretion. Neuropeptide reporters reveal neural circuit state, as neurosecretion stimulates neuropeptide expression. Neural imaging in unrestrained animals shows that URX and RMG encode O₂ concentration rather than behavior, while the activity of downstream interneurons such as AVB and AIY reflects both O₂ levels and the behavior being executed.
(2015) Structure. 23, 1, p. 3-5 Abstract
(2014) PLoS ONE. 9, 9, e106081. Abstract
Linear motifs mediate a wide variety of cellular functions, which makes their characterization in protein sequences crucial to understanding cellular systems. However, the short length and degenerate nature of linear motifs make their discovery a difficult problem. Here, we introduce MotifHound, an algorithm particularly suited for the discovery of small and degenerate linear motifs. MotifHound performs an exact and exhaustive enumeration of all motifs present in proteins of interest, including all of their degenerate forms, and scores the overrepresentation of each motif based on its occurrence in proteins of interest relative to a background (e.g., proteome) using the hypergeometric distribution. To assess MotifHound, we benchmarked it together with state-of-the-art algorithms. The benchmark consists of 11,880 sets of proteins from S. cerevisiae; in each set, we artificially spiked-in one motif varying in terms of three key parameters: (i) number of occurrences, (ii) length and (iii) the number of degenerate or "wildcard" positions. The benchmark enabled the evaluation of the impact of these three properties on the performance of the different algorithms. The results showed that MotifHound and SLiMFinder were the most accurate in detecting degenerate linear motifs. Interestingly, MotifHound was 15 to 20 times faster at comparable accuracy and performed best in the discovery of highly degenerate motifs. We complemented the benchmark by an analysis of proteins experimentally shown to bind the FUS1 SH3 domain from S. cerevisiae. Using the full-length protein partners as sole information, MotifHound recapitulated most experimentally determined motifs binding to the FUS1 SH3 domain. Moreover, these motifs exhibited properties typical of SH3 binding peptides, e.g., high intrinsic disorder and evolutionary conservation, despite the fact that none of these properties were used as prior information. MotifHound is available (http://michnick.bcm.umontreal.ca or http://t
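The overrepresentation scoring described in this abstract can be illustrated with a minimal sketch. This is not the MotifHound implementation: the function names, the use of regular expressions with `.` as the wildcard, and the toy sequences are all assumptions made for illustration. It computes the hypergeometric upper-tail probability of seeing a motif in at least k of n proteins of interest, given that K of N background proteins contain it.

```python
import re
from math import comb


def count_with_motif(sequences, pattern):
    """Count sequences containing the motif at least once.

    `pattern` is a regular expression; '.' acts as a wildcard position
    (an illustrative convention, not necessarily MotifHound's syntax).
    """
    regex = re.compile(pattern)
    return sum(1 for seq in sequences if regex.search(seq))


def hypergeom_enrichment_pvalue(k, n, K, N):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: proteins in the background, K: background proteins with the motif,
    n: proteins of interest, k: proteins of interest with the motif.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical toy data: the set of interest is a subset of the background.
background = ["MPPLPAK", "MKTAYIA", "GGPSLPSR", "AAAAAAA", "PTLPEKQ",
              "MKKLLQW", "QQRRSST", "LLVVIIA", "DDEEKKR", "GHGHGHG"]
interest = background[:3]

motif = "P.LP"  # one wildcard position
K = count_with_motif(background, motif)
k = count_with_motif(interest, motif)
p_value = hypergeom_enrichment_pvalue(k, len(interest), K, len(background))
```

A small p-value indicates that the motif occurs in the proteins of interest more often than expected from its background frequency. A real analysis would also need to correct for the many motifs tested.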
Different subunits belonging to the same protein complex often exhibit discordant expression levels and evolutionary properties(2014) Current Opinion in Structural Biology. 26, p. 113-120 Abstract
Hetero-oligomeric protein complexes are involved in many of the key processes in cells. Given that the subunits of a complex function together, it has often been expected to find that (i) they are expressed at similar levels in cells; (ii) they are simultaneously present or absent in genomes; and that (iii) the effects on fitness of deleting their genes should be similar. Such a coherence is, however, often found to be weak or absent. Multi-functionality of subunits and mechanisms of complex assembly are discussed as possible sources for the lack of coherence.
High-Resolution Mapping of Protein Concentration Reveals Principles of Proteome Architecture and Adaptation(2014) Cell Reports. 7, 4, p. 1333-1340 Abstract
A single yeast cell contains a hundred million protein molecules. How these proteins are organized to orchestrate living processes is a central question in biology. To probe this organization in vivo, we measured the local concentration of proteins based on the strength of their nonspecific interactions with a neutral reporter protein. We first used a cytosolic reporter and measured local concentrations for ~2,000 proteins in S. cerevisiae, with accuracy comparable to that of mass spectrometry. Localizing the reporter to membranes specifically increased the local concentration measured for membrane proteins. Comparing the concentrations measured by both reporters revealed that encounter frequencies between proteins are primarily dictated by their abundances. However, to change these encounter frequencies and restructure the proteome, as in adaptation, we find that changes in localization have more impact than changes in abundance. These results highlight how protein abundance and localization contribute to proteome organization and reorganization.
(2013) Cell. 155, 5, p. 983-989 Abstract
Network biologists attempt to extract meaningful relationships among genes or their products from very noisy data. We argue that what we categorize as noisy data may sometimes reflect noisy biology and therefore may shield a hidden meaning about how networks evolve and how matter is organized in the cell. We present practical solutions, based on existing evolutionary and biophysical concepts, through which our understanding of cell biology can be enormously enriched.
(2013) Oligomerization In Health And Disease. p. 25-51 (Progress in Molecular Biology and Translational Science). Abstract
In the protein universe, 30-50% of proteins self-assemble to form symmetrical complexes consisting of multiple copies of themselves, called homomers. The prevalence of homomers motivates us to review many of their properties. In Section 1, we describe the methods and challenges associated with quaternary structure inference; these methods are indeed at the basis of any analysis on homomers. In Section 2, we describe the morphological properties of homomers, as well as the database 3DComplex, which provides a taxonomy for both homomeric and heteromeric protein complexes. In Section 3, we review interface properties of homomeric complexes. In Section 4, we then present recent findings on the evolution of homomer interfaces, which we link in Section 5 to the evolution of homomers as entire entities. In Section 6, we discuss mechanisms involved in their assembly and how these mechanisms can be linked to evolution.
(2012) Proceedings of the National Academy of Sciences of the United States of America. 109, 50, p. 20461-20466 Abstract
In living cells, functional protein-protein interactions compete with a much larger number of nonfunctional, or promiscuous, interactions. Several cellular properties contribute to avoiding unwanted protein interactions, including regulation of gene expression, cellular compartmentalization, and high specificity and affinity of functional interactions. Here we investigate whether other mechanisms exist that shape the sequence and structure of proteins to favor their correct assembly into functional protein complexes. To examine this question, we project evolutionary and cellular abundance information onto 397, 196, and 631 proteins of known 3D structure from Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens, respectively. On the basis of amino acid frequencies in interface patches versus the solvent-accessible protein surface, we define a propensity or "stickiness" scale for each of the 20 amino acids. We find that the propensity to interact in a nonspecific manner is inversely correlated with abundance. In other words, high abundance proteins have less sticky surfaces. We also find that stickiness constrains protein evolution, whereby residues in sticky surface patches are more conserved than those found in nonsticky patches. Finally, we find that the constraint imposed by stickiness on protein divergence is proportional to protein abundance, which provides mechanistic insights into the correlation between protein conservation and protein abundance. Overall, the avoidance of nonfunctional interactions significantly influences the physico-chemical and evolutionary properties of proteins. Remarkably, the effects observed are consistently larger in E. coli and S. cerevisiae than in H. sapiens, suggesting that promiscuous protein-protein interactions may be freer to accumulate in the human lineage.
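To make the idea of a propensity scale concrete, here is a minimal sketch. The abstract does not give the formula, so the log-ratio of interface to surface frequencies used below is an assumption, as are the function name and the toy residue lists.

```python
from collections import Counter
from math import log


def stickiness_scale(interface_residues, surface_residues):
    """Return a per-amino-acid propensity as log(f_interface / f_surface).

    Assumed form for illustration only: positive values mean the residue
    is enriched in interfaces relative to the solvent-accessible surface.
    Amino acids absent from either set are skipped to avoid log(0).
    """
    iface = Counter(interface_residues)
    surf = Counter(surface_residues)
    n_iface = sum(iface.values())
    n_surf = sum(surf.values())
    return {
        aa: log((iface[aa] / n_iface) / (surf[aa] / n_surf))
        for aa in iface.keys() & surf.keys()
    }


# Hypothetical toy counts: leucine enriched at interfaces, lysine depleted.
scale = stickiness_scale(interface_residues="LLLK", surface_residues="LKKK")
```

Under this assumed form, a protein's surface stickiness could be summarized as the mean of the scale over its surface residues, which is the kind of quantity the abstract relates to abundance.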
Protein abundance is key to distinguish promiscuous from functional phosphorylation based on evolutionary information(2012) Philosophical Transactions Of The Royal Society B-Biological Sciences. 367, 1602, p. 2594-2606 Abstract
In eukaryotic cells, protein phosphorylation is an important and widespread mechanism used to regulate protein function. Yet, of the thousands of phosphosites identified to date, only a few hundred at best have a characterized function. It was recently shown that these functional sites are significantly more conserved than phosphosites of unknown function, stressing the importance of considering evolutionary conservation in assessing the global functional landscape of phosphosites. This leads us to review studies that examined the impact of phosphorylation on evolutionary conservation. While all these studies have shown that conservation is greater among phosphorylated sites compared with non-phosphorylated ones, the magnitude of this difference varies greatly. Further, not all studies have considered key factors that may influence the rate of phosphosite evolution. Such key factors are their localization in ordered or disordered regions, their stoichiometry or the abundance of their corresponding protein. Here we take into account all of these factors simultaneously, which reveals remarkable evolutionary patterns. First, while it is well established that protein conservation increases with abundance, we show that phosphosites partly follow an opposite trend. More precisely, Saccharomyces cerevisiae phosphosites present among abundant proteins are 1.5 times more likely to diverge in the closely related species Saccharomyces bayanus when compared with phosphosites present in the 5 per cent least abundant proteins. Second, we show that conservation is coupled to stoichiometry, whereby sites frequently phosphorylated are more conserved than those rarely phosphorylated. Finally, we provide a model of functional and noisy or 'accidental' phosphorylation that explains these observations.
(2010) Journal of Molecular Biology. 403, 4, p. 660-670 Abstract
Analysis of proteins commonly requires the partition of their structure into regions such as the surface, interior, or interface. Despite the frequent use of such categorization, no consensus definition seems to exist. This study thus aims at providing a definition that is general, is simple to implement, and yields new biological insights. This analysis relies on 397, 196, and 701 protein structures from Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens, respectively, and the conclusions are consistent across all three species. A threshold of 25% relative accessible surface area best segregates amino acids at the interior and at the surface. This value is further used to extend the core-rim model of protein-protein interfaces and to introduce a third region called support. Interface core, rim, and support regions contain similar numbers of residues on average, but core residues contribute over two-thirds of the contact surface. The amino acid composition of each region remains similar across different organisms and interface types. The interface core composition is intermediate between the surface and the interior, but the compositions of the support and the rim are virtually identical with those of the interior and the surface, respectively. The support and rim could thus "preexist" in proteins, and evolving a new interaction could require mutations to form an interface core only. Using the interface regions defined, it is shown through simulations that only two substitutions are necessary to shift the average composition of a 1000-Å² surface patch involving ~28 residues to that of an equivalent interface. This analysis and conclusions will help understand the notion of promiscuity in protein-protein interaction networks.
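The 25% threshold lends itself to a short sketch. Only the surface/interior cutoff is stated in the abstract. The rules used below to assign interface residues to support, core, and rim are one reading that is consistent with the described compositions (support resembles the interior, rim resembles the surface) and should be treated as an assumption, as should the function name and the handling of values exactly at the threshold.

```python
RSA_THRESHOLD = 0.25  # 25% relative accessible surface area (from the abstract)


def classify_residue(rsa_monomer, rsa_complex):
    """Assign a residue to a structural region from its relative ASA.

    rsa_monomer: relative ASA in the isolated subunit (0 to 1).
    rsa_complex: relative ASA within the complex (0 to 1).
    Assumed rules (not spelled out in the abstract):
      no burial on complex formation -> 'surface' or 'interior'
      buried in the monomer already  -> 'support'
      still exposed in the complex   -> 'rim'
      exposed alone, buried in complex -> 'core'
    """
    if rsa_complex > rsa_monomer:
        raise ValueError("accessibility cannot increase on complex formation")
    if rsa_monomer == rsa_complex:  # residue is not part of an interface
        return "surface" if rsa_monomer > RSA_THRESHOLD else "interior"
    if rsa_monomer <= RSA_THRESHOLD:
        return "support"
    if rsa_complex > RSA_THRESHOLD:
        return "rim"
    return "core"


# Hypothetical residues given as (rsa_monomer, rsa_complex) pairs.
examples = [(0.10, 0.10), (0.60, 0.60), (0.20, 0.05), (0.80, 0.40), (0.50, 0.10)]
regions = [classify_residue(m, c) for m, c in examples]
```

In practice the relative ASA values would come from a solvent-accessibility program run on the subunit alone and on the full complex.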
(2010) Molecular Systems Biology. 6, 423. Abstract
Amoeba use phagocytosis to internalize bacteria as a source of nutrients, whereas multicellular organisms utilize this process as a defense mechanism to kill microbes and, in vertebrates, initiate a sustained immune response. By using a large-scale approach to identify and compare the proteome and phosphoproteome of phagosomes isolated from distant organisms, and by comparative analysis over 39 taxa, we identified an 'ancient' core of phagosomal proteins around which the immune functions of this organelle have likely organized. Our data indicate that a larger proportion of the phagosome proteome, compared with the whole cell proteome, has been acquired through gene duplication at a period coinciding with the emergence of innate and adaptive immunity. Our study also characterizes in detail the acquisition of novel proteins and the significant remodeling of the phagosome phosphoproteome that contributed to modify the core constituents of this organelle in evolution. Our work thus provides the first thorough analysis of the changes that enabled the transformation of the phagosome from a phagotrophic compartment into an organelle fully competent for antigen presentation. Molecular Systems Biology 6: 423; published online 19 October 2010; doi: 10.1038/msb.2010.80
(2010) Biochemical Society Transactions. 38, 4, p. 879-882 Abstract
Homo-oligomeric protein complexes are functionally vital and highly abundant in living cells. In the present article, we review our current understanding of their geometry and evolution, including aspects of the symmetry of these complexes and their interaction interfaces. Also, we briefly discuss the pathway of their assembly in solution.
(2010) Expert Review of Proteomics. 7, 3, p. 319-322 Abstract
This Keystone symposium, entitled 'Biomolecular Interactions and Networks: function and disease', was held in Quebec City, Canada, 7-12 March 2010. The conference was distinctive in that it bridged two fields that may be perceived as having little in common: structural and systems biology. However, the growth in structural and omics data brings these two fields closer and closer. Indeed, in two sections of this article we cover talks on systematic analyses of protein structures, as well as systems level approaches that incorporate structural information. In two other sections, we report studies that aim at charting and analyzing cellular systems, and finally we discuss talks that pointed to the issue of promiscuity in biological networks.
(2010) Science. 328, 5981, p. 983-984 Abstract
Protein kinases and phosphatases may form a collaborative network of interactions to mediate cellular responses.
Physicochemical principles that regulate the competition between functional and dysfunctional association of proteins(2009) Proceedings of the National Academy of Sciences of the United States of America. 106, 25, p. 10159-10164 Abstract
To maintain protein homeostasis, a variety of quality control mechanisms, such as the unfolded protein response and the heat shock response, enable proteins to fold and to assemble into functional complexes while avoiding the formation of aberrant and potentially harmful aggregates. We show here that a complementary contribution to the regulation of the interactions between proteins is provided by the physicochemical properties of their amino acid sequences. The results of a systematic analysis of the protein-protein complexes in the Protein Data Bank (PDB) show that interface regions are more prone to aggregate than other surface regions, indicating that many of the interactions that promote the formation of functional complexes, including hydrophobic and electrostatic forces, can potentially also cause abnormal intermolecular association. We also show, however, that aggregation-prone interfaces are prevented from triggering uncontrolled assembly by being stabilized into their functional conformations by disulfide bonds and salt bridges. These results indicate that functional and dysfunctional association of proteins are promoted by similar forces but also that they are closely regulated by the presence of specific interactions that stabilize native states.
(2009) Trends in Genetics. 25, 5, p. 193-197 Abstract
Owing to their crucial roles in regulating protein function, phosphorylation sites (phosphosites) are expected to be evolutionarily conserved. However, mixed results regarding this prediction have been reported. We resolve these contrasting conclusions to show that phosphosites are, on average, more conserved than non-phosphorylated equivalent residues when their enrichment in disordered regions of proteins is taken into account. Phosphosites of known function are dramatically more conserved than those with no characterized function, indicating that the apparent rapid evolution of phosphoproteomes results from a large fraction of phosphosites being non-functional. Our findings highlight the need to use evolutionary information to identify functional regulatory features such as post-translational modifications of eukaryotic proteomes.
(2009) Science Signaling. 2, 60, 11. Abstract
Any engineered device should certainly not contain nonfunctional components, for this would be a waste of energy and money. In contrast, evolutionary theory tells us that biological systems need not be optimized and may very well accumulate nonfunctional elements. Mutational and demographic processes contribute to the cluttering of eukaryotic genomes and transcriptional networks with "junk" DNA and spurious DNA binding sites. Here, we question whether such a notion should be applied to protein interactomes-that is, whether these protein interactomes are expected to contain a fraction of nonselected, nonfunctional protein-protein interactions (PPIs), which we term "noisy." We propose a simple relationship between the fraction of noisy interactions expected in a given organism and three parameters: (i) the number of mutations needed to create and destroy interactions, (ii) the size of the proteome, and (iii) the fitness cost of noisy interactions. All three parameters suggest that noisy PPIs are expected to exist. Their existence could help to explain why PPIs determined from large-scale studies often lack functional relationships between interacting proteins, why PPIs are poorly conserved across organisms, and why the PPI space appears to be immensely large. Finally, we propose experimental strategies to estimate the fraction of evolutionary noise in PPI networks.
(2009) Physical Review Letters. 102, 11, 118106. Abstract
We introduce a simple "patchy particle" model to study the thermodynamics and dynamics of self-assembly of homomeric protein complexes. Our calculations allow us to rationalize recent results for dihedral complexes. Namely, why evolution of such complexes naturally takes the system into a region of interaction space where (i) the evolutionarily newer interactions are weaker, (ii) subcomplexes involving the stronger interactions are observed to be thermodynamically stable on destabilization of the protein-protein interactions, and (iii) the self-assembly dynamics are hierarchical with these same subcomplexes acting as kinetic intermediates.
(2008) Nature. 453, 7199, p. 1262-1265 Abstract
A homomer is formed by self-interacting copies of a protein unit. This is functionally important(1,2), as in allostery(3-5), and structurally crucial because mis-assembly of homomers is implicated in disease(6,7). Homomers are widespread, with 50-70% of proteins with a known quaternary state assembling into such structures(8,9). Despite their prevalence, their role in the evolution of cellular machinery(10,11) and the potential for their use in the design of new molecular machines(12,13), little is known about the mechanisms that drive formation of homomers at the level of evolution and assembly in the cell(9,14). Here we present an analysis of over 5,000 unique atomic structures and show that the quaternary structure of homomers is conserved in over 70% of protein pairs sharing as little as 30% sequence identity. Where quaternary structure is not conserved among the members of a protein family, a detailed investigation revealed well-defined evolutionary pathways by which proteins transit between different quaternary structure types. Furthermore, we show by perturbing subunit interfaces within complexes and by mass spectrometry analysis(15), that the (dis)assembly pathway mimics the evolutionary pathway. These data represent a molecular analogy to Haeckel's evolutionary paradigm of embryonic development, where an intermediate in the assembly of a complex represents a form that appeared in its own evolutionary history. Our model of self-assembly allows reliable prediction of evolution and assembly of a complex solely from its crystal structure.
(2008) Current Opinion in Structural Biology. 18, 3, p. 349-357 Abstract
The central role of protein-protein interactions (PPIs) in biology has stimulated colossal efforts to identify thousands of them in several organisms. The resulting PPI maps are commonly represented as graphs, where nodes denote proteins and edges represent physical interactions. However, the methods used to generate PPI data on a large scale do not readily allow one to discriminate features such as interaction strength (affinity), type (protein-protein or protein-peptide interaction) or spatiotemporal existence (where and when the proteins are present and interact). Yet, in recent years, a number of studies have tackled these limitations by projecting additional information onto PPIs, revealing novel properties in terms of their evolution and dynamics. In this review we examine these properties both at the binary interaction level and at the network level. We suggest that the diverse and sometimes contradictory results described by different research groups are mostly due to incomplete data coverage and limited data types. Finally, we discuss recently developed methods that will improve this picture in the future.
(2007) Structure. 15, 11, p. 1364-1367 Abstract
PiQSi facilitates the manual investigation of the quaternary structure of protein complexes in the Protein Data Bank (PDB). Users can browse and obtain an overview of the quaternary structure information of a given protein together with its evolutionary relatives, which helps in the determination of the biological quaternary state. I have used this framework to annotate over 10,000 structures from the PDB Biological Unit and corrected the quaternary state of ~15% of them. A benchmark shows that the annotations are of high quality and stresses the need for manual curation, in particular for ambiguous cases such as proteins in equilibrium between two quaternary states. The ~10,000 annotations already in the database can be used to improve the accuracy of analyses on protein structure or to benchmark methods that predict protein quaternary structure. In addition, PiQSi incorporates a community-based curation system, which I hope will allow us to reach an accurate and complete description of the biological quaternary state of proteins in PDB. PiQSi is accessible at http://www.PiQSi.org/.
(2007) BMC Bioinformatics. 8, 3. Abstract
Using a previously developed automated method for enzyme annotation, we report the re-annotation of the ENZYME database and the analysis of local error rates per class. In control experiments, we demonstrate that the method is able to correctly re-annotate 91% of all Enzyme Classification (EC) classes with high coverage (755 out of 827). Only 44 enzyme classes are found to contain false positives, while the remaining 28 enzyme classes are not represented. We also show cases where the re-annotation procedure results in partial overlaps for those few enzyme classes where a certain inconsistency might appear between homologous proteins, mostly due to function specificity. Our results allow the interactive exploration of the EC hierarchy for known enzyme families as well as putative enzyme sequences that may need to be classified within the EC hierarchy. These aspects of our framework have been incorporated into a web-server, called CORRIE, which stands for Correspondence Indicator Estimation and allows the interactive prediction of a functional class for putative enzymes from sequence alone, supported by probabilistic measures in the context of the pre-calculated Correspondence Indicators of known enzymes with the functional classes of the EC hierarchy. The CORRIE server is available at: http://www.genomes.org/services/corrie/.
(2007) GENOME BIOLOGY. 8, 4, 51. Abstract
Background: Cellular functions are accomplished by the concerted actions of functional modules. The mechanisms driving the emergence and evolution of these modules are still unclear. Here we investigate the evolutionary origins of protein complexes, modules in physical protein-protein interaction networks. Results: We studied protein complexes in Saccharomyces cerevisiae, complexes of known three-dimensional structure in the Protein Data Bank and clusters of pairwise protein interactions in the networks of several organisms. We found that duplication of homomeric interactions, a large class of protein interactions, frequently results in the formation of complexes of paralogous proteins. This route is a common mechanism for the evolution of complexes and clusters of protein interactions. Our conclusions are further confirmed by theoretical modelling of network evolution. We propose reasons for why this is favourable in terms of structure and function of protein complexes. Conclusion: Our study provides the first insight into the evolution of functional modularity in protein-protein interaction networks, and the origins of a large class of protein complexes.
(2006) PLoS Computational Biology. 2, 11, p. 1395-1406 155. Abstract
Most of the proteins in a cell assemble into complexes to carry out their function. It is therefore crucial to understand the physicochemical properties as well as the evolution of interactions between proteins. The Protein Data Bank represents an important source of information for such studies, because more than half of the structures are homo- or heteromeric protein complexes. Here we propose the first hierarchical classification of whole protein complexes of known 3-D structure, based on representing their fundamental structural features as a graph. This classification provides the first overview of all the complexes in the Protein Data Bank and allows nonredundant sets to be derived at different levels of detail. This reveals that between one-half and two-thirds of known structures are multimeric, depending on the level of redundancy accepted. We also analyse the structures in terms of the topological arrangement of their subunits and find that they form a small number of arrangements compared with all theoretically possible ones. This is because most complexes contain four subunits or less, and the large majority are homomeric. In addition, there is a strong tendency for symmetry in complexes, even for heteromeric complexes. Finally, through comparison of Biological Units in the Protein Data Bank with the Protein Quaternary Structure database, we identified many possible errors in quaternary structure assignments. Our classification, available as a database and Web server at http://www.3Dcomplex.org, will be a starting point for future work aimed at understanding the structure and evolution of protein complexes.
(2006) Philosophical Transactions Of The Royal Society B-Biological Sciences. 361, 1467, p. 507-517 Abstract
Modularity is an attribute of a system that can be decomposed into a set of cohesive entities that are loosely coupled. Many cellular networks can be decomposed into functional modules, each functionally separable from the other modules. The protein complexes in physical protein interaction networks are a good example of this, and here we focus on their origins and evolution. We investigate the emergence of protein complexes and physical interactions between proteins by duplication, and review other mechanisms. We dissect the dataset of protein complexes of known three-dimensional structure, and show that roughly 90% of these complexes contain contacts between identical proteins within the same complex. Proteins that are shared across different complexes occur frequently, and they tend to be essential genes more often than members of a single protein complex. We also provide a perspective on the evolutionary mechanisms driving the growth of other modular cellular networks such as transcriptional regulatory and metabolic networks.
(2005) BMC Bioinformatics. 6, 302. Abstract
Background: One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on the identification of clusters of homologous proteins and the mapping of new proteins onto these clusters, which are expected to share functional characteristics. Results: Here, we inverse the logic of this process, by considering the mapping of sequences directly to a functional classification instead of mapping functions to a sequence clustering. In this mode, the starting point is a database of labelled proteins according to a functional classification scheme, and the subsequent use of sequence similarity allows defining the membership of new proteins to these functional classes. In this framework, we define the Correspondence Indicators as measures of relationship between sequence and function and further formulate two Bayesian approaches to estimate the probability for a sequence of unknown function to belong to a functional class. This approach allows the parametrisation of different sequence search strategies and provides a direct measure of annotation error rates. We validate this approach with a database of enzymes labelled by their corresponding four-digit EC numbers and analyse specific cases. Conclusion: The performance of this method is significantly higher than the simple strategy consisting in transferring the annotation from the highest scoring BLAST match and is expected to find applications in automated functional annotation pipelines.
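The abstract above does not spell out its two Bayesian formulations, so the following is only a minimal sketch of the general idea: applying Bayes' rule to decide whether a query sequence belongs to a functional class, given that it shows sequence similarity to labelled members of that class. The function names, the two-hypothesis setup, and the count-based likelihood estimates are illustrative assumptions, not the published Correspondence Indicator definitions.

```python
def likelihoods_from_counts(hits_in_class: int, class_size: int,
                            hits_outside: int, outside_size: int) -> tuple[float, float]:
    """Estimate P(hit | in class) and P(hit | not in class) from a labelled database.

    Hypothetical helper: the counts are assumed to come from searching
    labelled proteins against the class and tallying similarity hits.
    """
    if class_size <= 0 or outside_size <= 0:
        raise ValueError("both groups must contain at least one labelled protein")
    return hits_in_class / class_size, hits_outside / outside_size


def posterior_membership(prior: float, p_hit_given_class: float,
                         p_hit_given_other: float) -> float:
    """Bayes' rule for two hypotheses: the query is in the class, or it is not."""
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be a probability")
    numerator = p_hit_given_class * prior
    evidence = numerator + p_hit_given_other * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the hit has zero probability under both hypotheses")
    return numerator / evidence


if __name__ == "__main__":
    # Made-up counts for illustration only, not data from the paper.
    p_in, p_out = likelihoods_from_counts(90, 100, 10, 10000)
    print(posterior_membership(prior=0.01, p_hit_given_class=p_in,
                               p_hit_given_other=p_out))
```

Under these assumptions, one minus the posterior gives a per-class estimate of the chance that transferring the annotation is wrong, which is the kind of direct error measure the abstract refers to.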