(2020) Nature (London). 587, p. 291-296 Abstract
Transcription factors recognize specific genomic sequences to regulate complex gene-expression programs. Although it is well-established that transcription factors bind to specific DNA sequences using a combination of base readout and shape recognition, some fundamental aspects of protein-DNA binding remain poorly understood. Many DNA-binding proteins induce changes in the structure of the DNA outside the intrinsic B-DNA envelope. However, how the energetic cost that is associated with distorting the DNA contributes to recognition has proven difficult to study, because the distorted DNA exists in low abundance in the unbound ensemble. Here we use a high-throughput assay that we term SaMBA (saturation mismatch-binding assay) to investigate the role of DNA conformational penalties in transcription factor-DNA recognition. In SaMBA, mismatched base pairs are introduced to pre-induce structural distortions in the DNA that are much larger than those induced by changes in the Watson-Crick sequence. Notably, approximately 10% of mismatches increased transcription factor binding, and for each of the 22 transcription factors that were examined, at least one mismatch was found that increased the binding affinity. Mismatches also converted non-specific sites into high-affinity sites, and high-affinity sites into 'super sites' that exhibit stronger affinity than any known canonical binding site. Determination of high-resolution X-ray structures, combined with nuclear magnetic resonance measurements and structural analyses, showed that many of the DNA mismatches that increase binding induce distortions that are similar to those induced by protein binding, thus prepaying some of the energetic cost incurred from deforming the DNA. Our work indicates that conformational penalties are a major determinant of protein-DNA recognition, and reveals mechanisms by which mismatches can recruit transcription factors and thus modulate replication and repair activities in the cell.
(2020) Biomolecules (Basel, Switzerland). 10, 9, 1299. Abstract
In the process of transcription initiation by RNA polymerase, promoter DNA sequences affect multiple reaction pathways determining the productivity of transcription. However, the question of how the molecular mechanism of transcription initiation depends on the sequence properties of promoter DNA remains poorly understood. Here, combining the statistical mechanical approach with high-throughput sequencing results, we characterize abortive transcription and pausing during transcription initiation by Escherichia coli RNA polymerase at a genome-wide level. Our results suggest that initially transcribed sequences, when enriched with thymine bases, contain the signal for inducing abortive transcription, whereas certain repetitive sequence elements embedded in promoter regions constitute the signal for inducing pausing. Both signals decrease the productivity of transcription initiation. Based on solution NMR and in vitro transcription measurements, we suggest that repetitive sequence elements within the promoter DNA modulate the nonlocal base pair stability of its double-stranded form. This stability profoundly influences the reaction coordinates of the productive initiation via pausing.
(2019) JoVE journal. 2019, 152, Abstract
DNA primase synthesizes short RNA primers that initiate DNA synthesis of Okazaki fragments on the lagging strand by DNA polymerase during DNA replication. The binding of prokaryotic DnaG-like primases to DNA occurs at a specific trinucleotide recognition sequence. It is a pivotal step in the formation of Okazaki fragments. Conventional biochemical tools that are used to determine the DNA recognition sequence of DNA primase provide only limited information. Using a high-throughput microarray-based binding assay and consecutive biochemical analyses, it has been shown that 1) the specific binding context (flanking sequences of the recognition site) influences the binding strength of the DNA primase to its template DNA, and 2) stronger binding of primase to the DNA yields longer RNA primers, indicating higher processivity of the enzyme. This method, which combines protein binding microarrays (PBM) with a primase activity assay, is designated high-throughput primase profiling (HTPP). It allows characterization of specific sequence recognition by DNA primase with unprecedented speed and scalability.
(2019) Biochimica et biophysica acta. General subjects. 1863, 9, p. 1343-1350 Abstract
The signal transducer and activator of transcription 3 (STAT3) protein is activated by phosphorylation of a specific tyrosine residue (Tyr705) in response to various extracellular signals. STAT3 activity was also found to be regulated by acetylation of Lys685. However, the molecular mechanism by which Lys685 acetylation affects the transcriptional activity of STAT3 remains elusive. By genetically encoding the co-translational incorporation of acetyl-lysine into position Lys685 and co-expression of STAT3 with the Elk receptor tyrosine kinase, we were able to characterize site-specifically acetylated, and simultaneously acetylated and phosphorylated STAT3. We measured the effect of acetylation on the crystal structure, and DNA binding affinity and specificity of Tyr705-phosphorylated and non-phosphorylated STAT3. In addition, we monitored the deacetylation of acetylated Lys685 by reconstituting the mammalian enzymatic deacetylation reaction in live bacteria. Surprisingly, we found that acetylation, per se, had no effect on the crystal structure, and DNA binding affinity or specificity of STAT3, implying that the previously observed acetylation-dependent transcriptional activity of STAT3 involves an additional cellular component. In addition, we discovered that Tyr705-phosphorylation protects Lys685 from deacetylation in bacteria, providing a new possible explanation for the observed correlation between STAT3 activity and Lys685 acetylation.
• STAT3 Tyr705 phosphorylation protects Lys685 from deacetylation.
• The crystal structure of Lys685-acetylated and Tyr705-phosphorylated STAT3 is similar to the structure of non-acetylated STAT3.
• Lys685 acetylation, per se, has no effect on DNA binding affinity and specificity of STAT3.
QBiC-Pred: quantitative predictions of transcription factor binding changes due to sequence variants(2019) Nucleic Acids Research. 47, W1, p. W127-W135 Abstract
Non-coding genetic variants/mutations can play functional roles in the cell by disrupting regulatory interactions between transcription factors (TFs) and their genomic target sites. For most human TFs, a myriad of DNA-binding models are available and could be used to predict the effects of DNA mutations on TF binding. However, information on the quality of these models is scarce, making it hard to evaluate the statistical significance of predicted binding changes. Here, we present QBiC-Pred, a web server for predicting quantitative TF binding changes due to nucleotide variants. QBiC-Pred uses regression models of TF binding specificity trained on high-throughput in vitro data. The training is done using ordinary least squares (OLS), and we leverage distributional results associated with OLS estimation to compute, for each predicted change in TF binding, a P-value reflecting our confidence in the predicted effect. We show that OLS models are accurate in predicting the effects of mutations on TF binding in vitro and in vivo, outperforming widely-used PWM models as well as recently developed deep learning models of specificity. QBiC-Pred takes as input mutation datasets in several formats, and it allows post-processing of the results through a user-friendly web interface. QBiC-Pred is freely available at http://qbic.genome.duke.edu.
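The OLS-plus-P-value idea described in this abstract can be illustrated with a minimal sketch. This is not the QBiC-Pred implementation: the 3-mer count features, the synthetic training data, the hypothetical helper names (`kmer_counts`, `fit_ols`, `predict_change`), and the normal approximation used for the P-value are all assumptions made for illustration. The sketch fits an ordinary least squares model, then uses the estimated coefficient covariance to attach a z-score and two-sided P-value to the predicted binding change between a wild-type and a mutant sequence.

```python
import itertools
import math

import numpy as np

K = 3  # k-mer length (an assumption for this sketch)
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_counts(seq):
    """Count every overlapping k-mer in seq; returns a vector of length 4**K."""
    x = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        x[INDEX[seq[i:i + K]]] += 1.0
    return x


def fit_ols(seqs, y):
    """Fit y ~ X @ beta by least squares; return beta and its covariance."""
    X = np.array([kmer_counts(s) for s in seqs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)  # residual degrees of freedom
    sigma2 = float(resid @ resid) / dof          # noise variance estimate
    cov = sigma2 * np.linalg.pinv(X.T @ X)       # Cov(beta) under OLS assumptions
    return beta, cov


def predict_change(wt, mut, beta, cov):
    """Predicted binding change (mut - wt), z-score, two-sided P-value."""
    c = kmer_counts(mut) - kmer_counts(wt)       # contrast between the two sites
    delta = float(c @ beta)
    se = math.sqrt(float(c @ cov @ c))           # standard error of the contrast
    z = delta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))       # normal approximation (assumed)
    return delta, z, p


# Synthetic demo: only the 3-mer "GAT" contributes to the simulated signal.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=12)) for _ in range(500)]
true_beta = np.zeros(len(KMERS))
true_beta[INDEX["GAT"]] = 1.0
y = np.array([kmer_counts(s) @ true_beta for s in seqs])
y = y + rng.normal(0.0, 0.1, len(seqs))

beta, cov = fit_ols(seqs, y)
# The A->C substitution destroys the single "GAT" in the wild-type site.
delta, z, p = predict_change("ACGATCCA", "ACGCTCCA", beta, cov)
print(f"predicted change = {delta:.2f}, z = {z:.1f}, p = {p:.2g}")
```

Working with the contrast vector (mutant features minus wild-type features) means the uncertainty of the predicted change accounts for correlations between coefficient estimates, rather than treating each k-mer weight as independent.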
Toward deciphering the mechanistic role of variations in the Rep1 repeat site in the transcription regulation of SNCA gene(2018) Neurogenetics. 19, 3, p. 135-144 Abstract
Short structural variants (variants other than single nucleotide polymorphisms) are hypothesized to contribute to many complex diseases, possibly by modulating gene expression. However, the molecular mechanisms by which noncoding short structural variants exert their effects on gene regulation have not been discovered. Here, we study simple sequence repeats (SSRs), a common class of short structural variants. Previously, we showed that repetitive sequences can directly influence the binding of transcription factors to their proximate recognition sites, a mechanism we termed non-consensus binding. In this study, we focus on the SSR termed Rep1, which was associated with Parkinson’s disease (PD) and has been implicated in the cis-regulation of the PD-risk SNCA gene. We show that Rep1 acts via the non-consensus binding mechanism to affect the binding of transcription factors from the GATA and ELK families to their specific sites located right next to the Rep1 repeat. Next, we performed an expression analysis to further our understanding regarding the GATA and ELK family members that are potentially relevant for SNCA transcriptional regulation in health and disease. Our analysis indicates a potential role for GATA2, consistent with previous reports. Our study proposes non-consensus transcription factor binding as a potential mechanism through which noncoding repeat variants could exert their pathogenic effects by regulating gene expression.
(2018) iScience. 2, p. 141-147 Abstract
Primases are key enzymes involved in DNA replication. They act on single-stranded DNA and catalyze the synthesis of short RNA primers used by DNA polymerases. Here, we investigate the DNA binding and activity of the bacteriophage T7 primase using a new workflow called high-throughput primase profiling (HTPP). Using a unique combination of high-throughput binding assays and biochemical analyses, HTPP reveals a complex landscape of binding specificity and functional activity for the T7 primase, determined by sequences flanking the primase recognition site. We identified specific features, such as G/T-rich flanks, which increase primase-DNA binding up to 10-fold and, surprisingly, also increase the length of newly formed RNA (up to 3-fold). To our knowledge, variability in primer length has not been reported for this primase. We expect that applying HTPP to additional enzymes will reveal new insights into the effects of DNA sequence composition on the DNA recognition and functional activity of primases.
• New HTPP workflow enables high-throughput profiling of primase binding and activity
• Sequence context of GTC recognition sites strongly influences binding by T7 primase
• Processivity of the T7 primase is significantly affected by template sequence
• T7 primase forms longer primers from templates with higher DNA-binding affinity
Biochemical Mechanism; Molecular Biology; Molecular Genetics
(2016) Proceedings of the National Academy of Sciences - PNAS. 113, 47, p. E7409-E7417 Abstract
In the process of transcription elongation, RNA polymerase (RNAP) pauses at highly nonrandom positions across genomic DNA, broadly regulating transcription; however, molecular mechanisms responsible for the recognition of such pausing positions remain poorly understood. Here, using a combination of statistical mechanical modeling and high-throughput sequencing and biochemical data, we evaluate the effect of thermal fluctuations on the regulation of RNAP pausing. We demonstrate that diffusive backtracking of RNAP, which is biased by repetitive DNA sequence elements, causes transcriptional pausing. This effect stems from the increased microscopic heterogeneity of an elongation complex, and thus is entropy-dominated. This report shows a linkage between repetitive sequence elements encoded in the genome and regulation of RNAP pausing driven by thermal fluctuations.
Nonconsensus Protein Binding to Repetitive DNA Sequence Elements Significantly Affects Eukaryotic Genomes(2015) PLoS Computational Biology. 11, 8, e1004429. Abstract
Recent genome-wide experiments in different eukaryotic genomes provide an unprecedented view of transcription factor (TF) binding locations and of nucleosome occupancy. These experiments revealed that a large fraction of TF binding events occur in regions where only a small number of specific TF binding sites (TFBSs) have been detected. Furthermore, in vitro protein-DNA binding measurements performed for hundreds of TFs indicate that TFs bind with a wide range of affinities to different DNA sequences that lack known consensus motifs. These observations have thus challenged the classical picture of specific protein-DNA binding and strongly suggest the existence of additional recognition mechanisms that affect protein-DNA binding preferences. We have previously demonstrated that repetitive DNA sequence elements characterized by certain symmetries statistically affect protein-DNA binding preferences. We call this binding mechanism nonconsensus protein-DNA binding in order to emphasize the point that specific consensus TFBSs do not contribute to this effect. In this paper, using the simple statistical mechanics model developed previously, we calculate the nonconsensus protein-DNA binding free energy for the entire C. elegans and D. melanogaster genomes. Using the available chromatin immunoprecipitation followed by sequencing (ChIP-seq) results on TF-DNA binding preferences for ~100 TFs, we show that DNA sequences characterized by low predicted free energy of nonconsensus binding have statistically higher experimental TF occupancy and lower nucleosome occupancy than sequences characterized by high free energy of nonconsensus binding. This is in agreement with our previous analysis performed for the yeast genome. We suggest therefore that nonconsensus protein-DNA binding assists the formation of nucleosome-free regions, as TFs outcompete nucleosomes at genomic locations with enhanced nonconsensus binding.
In addition, here we perform a new, large-scale analysis using in vitro TF-DNA preferences obtained from the universal protein binding microarrays (PBM) for ~90 eukaryotic TFs belonging to 22 different DNA-binding domain types. As a result of this new analysis, we conclude that nonconsensus protein-DNA binding is a widespread phenomenon that significantly affects protein-DNA binding preferences and need not require the presence of consensus (specific) TFBSs in order to achieve genome-wide TF-DNA binding specificity.
(2014) Proceedings of the National Academy of Sciences - PNAS. 111, 48, p. 17140-17145 Abstract
Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)-DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF-DNA binding preferences. We used high-throughput protein-DNA binding assays to measure the binding levels and free energies of binding for several human TFs to tens of thousands of short DNA sequences with varying repeat symmetries. Based on statistical mechanics modeling, we identify a new protein-DNA binding mechanism induced by DNA sequence symmetry in the absence of specific base-pair recognition, and experimentally demonstrate that this mechanism indeed governs protein-DNA binding preferences.
Positive and Negative Design for Nonconsensus Protein-DNA Binding Affinity in the Vicinity of Functional Binding Sites(2013) Biophysical Journal. 105, 7, p. 1653-1660 Abstract
Recent experiments provide an unprecedented view of protein-DNA binding in yeast and human genomes at single-nucleotide resolution. These measurements, performed over large cell populations, show quite generally that sequence-specific transcription regulators with well-defined protein-DNA consensus motifs bind only a fraction of all consensus motifs present in the genome. Conversely, proteins in vivo often bind DNA regions lacking known consensus sequences. The rules determining whether a consensus motif is functional remain incompletely understood. Here we predict that genomic background surrounding specific protein-DNA binding motifs statistically modulates the binding of sequence-specific transcription regulators to these motifs. In particular, we show that nonconsensus protein-DNA binding in yeast is statistically enhanced, on average, around functional Reb1 motifs that are bound as compared to nonfunctional Reb1 motifs that are unbound. The landscape of nonconsensus protein-DNA binding around functional CTCF motifs in human demonstrates a more complex behavior. In particular, human genomic regions characterized by the highest CTCF occupancy show a statistically reduced level of nonconsensus protein-DNA binding. Our findings suggest that nonconsensus protein-DNA binding is fine-tuned around functional binding sites using a variety of design strategies.
Genome-Wide Organization of Eukaryotic Preinitiation Complex Is Influenced by Nonconsensus Protein-DNA Binding(2013) Biophysical Journal. 104, 5, p. 1107-1115 Abstract
Genome-wide binding preferences of the key components of eukaryotic preinitiation complex (PIC) have been recently measured at high resolution in Saccharomyces cerevisiae by Rhee and Pugh. However, the rules determining the PIC binding specificity remain poorly understood. In this study, we show that nonconsensus protein-DNA binding significantly influences PIC binding preferences. We estimate that such nonconsensus binding contributes statistically at least 2–3 kcal/mol (on average) of additional attractive free energy per protein per core-promoter region. The predicted attractive effect is particularly strong at repeated poly(dA:dT) and poly(dC:dG) tracts. Overall, the computed free-energy landscape of nonconsensus protein-DNA binding shows strong correlation with the measured genome-wide PIC occupancy. Remarkably, statistical PIC preferences of binding to both TFIID-dominated and SAGA-dominated genes correlate with the nonconsensus free-energy landscape, yet these two groups of genes are distinguishable based on the average free-energy profiles. We suggest that the predicted nonconsensus binding mechanism provides a genome-wide background for specific promoter elements, such as transcription-factor binding sites, TATA-like elements, and specific binding of the PIC components to nucleosomes. We also show that nonconsensus binding has genome-wide influence on transcriptional frequency.
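To give a sense of scale for the 2–3 kcal/mol figure quoted in this abstract, the short sketch below converts a free-energy difference into a Boltzmann factor, exp(ΔG/RT). The temperature of 298.15 K and the reading of this factor as a relative binding weight are assumptions made here for illustration; the abstract does not state either. The function name `boltzmann_fold_change` is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_fold_change(delta_g_kcal, temperature_k=298.15):
    """Boltzmann factor exp(dG / RT) for an attractive free energy dG > 0.

    delta_g_kcal: magnitude of the extra attractive free energy, kcal/mol.
    temperature_k: absolute temperature (298.15 K is assumed, not sourced).
    """
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


# Evaluate at the two ends of the range quoted in the abstract.
for dg in (2.0, 3.0):
    factor = boltzmann_fold_change(dg)
    print(f"{dg:.0f} kcal/mol corresponds to a factor of about {factor:.0f}")
```

Because RT is only about 0.6 kcal/mol near room temperature, a few kcal/mol of additional attraction corresponds to a change of one to two orders of magnitude in the Boltzmann weight.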
(2012) Biophysical Journal. 102, 8, p. 1881-1888 Abstract
Recent genome-wide measurements of binding preferences of ~200 transcription regulators in the vicinity of transcription start sites in yeast have provided a unique insight into the cis-regulatory code of a eukaryotic genome. Here, we show that nonspecific transcription factor (TF)-DNA binding significantly influences binding preferences of the majority of transcription regulators in promoter regions of the yeast genome. We show that promoters of SAGA-dominated and TFIID-dominated genes can be statistically distinguished based on the landscape of nonspecific protein-DNA binding free energy. In particular, we predict that promoters of SAGA-dominated genes possess wider regions of reduced free energy compared to promoters of TFIID-dominated genes. We also show that specific and nonspecific TF-DNA binding are functionally linked and cooperatively influence gene expression in yeast. Our results suggest that nonspecific TF-DNA binding is intrinsically encoded into the yeast genome, and it may play a more important role in transcriptional regulation than previously thought.
(2011) Biophysical Journal. 101, 10, p. 2465-2475 Abstract
Quantitative understanding of the principles regulating nucleosome occupancy on a genome-wide level is a central issue in eukaryotic genomics. Here, we address this question using budding yeast, Saccharomyces cerevisiae, as a model organism. We perform a genome-wide computational analysis of the nonspecific transcription factor (TF)-DNA binding free-energy landscape and compare this landscape with experimentally determined nucleosome-binding preferences. We show that DNA regions with enhanced nonspecific TF-DNA binding are statistically significantly depleted of nucleosomes. We suggest therefore that the competition between TFs with histones for nonspecific binding to genomic sequences might be an important mechanism influencing nucleosome-binding preferences in vivo. We also predict that poly(dA:dT) and poly(dC:dG) tracts represent genomic elements with the strongest propensity for nonspecific TF-DNA binding, thus allowing TFs to outcompete nucleosomes at these elements. Our results suggest that nonspecific TF-DNA binding might provide a barrier for statistical positioning of nucleosomes throughout the yeast genome. We predict that the strength of this barrier increases with the concentration of DNA binding proteins in a cell. We discuss the connection of the proposed mechanism with the recently discovered pathway of active nucleosome reconstitution.
(2011) The Journal of chemical physics. 135, 6, 065104. Abstract
We predict analytically that diagonal correlations of amino acid positions within protein sequences statistically enhance protein propensity for nonspecific binding. We use the term “promiscuity” to describe such nonspecific binding. Diagonal correlations represent statistically significant repeats of sequence patterns where amino acids of the same type are clustered together. The predicted effect is qualitatively robust with respect to the form of the microscopic interaction potentials and the average amino acid composition. Our analytical results provide an explanation for the enhanced diagonal correlations observed in hubs of eukaryotic organismal proteomes [J. Mol. Biol. 409, 439 (2011); https://doi.org/10.1016/j.jmb.2011.03.056]. We suggest experiments that will allow direct testing of the predicted effect.
(2011) Journal of Molecular Biology. 409, 3, p. 439-449 Abstract
Numerous experiments demonstrate a high level of promiscuity and structural disorder in organismal proteomes. Here, we ask the question what makes a protein promiscuous, that is, prone to nonspecific interactions, and structurally disordered. We predict that multi-scale correlations of amino acid positions within protein sequences statistically enhance the propensity for promiscuous intra- and inter-protein binding. We show that sequence correlations between amino acids of the same type are statistically enhanced in structurally disordered proteins and in hubs of organismal proteomes. We also show that structurally disordered proteins possess a significantly higher degree of sequence order than structurally ordered proteins. We develop an analytical theory for this effect and predict the robustness of our conclusions with respect to the amino acid composition and the form of the microscopic potential between the interacting sequences. Our findings have implications for understanding molecular mechanisms of protein aggregation diseases induced by the extension of sequence repeats.