Machine Learning for Single Cell Genomics

Methods for profiling the molecular content of individual cells at high throughput (collectively known as single cell genomics) provide a powerful and increasingly popular way to study biology – from questions of basic science (e.g., how cells respond to certain stimulations) to translational applications (e.g., stratifying patient populations or screening for drug targets). Realizing this in practice, however, is challenging. The first challenge is conceptual – given the ever-growing richness of single cell assays (e.g., profiling different molecule types, employing labeling strategies), there is a constant need to envision new ways in which insight could be drawn. The second challenge is technical – the ensuing data is affected by a plethora of confounders, which make it difficult to process and interpret. This raises the need for algorithms that address these imperfections and can estimate our confidence in the results.

Over the past couple of years, we have been developing computational tools that build on and extend advances in statistical machine learning and other disciplines to offer new ways to draw insight from single cell genomics and, at the same time, model bias and uncertainty in the data. Examples from this work include: identification and annotation of cell states, integration of samples across studies and technologies, feature selection, comparative analysis, joint representation of multi-modal measurements (e.g., chromatin and RNA), analysis of sample-level variation in cohort studies, and visualization and interactive exploration. We have also developed ways to harness the variation between cells in order to infer models that describe molecular processes, with a focus on metabolic flux, transcription, and clonal expansion.

Our group is committed to making our methods accessible via open source software. Our software packages cover a range of functionalities and data modalities in single cell genomics and other areas of genomics. In one of our major efforts, we developed scvi-tools, a software suite for deep probabilistic modeling in single cell genomics (scvi stands for Single Cell Variational Inference). scvi-tools covers a variety of functionalities for different data types (chromatin, RNA, proteins, DNA methylation, spatial assays) and analysis scenarios (e.g., cohort studies, comparative analysis) that are based on models developed by us or contributed to our code base by others. The primary goal of scvi-tools is to make probabilistic analysis for single cell genomics readily accessible. A second goal is to provide a development environment in which other groups can build new models, and to establish the corresponding software repository. We aim to achieve these goals by making available a variety of tutorials and user guides and by adhering to software standards, which we help establish in our role as members of the scverse consortium.

scVI tools

Relevant publications

  • Consensus prediction of cell type labels with popV. C. Ergen, G. Xing, C. Xu, M. Jayasuriya, E. McGeever, A.O. Pisco, A. Streets, N. Yosef (bioRxiv)
     
  • Calibrated Identification of Feature Dependencies in Single-cell Multiomics. P. Boyeau, S. Bates, C. Ergen, MI. Jordan, N. Yosef (bioRxiv)
     
  • Deep generative modeling for quantifying sample-level heterogeneity in single-cell omics. P. Boyeau*, J. Hong*, A. Gayoso, M. Jordan, E. Azizi, C. Ergen, N. Yosef (bioRxiv)
     
  • Scvi-hub: an actionable repository for model-driven single cell analysis. C. Ergen, VV. Pour Amiri, M. Kim, A. Streets, A. Gayoso, N. Yosef (bioRxiv)

  • AutoEval Done Right: Using Synthetic Data for Model Evaluation. P. Boyeau, AN. Angelopoulos, N. Yosef, J. Malik, MI. Jordan (arXiv)
     
  • Deep generative modeling of transcriptional dynamics for RNA velocity analysis in single cells. A. Gayoso*, P. Weiler*, M. Lotfollahi, D. Klein, J. Hong, A. Streets, FJ. Theis†, N. Yosef†. Nature Methods, 2024
     
  • NicheVI: A probabilistic framework to embed cellular interaction in spatial transcriptomics. N. Levy, F. Ingelfinger, P. Boyeau, B. Nadler, C. Ergen, N. Yosef. ICLR workshop on Machine Learning for Genomics Explorations, 2024
     
  • The scverse project provides a computational ecosystem for single-cell omics data analysis. Virshup I*, Bredikhin D*, Heumos L*, Palla G*, Sturm G*, Gayoso A*, Kats I, Koutrouli M; Scverse Community; Berger B, Pe’er D, Regev A, Teichmann SA, Finotello F†, Wolf FA†, Yosef N†, Stegle O†, Theis FJ†. Nature Biotechnology, 2023
     
  • MultiVI: deep generative model for the integration of multi-modal data. T. Ashuach*, M. Gabitto*, M. Jordan, N. Yosef. Nature Methods, 2023
     
  • An Empirical Bayes Method for Differential Expression Analysis of Single Cells with Deep Generative Models. P. Boyeau, J. Regier, A. Gayoso, M.I. Jordan, R. Lopez, N. Yosef. Proceedings of the National Academy of Sciences, 202
     
  • The Tabula Sapiens: a single cell transcriptomic atlas of multiple organs from individual human donors. The Tabula Sapiens Consortium. Science, 202
     
  • Identifying cell-state associated alternative splicing events and their co-regulation. CF. Buen Abad Najar, P. Burra, N. Yosef†, LF. Lareau†. Genome Research, 2022
     
  • A Python library for deep probabilistic analysis of single-cell omics data. A. Gayoso*, R. Lopez*, G. Xing*, P. Boyeau, K. Wu, M. Jayasuriya, E. Mehlman, M. Langevin, Y. Liu, J. Samaran, G. Misrachi, A. Nazaret, O. Clivio, C. Xu, T. Ashuach, M. Lotfollahi, V. Svensson, E. da Veiga Beltrame, C. Talavera-Lopez, L. Pachter, F.J. Theis, A. Streets, M.I. Jordan, J. Regier, N. Yosef. Nature Biotechnology, 2022
     
  • Multi-resolution deconvolution of spatial transcriptomics data reveals continuous patterns of inflammation. R. Lopez, B. Li, H. Keren-Shaul, P. Boyeau, M. Kedmi, D. Pilzer, A. Jelinski, E. David, A. Wagner, Y. Addadi, M.I. Jordan, I. Amit†, N. Yosef†. Nature Biotechnology, 2022
     
  • PeakVI: A Deep Generative Model for Single Cell Chromatin Accessibility Analysis. T. Ashuach, DA. Reidenbach, A. Gayoso, N. Yosef. Cell Reports Methods, 2022
     
  • Identifying systematic variation at the single-cell level by leveraging low-resolution population-level data. E. Rahmani, MI. Jordan, N. Yosef. RECOMB, 2022
     
  • Joint probabilistic modeling of paired transcriptome and proteome measurements in single cells. A. Gayoso*, Z. Steier*, R. Lopez, J. Regier, KL. Nazor, A. Streets, N. Yosef. Nature Methods 2021; 18(3):272-282. doi: 10.1038/s41592-020-01050-x
     
  • Probabilistic harmonization and Annotation of Single-cell Transcriptomics data with Deep Generative Models. C. Xu*, R. Lopez*, E. Mehlman*, J. Regier, M.I. Jordan, N. Yosef. Molecular Systems Biology 2021; 17:e9620. doi: 10.15252/msb.20209620
     
  • Epitome: Predicting epigenetic events in novel cell types with multi-cell deep ensemble learning. AK. Morrow, JW. Hughes, J. Singh, AD. Joseph, N. Yosef. Nucleic Acids Research 2021
     
  • Identifying Informative Gene Modules Across Modalities of Single Cell Genomics. D. DeTomaso and N. Yosef. Cell Systems 2021; 2(5):446-456.e9. doi: 10.1016/j.cels.2021.04.005
     
  • Enhancing Scientific Discoveries in Molecular Biology with Deep Generative Models. R. Lopez, A. Gayoso & N. Yosef. Molecular Systems Biology 2020. doi:10.15252/msb.20199198
     
  • Coverage-dependent bias creates the appearance of binary splicing in single cells. CF. Buen Abad Najar, N. Yosef, LF. Lareau. eLife 2020; 9:e54603. doi: 10.7554/eLife.54603
     
  • Functional Interpretation of Single-Cell Similarity Maps. D. DeTomaso*, M. Jones*, M. Subramanian, J. Ye, N. Yosef. Nature Communications 2019; 10(1):4376. doi: 10.1038/s41467-019-12235-0
     
  • Reconstructing B cell receptor sequences from short-read single cell RNA-sequencing with BRAPeS. S. Afik, G. Raulet, and N. Yosef. Life Science Alliance 2019; 2(4). doi: 10.26508/lsa.201900371
     
  • SymSim: simulating multi-faceted variability in Single Cell RNA sequencing. X. Zhang, C. Xu, N. Yosef. Nature Communications 2019; 10(1):2611. doi: 10.1038/s41467-019-10500-w
     
  • Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq. MB. Cole, D. Risso, A. Wagner, D. DeTomaso, J. Ngai, E. Purdom, S. Dudoit†, N. Yosef. Cell Systems 2019; 8(4):315-328.e8. doi: 10.1016/j.cels.2019.03.010
     
  • Connectivity Problems on Heterogeneous Graphs. J. Wu, A. Khodaverdian, B. Weitz, N. Yosef. Algorithms Mol Biol. 2019; 14:5. doi: 10.1186/s13015-019-0141-z
     
  • A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements. R. Lopez*, A. Nazaret*, M. Langevin*, J. Samaran*, J. Regier*, M.I. Jordan, N. Yosef. ICML 2019 Workshop on Computational Biology
     
  • Detecting Zero-Inflated Genes in Single-Cell Transcriptomics Data. O. Clivio, R. Lopez, J. Regier, A. Gayoso, MI. Jordan, N. Yosef. MLCB 2019
     
  • Deep Generative Modeling for Single-cell Transcriptomics. R. Lopez, J. Regier, MB. Cole, M. Jordan, N. Yosef. Nature Methods 2018; 15(12):1053-8. doi: 10.1038/s41592-018-0229-2
     
  • Impulse model-based differential expression analysis of time course sequencing data. DS. Fischer, FJ. Theis, N. Yosef. Nucleic Acids Research 2018; 46(20):e119. doi: 10.1093/nar/gky675

  • Targeted reconstruction of T cell receptor sequence from single cell RNA-seq links CDR3 length to differentiation state. S. Afik, K. Yates, K. Bi, S. Darko, J. Godec, U. Gerdemann, L. Swadling, DC. Douek, P. Klenerman, EJ. Barnes, AH. Sharpe, N. Haining†, N. Yosef†. Nucleic Acids Research 2017. doi: 10.1093/nar/gkx615

  • ImpulseDE: detection of differentially expressed genes in time series data using impulse models. J. Sander, J. Schultze, N. Yosef. Bioinformatics 2016. pii: btw665.

  • FastProject: A Tool for Low-Dimensional Analysis of Single-Cell RNA-Seq Data. D. DeTomaso, N. Yosef. BMC Bioinformatics 2016; 17(1):315. doi: 10.1186/s12859-016-1176-5