Methods

This page describes the methods for curating 3CA, defining meta-programs and generating the figures on this website. More details can be found in Gavish et al.1, and code is available at the 3ca GitHub repository2.

Data curation

A total of 77 Single cell RNA-seq (scRNA-seq) datasets, which constitute 1456 samples/tumors, were curated for this project. Some of these have not been made available on the initial version of this website, namely unpublished datasets and datasets for which we did not obtain permission from the authors of the original studies. This cohort will be expanded continuously by adding new published scRNA-seq datasets.  

We first constructed a list of potentially relevant studies through pubmed search, and continuously updated it through literature review. Each study was then examined for the type and amount of scRNA-seq data generated in it, prioritizing studies with data for multiple patient tumors and including a reasonable fraction of malignant cells. When seeking to add each new dataset to our cohort, we initially checked whether the data is publicly available for downloading (e.g. via the Gene Expression Omnibus database repository). Most publications freely provided the data in the form of an expression-matrix, together with a list of associated genes and cells (barcodes). Several datasets were only available upon author consent, is which case we contacted the authors for permission. The cohort also includes unpublished and published datasets that were previously analyzed in our lab, and that were already sequenced, aligned and processed by our collaborators or ourselves.

Most of the external datasets we downloaded, even when freely available, did not include the cell annotations presented in the published manuscript. In these cases, we contacted the leading authors and requested that they provide us with the annotations they used. In many cases, annotations of the original study were either not provided by the authors or were limited (e.g. not distinguishing between malignant and non-malignant epithelial cells), in which case we inferred the annotations ourselves. 

Verification of cell annotation 

For each dataset – both for author-based annotations and for the annotations that we defined – we performed the following analyses to ensure the validity of the annotations. First, we performed dimensionality reduction using UMAP and examined whether the cells annotated as different cell types clustered separately. Second, we validated the annotations by verifying that the top differentially expressed genes of each cell type match known marker genes. Finally, we inferred Copy Number Aberrations (CNAs) (using the package available at https://github.com/jlaffy/infercna), to verify that the cells annotated as cancer are indeed malignant. Some samples in which we could not resolve the annotations, possibly due to low data quality, were excluded during this process and are not shown in the website. 

Format for annotated scRNA-seq datasets

To facilitate the utility and uniformity of the data, we employed the following consistent file formats for storing annotated scRNA-seq data:

  1. The scRNA-seq expression matrix, with genes as the rows and cells as the columns. Since expression matrices are typically sparse, they were saved in mtx format which occupies less memory space. The suffix of the file names indicates whether the expression data is in units of TPM or UMI (e.g. for data sequenced by smart seq2 or 10X platforms, respectively).
  2. The Genes.txt files contains the genes names corresponding to the matrix rows.  
  3. The Cells.txt files contains the cell names/barcodes, the samples to which each cell belongs and the cell annotations. In some cases where all cells were presumably malignant (e.g. in cell lines), the annotations were not specified. Extra information on each sample is sometimes provided when available (e.g. gender, treatment status etc.).

Data pre-processing

Before conducting downstream analysis, we performed the following preprocessing steps:

  1. Cell filtering: We excluded cells with a low number of detected genes (#genes). For 10X data our cutoff was typically #genes < 1000. For smart-seq2 data we typically used a higher cutoff of #genes < 2000 genes. For other methods we adapted the cutoff and in some cases used lower that 1000 genes, but never lower than the cutoff used by the original the study.
  2. Sample filtering: After cell filtering, we excluded samples having fewer than 10 malignant cells. Where relevant, we also excluded samples with unclear CNA patterns. In total, 1163 samples were used for downstream analysis.
  3. Gene filtering: For most analysis we kept the top ~7000 genes with the highest mean across all cells.
  4. Normalization: UMI counts were converted to CPM (counts per million). For most analysis each entry in the matrix was then normalized according to x=log2(x/10+1). The values were divided by 10 since the actual complexity is assumed to be in the realm of ~100000 and not 1 million as implied by the CPM and TPM measures.
  5. For most analysis the data was centered (each gene was centered across all cells). Centering was done separately for each study.  

Defining meta-programs (MPs)  

Our goal was to define recurrent meta-programs of intra-tumor heterogeneity (ITH) shared across multiple tumors of different types. To this end, we performed Non-negative matrix factorization (NMF) for each sample separately, to generate programs that capture the heterogeneity with each sample. Since application of NMF requires a “K” parameter that influences the results, we ran NMF using different values (K=4,5,6,7,8,9), thereby generating 39 programs for each tumor. Each NMF program was summarized by the top 50 genes based on NMF coefficients. We reasoned that the most meaningful NMF programs are those that would recur across different values of K as well as across tumors; such programs (denoted here as robust NMF programs) were defined by the following three criteria:

  1. Recurs within the tumor: NMF programs that were represented in more than one K value. Two NMF programs were considered as similar if they had at least 70% overlap (shared 35 out of 50 genes).
  2. Recurs across tumors: NMF programs that had at least 20% similarity with any NMF program in any of the other tumors analyzed. 
  3. Non-redundant within the tumor: within a tumor, NMF programs were ranked by their similarity (gene overlap) with NMFs from other tumors and selected in decreasing order. Once an NMF was selected, any other NMF within the tumor that had 20% overlap (or more) with the selected NMF was removed, to avoid redundancy. 
     

This approach yielded 5547 robust NMF programs.

We next clustered the robust NMF programs according to Jaccard similarity. The clustering was performed using a custom written R program, so that the ultimate output not only defined the NMFs that constituted each cluster, but also determined the 50 genes defining the MP shared between the NMFs in the cluster. This approach yielded 67 MPs. We further removed MPs that were: (i) suspected to reflect low quality data and had a strong enrichment of either ribosomal protein genes or mitochondrial-encoded genes, or (ii) included NMF programs from only a single study, even though that study reflected a cancer type that is also examined by additional studies, or (iii) suspected to reflect doublet cells as it was highly similar to the expression profile of a non-malignant cell type (e.g. T-cells or macrophages). We retained 41 MPs that were further grouped by functional similarities into 11 hallmarks.

(See the Methods section of Gavish et al.1 for more details).

Website figures

The UMAP plots for each dataset were generated as follows. First, the dataset was filtered by removing genes with average log (base 2) TPM or CPM of less than 4. Where relevant, this was done per sample, and the union of all surviving genes across samples was used for the final filtered matrix. This matrix was then centered by gene, and from this, 50 estimated principal components were computed using IRLBA (via the irlba R package). UMAP was applied to these 50 estimated PCs, with the n_neighbors parameter varying according to the number of cells in the dataset.

The marker gene plots were generated using a manually curated list of marker genes. For each dataset, only those cell types having at least 10 cells were included in the plot. Mean expression levels and percentage of expressing cells were computed per cell type.
 

References

  1. Avishai Gavish, Michael Tyler, Alissa C. Greenwald, Rouven Hoefflin, Dor Simkin, Roi Tschernichovsky, Noam Galili Darnell, Einav Somech, Chaya Barbolin, Tomer Antman, Daniel Kovarsky, Thomas Barrett, L. Nicolas Gonzalez Castro, Debdatta Halder, Rony Chanoch-Myers, Julie Laffy, Michael Mints, Adi Wider, Rotem Tal, Avishay Spitzer, Toshiro Hara, Maria Raitses-Gurevich, Chani Stossel, Talia Golan, Amit Tirosh, Mario L. Suvà, Sidharth V. Puram, and Itay Tirosh. Hallmarks of Transcriptional Intratumour Heterogeneity across a Thousand Tumours. Nature 1–9 (2023). doi:10.1038/s41586-023-06130-4.

  2. Michael Tyler, Avishai Gavish. (2023). tiroshlab/3ca: First release (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.7688626