Sorting Points Into Neighborhoods (SPIN)

Abstract

Exploratory data analysis is critical in a broad range of research areas, where large collections of data need to be meaningfully arranged and presented. I have developed SPIN, a novel method for the organization and visualization of data, implemented in a simple tool. SPIN uses properties of distance matrices to sort objects into a natural ordering that highlights the underlying structure of the original, multidimensional data. The relationships between objects can be inferred from the reordered distance matrix generated by SPIN. As an unsupervised analysis tool, SPIN does not rely on any external labels, but rather explores the inherent characteristics of the data. SPIN has been successfully used in the analysis of high-throughput biological experiments. In such experiments, discretely labelled data, such as clinical labels of 'sick' versus 'healthy', are traditionally organized by various clustering approaches. However, when the objects are characterized by continuous variables, e.g. survival intervals of patients or expression levels of genes, any sharp separation into distinct clusters will be rather arbitrary. Thus, a different organization approach, one which emphasizes ordering rather than grouping, could be more relevant. In several cases, the structure uncovered by SPIN has a clear biological interpretation, such as the cyclic nature of cell-cycle progression, visualized in a ring conformation. In another example, the tissue composition of tested samples is captured by their relative placement in an ordered, elongated cluster, formed in the space of tissue-specific genes. Finally, the general applicability of SPIN makes it relevant to diverse scientific disciplines.

The method

Intuition

Features of sorted distance matrices: unsorted (middle row) and sorted (bottom row) distance matrices of the four simple objects shown in the top row.
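
As a minimal sketch of what "unsorted" and "sorted" mean here, the Python snippet below builds a Euclidean distance matrix for a toy object (five points on a straight line) and applies one permutation to both its rows and columns. The helper names `distance_matrix` and `reorder` are illustrative assumptions and are not part of the SPIN tool, and the ordering is simply read off the known geometry rather than found by SPIN. For this toy case the sorted matrix has distances that grow steadily away from the main diagonal.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between points (a list of coordinate tuples)."""
    return [[math.dist(p, q) for q in points] for p in points]

def reorder(dist, order):
    """Apply the same permutation to the rows and columns of a distance matrix."""
    return [[dist[i][j] for j in order] for i in order]

# Toy object (assumed for illustration): five points on a line, in shuffled order.
points = [(3.0, 0.0), (0.0, 0.0), (4.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
unsorted_dist = distance_matrix(points)

# Here the natural ordering is known in advance: position along the line.
# SPIN has to discover such an ordering from the distance matrix alone.
order = sorted(range(len(points)), key=lambda i: points[i][0])
sorted_dist = reorder(unsorted_dist, order)

for row in sorted_dist:
    print(" ".join(f"{d:.0f}" for d in row))
```

Reordering never changes the distances themselves, only where they appear in the matrix, which is why the sorted matrix can be read as a picture of the object's shape.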

Clustering Vs. Sorting

SPIN vs. Hierarchical Clustering
(a) A toy example. (b) Single-linkage dendrogram of the object. (c) The distance matrix sorted according to the dendrogram. (d) The distance matrix after sorting by SPIN.

Intersecting rods in 7 dimensions

Seven orthogonal cylinders in 7 dimensions, twisted with angles that increase linearly with the distance from the origin.

The colors of the points, ranging from dark blue to dark, reflect their random order.
The same object reordered by SPIN: the points in the PCA projection are colored according to their position in the distance matrix.
A simplified version composed of three straight intersecting rods in three dimensions. The numbered arrows illustrate the order imposed by SPIN. The region of the intersection creates blue patches in the off-diagonal regions of the distance matrix (denoted by a).

Examples

Colorectal Cancer

Gene expression profiles across several stages of the disease, from normal colon, through adenoma and carcinoma all the way to metastasis. Data taken from: Tsafrir et al., AACR 2004.

The 1000 highest-variance genes across the 144 samples: (a) original unsorted expression, (b) sorted expression, and (c) distance matrix after applying SPIN.
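
The gene filtering step mentioned in this caption can be sketched as follows. This is an illustrative Python example on made-up toy values, not the colorectal data, and the function name `top_variance_rows` is an assumption. It also assumes population variance as the ranking criterion, which the page does not specify.

```python
import statistics

def top_variance_rows(matrix, k):
    """Return the indices of the k rows with the highest variance, highest first."""
    variances = [statistics.pvariance(row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: variances[i], reverse=True)
    return ranked[:k]

# Hypothetical toy data: 6 "genes" (rows) measured over 4 "samples" (columns).
expression = [
    [1.0, 1.0, 1.0, 1.0],  # constant profile
    [0.0, 2.0, 0.0, 2.0],
    [0.0, 4.0, 0.0, 4.0],
    [1.0, 1.1, 0.9, 1.0],  # nearly constant profile
    [0.0, 6.0, 0.0, 6.0],
    [2.0, 2.0, 2.0, 2.0],  # constant profile
]

selected = top_variance_rows(expression, k=3)
filtered = [expression[i] for i in selected]
print(selected)
```

With real data the same call would use `k=1000` on a genes-by-samples matrix; genes that barely vary across samples carry little information about ordering, so they are dropped before the distance matrix is computed.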

Cell cycle

Yeast expression data taken from: Spellman et al., Molecular Biology of the Cell 9, 3273-3297 (1998). (a) Expression matrix obtained by sorting the genes using SPIN and ordering the samples according to time. (b) The sorted distance matrix reveals the interplay between genes associated with different stages of the cell cycle. (c) Projection of the genes onto the first and second principal components.
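
To illustrate why periodic expression shows up as a ring in the projection of panel (c), here is a small Python sketch using NumPy on synthetic data. Everything in it is an assumption made for illustration: the "genes" are noise-free cosine profiles with evenly spaced phases, `pca_project` is a hypothetical helper, and reading the order off the angle in the PCA plane is a stand-in, not the SPIN algorithm.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto the leading principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic "genes": cosine profiles over one period, each with its own phase,
# stored in shuffled order.
rng = np.random.default_rng(0)
n_genes, n_times = 12, 24
true_phase = rng.permutation(n_genes) * 2 * np.pi / n_genes
t = np.linspace(0.0, 2 * np.pi, n_times, endpoint=False)
profiles = np.cos(t[None, :] - true_phase[:, None])

coords = pca_project(profiles)

# All points sit at the same distance from the origin: a ring.
radii = np.linalg.norm(coords, axis=1)

# The angle around the ring recovers the cyclic order of the phases
# (up to the starting point and the direction of travel).
cyclic_order = np.argsort(np.arctan2(coords[:, 1], coords[:, 0]))
print(np.round(radii, 3))
print(cyclic_order)
```

A cyclic ordering has no natural first or last element, which is one reason a hard split into clusters would be arbitrary for data of this kind.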

We are currently in the process of filing a patent application for SPIN.