otscomics

otscomics.C_index(D: ndarray, clusters: ndarray) → float

Compute the C index, a measure of how well the pairwise distances reflect the ground truth clusters. It is implemented here for reference; the silhouette score (also known as the Average Silhouette Width) is a more standard metric for this purpose.

Parameters

D (np.ndarray) – The pairwise distances.
clusters (np.ndarray) – The ground truth clusters.

Returns

The C index.

Return type

float
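
For orientation, the textbook definition of the C index can be sketched in a few lines of NumPy. This is a minimal illustration under the usual definition, (S_w - S_min) / (S_max - S_min), and is not necessarily identical to the implementation in otscomics; the name c_index_sketch and the toy data are hypothetical. Under this definition, values close to 0 indicate that within-cluster distances are among the smallest in the matrix.

```python
import numpy as np

def c_index_sketch(D, clusters):
    """Textbook C index (hypothetical helper, not the otscomics code)."""
    D = np.asarray(D, dtype=float)
    clusters = np.asarray(clusters)
    # Each unordered pair of samples exactly once (upper triangle).
    i, j = np.triu_indices(D.shape[0], k=1)
    pair_dists = D[i, j]
    same = clusters[i] == clusters[j]
    n_w = int(same.sum())              # number of within-cluster pairs
    s_w = pair_dists[same].sum()       # sum of within-cluster distances
    sorted_d = np.sort(pair_dists)
    s_min = sorted_d[:n_w].sum()       # sum of the n_w smallest distances
    s_max = sorted_d[len(sorted_d) - n_w:].sum()  # sum of the n_w largest
    # Undefined (division by zero) when s_max == s_min, e.g. no within-cluster pair.
    return (s_w - s_min) / (s_max - s_min)

# Toy data: four points on a line forming two well-separated groups.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
good = c_index_sketch(D, np.array([0, 0, 1, 1]))  # labels match the groups
bad = c_index_sketch(D, np.array([0, 1, 0, 1]))   # labels cut across the groups
```

With labels that match the two groups the index is 0, while labels that cut across the groups give a value close to 1.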

otscomics.OT_distance_matrix(data: ndarray, cost: ndarray, eps: float = 0.1, dtype: torch.dtype = torch.double, device: str = 'cuda', divide_max: bool = False, numItermax: int = 500, stopThr: float = 1e-05, batch_size: int = 200) → ndarray

Compute the pairwise Optimal Transport distance matrix. Sinkhorn divergences are computed using POT’s implementation of the Sinkhorn algorithm. Computations run in PyTorch on the specified device, but the result is returned as a NumPy array, which avoids saturating GPU memory for large matrices.

Parameters

data (np.ndarray) – The input data, as a numpy array.
cost (np.ndarray) – The ground cost between features.
eps (float, optional) – The entropic regularization parameter. Small regularization requires more iterations and double precision. Defaults to 0.1.
dtype (torch.dtype, optional) – The torch dtype used for computations. Double is more precise but takes up more space. Defaults to torch.double.
device (str, optional) – The torch device to compute on, typically ‘cpu’ or ‘cuda’. Defaults to ‘cuda’.
divide_max (bool, optional) – Whether to divide the resulting matrix by its maximum value. This can be useful to compare matrices. Defaults to False.
numItermax (int, optional) – Used by POT, maximum number of Sinkhorn iterations. Defaults to 500.
stopThr (float, optional) – Used by POT, tolerance for early stopping in the Sinkhorn iterations. Defaults to 1e-5.
batch_size (int, optional) – The batch size, i.e. how many distances can be computed at the same time. Should be as large as possible on your hardware. Defaults to 200.

Returns

The pairwise OT distance matrix.

Return type

np.ndarray
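
To make the quantity behind each matrix entry concrete, here is a small NumPy-only sketch of a Sinkhorn divergence between two histograms. It is an illustration under stated assumptions, not the otscomics or POT code: the debiased form OT(a, b) - (OT(a, a) + OT(b, b)) / 2 is one common definition and the text above does not say which variant is used; the functions sinkhorn_cost and sinkhorn_divergence are hypothetical; and the loop runs a fixed number of iterations instead of stopping early on a tolerance such as stopThr.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iter=500):
    """Transport cost <P, C> of the entropic plan P (hypothetical helper)."""
    K = np.exp(-C / eps)               # Gibbs kernel; smaller eps makes it sharper
    u = np.ones_like(a)
    for _ in range(n_iter):            # fixed iteration count, no early stopping
        v = b / (K.T @ u)              # rescale to match the column marginal b
        u = a / (K @ v)                # rescale to match the row marginal a
    P = u[:, None] * K * v[None, :]    # entropic transport plan
    return float(np.sum(P * C))

def sinkhorn_divergence(a, b, C, eps=0.1):
    """Debiased cost: subtract the two self-transport terms (assumed form)."""
    self_terms = sinkhorn_cost(a, a, C, eps) + sinkhorn_cost(b, b, C, eps)
    return sinkhorn_cost(a, b, C, eps) - 0.5 * self_terms

# Three features on a line, ground cost scaled to a maximum of 1.
pos = np.arange(3.0)
C = np.abs(pos[:, None] - pos[None, :]) / 2.0
a = np.array([0.8, 0.1, 0.1])          # histogram with mass on the first feature
b = np.array([0.1, 0.1, 0.8])          # histogram with mass on the last feature

d_ab = sinkhorn_divergence(a, b, C)
d_aa = sinkhorn_divergence(a, a, C)
```

The debiasing terms make the divergence of a histogram with itself vanish, which the plain entropic cost does not guarantee. A full distance matrix repeats this computation for every pair of samples, which is where batching on the chosen device becomes relevant.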

otscomics.cost_matrix(data: ndarray, cost: str = 'correlation', normalize_features: bool = True) → ndarray

Compute an empirical ground cost matrix, i.e. a pairwise distance matrix between the rows of the dataset (each row is L1-normalized by default). Accepted distances are those supported by SciPy’s cdist.

Parameters

data (np.ndarray) – The input data, samples as columns and features as rows.
cost (str, optional) – The metric to use. Defaults to ‘correlation’.
normalize_features (bool, optional) – Whether to divide the rows by their sum before computing distances. Defaults to True.

Returns

The pairwise cost matrix.

Return type

np.ndarray
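
As a rough illustration of the default behaviour described above, the sketch below divides each row by its sum and then computes the correlation distance (1 minus the Pearson correlation) between rows. It uses np.corrcoef in place of SciPy’s cdist to stay dependency-free, so it only covers the correlation metric; the name cost_matrix_sketch and the toy data are hypothetical, and this is not the otscomics implementation.

```python
import numpy as np

def cost_matrix_sketch(data, normalize_features=True):
    """Correlation distance between rows (hypothetical helper)."""
    X = np.asarray(data, dtype=float)
    if normalize_features:
        # Divide each row (feature) by its sum, i.e. L1 normalization
        # for non-negative data.
        X = X / X.sum(axis=1, keepdims=True)
    # np.corrcoef treats each row as a variable, so the result is
    # a (features x features) matrix of 1 - Pearson correlation.
    return 1.0 - np.corrcoef(X)

# Features as rows, samples as columns.
data = np.array([
    [1.0, 2.0, 3.0],   # feature 0
    [2.0, 4.0, 6.0],   # feature 1: proportional to feature 0
    [3.0, 2.0, 1.0],   # feature 2: reversed trend
])
C = cost_matrix_sketch(data)
```

Here features 0 and 1 end up at distance 0, and feature 2 sits at distance 2 from both, the maximum for the correlation metric.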