otscomics
- otscomics.C_index(D: ndarray, clusters: ndarray) float
Compute the C index, a measure of how well the pairwise distances reflect the ground-truth clusters. It is implemented here for reference; the silhouette score (also known as Average Silhouette Width) is a more standard metric for this purpose.
- Parameters
D (np.ndarray) – The pairwise distances.
clusters (np.ndarray) – The ground truth clusters.
- Returns
The C index.
- Return type
float
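To make the metric concrete, here is a minimal sketch of the C index as it is commonly defined (Hubert and Levin): the sum of within-cluster distances, rescaled between the smallest and largest sums attainable with the same number of pairs. The function name `c_index_sketch` is hypothetical, and this is an assumption about the definition, not necessarily how otscomics computes it. Under this definition, lower values indicate better agreement with the clusters.

```python
import numpy as np

def c_index_sketch(D, clusters):
    """Hypothetical reference implementation of the textbook C index."""
    D = np.asarray(D, dtype=float)
    clusters = np.asarray(clusters)

    # Each unordered pair (i < j) is counted once.
    i, j = np.triu_indices(D.shape[0], k=1)
    dists = D[i, j]

    # Pairs whose two samples share a ground-truth cluster.
    same = clusters[i] == clusters[j]
    n_w = int(same.sum())
    s_w = dists[same].sum()

    # Smallest and largest possible sums over n_w pairs.
    sorted_d = np.sort(dists)
    s_min = sorted_d[:n_w].sum()
    s_max = sorted_d[len(sorted_d) - n_w:].sum()

    # Note: undefined (division by zero) when s_max == s_min.
    return float((s_w - s_min) / (s_max - s_min))

# Toy example: four points on a line forming two tight groups.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])

good = c_index_sketch(D, np.array([0, 0, 1, 1]))  # 0.0, labels match the groups
bad = c_index_sketch(D, np.array([0, 1, 0, 1]))   # 18/19, about 0.947
print(good, bad)
```

With the package installed, the equivalent call would be `otscomics.C_index(D, clusters)`.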
- otscomics.OT_distance_matrix(data: ndarray, cost: ndarray, eps: float = 0.1, dtype: torch.dtype = torch.double, device: str = 'cuda', divide_max: bool = False, numItermax: int = 500, stopThr: float = 1e-05, batch_size: int = 200) ndarray
Compute the pairwise Optimal Transport distance matrix. Sinkhorn divergences are computed using POT’s implementation of the Sinkhorn algorithm. Computations run in PyTorch on the specified device, but the result is returned as a NumPy array, which avoids saturating the GPU for large matrices.
- Parameters
data (np.ndarray) – The input data, as a numpy array.
cost (np.ndarray) – The ground cost between features.
eps (float, optional) – The entropic regularization parameter. Smaller values require more Sinkhorn iterations and double precision. Defaults to 0.1.
dtype (torch.dtype, optional) – The torch dtype used for computations. Double is more precise but takes up more space. Defaults to torch.double.
device (str, optional) – The torch device to compute on, typically ‘cpu’ or ‘cuda’. Defaults to ‘cuda’.
divide_max (bool, optional) – Whether to divide the resulting matrix by its maximum value. This can be useful to compare matrices. Defaults to False.
numItermax (int, optional) – Used by POT, maximum number of Sinkhorn iterations. Defaults to 500.
stopThr (float, optional) – Used by POT, tolerance for early stopping in the Sinkhorn iterations. Defaults to 1e-5.
batch_size (int, optional) – The batch size, i.e. how many distances can be computed at the same time. Should be as large as possible on your hardware. Defaults to 200.
- Returns
The pairwise OT distance matrix.
- Return type
np.ndarray
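As an illustration of what a Sinkhorn divergence matrix contains, the sketch below uses plain NumPy instead of POT and PyTorch. Several things here are assumptions and not taken from the library: the helper names `sinkhorn_cost` and `pairwise_divergences_sketch`, the convention that samples are columns that each sum to 1, the use of the transport cost ⟨P, C⟩ without the entropy term, and the debiasing formula OT(a, b) − ½ OT(a, a) − ½ OT(b, b). It has no batching and no early stopping, so it only suits toy inputs.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iter=500):
    """Entropic OT cost <P, C> between histograms a and b (hypothetical helper)."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):       # alternate scaling of the two marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return float(np.sum(P * C))

def pairwise_divergences_sketch(data, C, eps=0.1):
    """Pairwise Sinkhorn divergences between the columns of `data`.

    Assumes features are rows, samples are columns, and every column
    is a strictly positive histogram summing to 1.
    """
    n = data.shape[1]
    # Self terms OT(a, a), computed once per sample.
    self_cost = [sinkhorn_cost(data[:, i], data[:, i], C, eps) for i in range(n)]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            cross = sinkhorn_cost(data[:, i], data[:, j], C, eps)
            D[i, j] = D[j, i] = cross - 0.5 * (self_cost[i] + self_cost[j])
    return D

# Two features with unit ground cost between them, three samples.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
data = np.array([[0.9, 0.1, 0.5],
                 [0.1, 0.9, 0.5]])
D = pairwise_divergences_sketch(data, C, eps=0.1)
print(D)
```

With the package installed, the documented signature suggests a call such as `otscomics.OT_distance_matrix(data, cost=C, eps=0.1, device='cpu')` on a machine without a GPU, since `device` defaults to ‘cuda’.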
- otscomics.cost_matrix(data: ndarray, cost: str = 'correlation', normalize_features: bool = True) ndarray
Compute an empirical ground cost matrix, i.e. a pairwise distance matrix between the rows of the dataset (l1-normalized by default). Accepted metrics are those supported by SciPy’s cdist.
- Parameters
data (np.ndarray) – The input data, samples as columns and features as rows.
cost (str) – The metric to use. Defaults to ‘correlation’.
normalize_features (bool, optional) – Whether to divide the rows by their sum before computing distances. Defaults to True.
- Returns
The pairwise cost matrix.
- Return type
np.ndarray
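The description above can be sketched with SciPy directly. The function name `cost_matrix_sketch` is hypothetical, and the body is an assumption based only on the documented behaviour (divide each row by its sum, then call `cdist` between the rows); the library's actual implementation may differ, for example in how it handles all-zero rows.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cost_matrix_sketch(data, cost="correlation", normalize_features=True):
    """Pairwise distances between features (rows of `data`); hypothetical sketch."""
    data = np.asarray(data, dtype=float)
    if normalize_features:
        # l1-normalize each feature; assumes no row sums to zero.
        data = data / data.sum(axis=1, keepdims=True)
    return cdist(data, data, metric=cost)

# Toy dataset: 5 features (rows) measured on 8 samples (columns).
rng = np.random.default_rng(0)
data = rng.random((5, 8)) + 0.1   # strictly positive, so row sums are nonzero

C = cost_matrix_sketch(data)      # shape (5, 5), one row and column per feature
print(C.shape)
```

The resulting matrix has one row and one column per feature, which is the shape expected for the `cost` argument of `otscomics.OT_distance_matrix`.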