ComSegDict

class comseg.dictionary.ComSegDict(dataset=None, mean_cell_diameter=None, community_detection='with_prior', seed=None, disable_tqdm=False)

Bases: object

A dataset is often composed of many separate images, so many ComSeg graphs of RNAs have to be created. To ease the analysis of an entire dataset, we implement ComSegDict, a class that stores many ComSeg objects and allows analysis at the dataset scale. This class is implemented as a dictionary of ComSeg graph objects.
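To illustrate the "dictionary of graph objects" idea, here is a minimal sketch of a dictionary-backed container keyed by image name. The class name `GraphDict` and the helper `apply_to_all` are hypothetical and are not part of ComSeg; the sketch only shows the general pattern, not how ComSegDict is actually written.

```python
from collections.abc import MutableMapping


class GraphDict(MutableMapping):
    """Hypothetical stand-in for a dictionary-like container of per-image graphs."""

    def __init__(self):
        self._graphs = {}  # image name -> per-image object

    def __getitem__(self, image_name):
        return self._graphs[image_name]

    def __setitem__(self, image_name, graph):
        self._graphs[image_name] = graph

    def __delitem__(self, image_name):
        del self._graphs[image_name]

    def __iter__(self):
        return iter(self._graphs)

    def __len__(self):
        return len(self._graphs)

    def apply_to_all(self, func):
        """Run `func` on every stored object and collect the results by image name."""
        return {name: func(graph) for name, graph in self._graphs.items()}
```

Storing one object per image under its image name lets a dataset-level step be written as a loop over the container.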

__init__(dataset=None, mean_cell_diameter=None, community_detection='with_prior', seed=None, disable_tqdm=False)
Parameters:
  • dataset (ComSegDataset)

  • mean_cell_diameter (float) – the expected mean cell diameter in µm (default is 15 µm)

  • community_detection (str) – one of [“with_prior”, “louvain”]; “with_prior” is our graph partitioning / community detection method, which takes prior knowledge into account

  • seed (int) – (optional) seed for the graph partitioning initialization

  • prior_name (str) – (optional) name of the prior cell assignment column in the input CSV file. Nodes with the same prior label will be merged into a super node. Nodes with different prior labels cannot be merged during the modularity optimization.

concatenate_anndata()

Concatenate all community expression vectors from all the ComSeg graphs into a single AnnData object.

Returns:

anndata

Return type:

AnnData

compute_community_vector(k_nearest_neighbors: int = 10)

For every image in the dataset, this function creates a graph of RNAs and computes the community vectors.

Parameters:
  • self

  • k_nearest_neighbors (int) – number of nearest neighbors to consider for the graph creation
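As a rough illustration of what a k-nearest-neighbor graph over RNA positions looks like, the sketch below links each spot to its k closest spots with a brute-force search. The function name `knn_edges` and the toy coordinates are assumptions made for this example; ComSeg's own graph construction and edge weighting are not shown on this page and may differ.

```python
import math


def knn_edges(points, k):
    """Return directed edges (i, j, distance) linking each point to its k nearest neighbors."""
    edges = []
    for i, p in enumerate(points):
        # Distances from point i to every other point (brute force, O(n^2)).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            edges.append((i, j, d))
    return edges


# Toy spot coordinates (hypothetical data).
spots = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
edges = knn_edges(spots, k=2)
```

A smaller k yields fewer edges per spot, which is why reducing k_nearest_neighbors speeds up computation.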

Returns:

compute_insitu_clustering(size_commu_min=3, norm_vector=False, n_pcs=3, n_comps=3, clustering_method='leiden', n_neighbors=20, resolution=1, n_clusters_kmeans=4, palette=None, nb_min_cluster=0, min_merge_correlation=0.8, merge_cluster=True, sample_size_for_scTranform=5000)

Cluster together the RNA partition/community expression vectors from all the images in the dataset and identify the single-cell transcriptomic clusters present in the dataset.

#todo clean the name leiden vs leiden_merged aka clustering_method

#todo or add the current cluster name to use in the self so it is reused in add_cluster_id_to_graph

Parameters:
  • size_commu_min (int) – minimum number of RNAs in a community for it to be considered in the clustering

  • norm_vector (bool) – if True, the expression vectors will be normalized using the scTransform normalization parameters

  • n_pcs (int) – number of principal components to compute for the clustering; set to 0 to skip PCA

  • n_comps (int) – number of components to compute for the clustering; set to 0 to skip PCA

  • clustering_method (str) – one of [“leiden”, “kmeans”, “louvain”]

  • n_neighbors (int) – number of neighbors in the similarity graph

  • resolution (float) – resolution parameter for the Leiden/Louvain clustering

  • n_clusters_kmeans (int) – number of clusters for the kmeans clustering

  • palette (list[str]) – color palette for the clusters, as a list of hex colors

  • merge_cluster (bool) – if True, the clusters with a correlation > min_merge_correlation will be merged and the clustering method is renamed {clustering_method}_merged

  • min_merge_correlation (float) – minimum correlation required to merge clusters
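The merging rule controlled by merge_cluster and min_merge_correlation can be pictured with the sketch below, which groups clusters whose mean expression vectors have a Pearson correlation above the threshold. This is one plausible reading only: the function names, the use of mean vectors, and the transitive grouping via union-find are all assumptions, since the page does not describe how ComSeg performs the merge.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def merge_correlated_clusters(mean_vectors, min_merge_correlation=0.8):
    """Map each cluster id to a merged id.

    Clusters linked by a correlation above the threshold end up sharing
    the smallest id of their group (assumed, transitive merging).
    """
    parent = {c: c for c in mean_vectors}

    def find(c):
        # Follow parent links to the group representative (path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    ids = sorted(mean_vectors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if pearson(mean_vectors[a], mean_vectors[b]) > min_merge_correlation:
                ra, rb = find(a), find(b)
                parent[max(ra, rb)] = min(ra, rb)
    return {c: find(c) for c in ids}


# Toy mean expression vectors for three clusters (hypothetical data).
merged = merge_correlated_clusters({0: [10, 1, 0], 1: [9, 2, 0], 2: [0, 1, 8]})
```

In this toy input, clusters 0 and 1 have nearly identical profiles and are grouped, while cluster 2 stays separate.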

Returns:

add_cluster_id_to_graph(clustering_method='leiden_merged')

Add transcriptional cluster id to each RNA molecule in the graph

Parameters:
  • self

  • clustering_method (str) – clustering method used to get the community (kmeans, leiden_merged, louvain)

Returns:

classify_centroid(path_cell_centroid=None, n_neighbors=15, dict_in_pixel=True, max_dist_centroid=None, key_pred='leiden_merged', distance='ngb_distance_weights', file_extension='tiff.npy', centroid_csv_key={'x': 'x', 'y': 'y', 'z': 'z'})

Classify cell centroids based on the labels of their neighboring RNAs, as assigned by add_cluster_id_to_graph().

Parameters:
  • path_cell_centroid – if the centroids were already computed by the ComSegDataset class from prior masks, leave it None. Otherwise, path_cell_centroid is a path to the folder containing the centroid dictionary {self.prior_name{z:,y:,x:}} for each image. Each centroid dictionary has to be stored in a file in npy format, named after the image. Centroids can also be stored in a CSV file with the following columns: “x”, “y”, “z”, “self.prior_name”, where the prior names are the cell indices

  • n_neighbors (int) – number of neighbors to consider for the classification of the centroid (default 15)

  • dict_in_pixel (bool) – if True, the centroids are in the same scale as the input CSV of spot coordinates and are rescaled with dict_scale; if False, the centroids are in µm (default True)

  • max_dist_centroid (int) – maximum distance to consider for the centroid (default None)

  • key_pred (str) – key of the node attribute containing the cluster id (default “leiden_merged”)

  • convex_hull_centroid (bool) – check that the cell centroid is in the convex hull of its RNA neighbors (default True). If not, the cell centroid is not classified, to avoid artefactual misclassification

  • file_extension (str) – file extension of the centroid dictionary

  • centroid_csv_key (dict) – column names of the centroid CSV file
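The general idea of labeling a centroid from its neighboring RNAs can be sketched as a nearest-neighbor majority vote. The function below is a simplified illustration with hypothetical names; it uses an unweighted vote and Euclidean distance, whereas the signature above suggests that ComSeg can weight neighbors by distance, and it omits the convex hull check.

```python
import math
from collections import Counter


def classify_centroid_by_neighbors(centroid, rna_positions, rna_labels,
                                   n_neighbors=15, max_dist=None):
    """Return the most common label among the nearest RNAs, or None if none qualifies."""
    # Rank every RNA by its Euclidean distance to the centroid.
    ranked = sorted(
        (math.dist(centroid, pos), label)
        for pos, label in zip(rna_positions, rna_labels)
    )
    nearest = ranked[:n_neighbors]
    if max_dist is not None:
        # Drop neighbors that are too far away to be informative.
        nearest = [(d, lab) for d, lab in nearest if d <= max_dist]
    if not nearest:
        return None
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]


# Toy RNA positions and cluster labels (hypothetical data).
positions = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
labels = ["A", "A", "B", "B", "B"]
label_near = classify_centroid_by_neighbors((0.2, 0.2), positions, labels, n_neighbors=3)
```

Returning None when no neighbor lies within the distance limit mirrors the idea of leaving a centroid unclassified rather than forcing a label.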

Returns:

associate_rna2landmark(key_pred='leiden_merged', distance='distance', max_cell_radius=100)

Associate RNAs with landmarks based on both the transcriptomic landscape and the distance between the RNAs and the landmark centroids.

Parameters:
  • key_pred (str) – key of the node attribute containing the cluster id (default “leiden_merged”)

  • super_node_prior_key (str)

  • max_cell_radius (float) – maximum distance between a cell centroid and an RNA for them to be associated (default 100)
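A simplified picture of combining transcriptomic label and distance is sketched below: each RNA is attached to the closest centroid that carries the same cluster label, provided it lies within max_cell_radius. The function name, the data layout, and the use of straight-line distance are assumptions made for this illustration; ComSeg's actual association procedure is not detailed on this page.

```python
import math


def associate_rna_to_centroids(rnas, centroids, max_cell_radius=100.0):
    """Assign each RNA to a cell id, or None.

    rnas: list of (position, cluster) tuples.
    centroids: dict mapping cell_id -> (position, cluster).
    """
    assignment = []
    for pos, cluster in rnas:
        best_cell, best_dist = None, max_cell_radius
        for cell_id, (c_pos, c_cluster) in centroids.items():
            if c_cluster != cluster:
                continue  # only consider centroids of the same transcriptomic cluster
            d = math.dist(pos, c_pos)
            if d <= best_dist:
                best_cell, best_dist = cell_id, d
        assignment.append(best_cell)
    return assignment


# Toy centroids and RNAs (hypothetical data).
cells = {1: ((0, 0), "A"), 2: ((10, 0), "B")}
rnas = [((1, 0), "A"), ((9, 0), "B"), ((9, 0), "A"), ((500, 0), "B")]
assigned = associate_rna_to_centroids(rnas, cells)
```

The third RNA sits next to cell 2 but is attached to cell 1 because only cell 1 shares its cluster label; the last RNA is beyond the radius and stays unassigned.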

Returns:

anndata_from_comseg_result(config: dict = None, min_rna_per_cell=5, return_polygon=True, alpha=0.5, allow_disconnected_polygon=False)

Return an anndata with the estimated expression vector of each cell in the dataset plus the spot positions.

Parameters:
  • self

  • config (dict) – dictionary of parameters to overwrite the default parameters, default is None

  • min_rna_per_cell (int) – minimum number of RNA to consider a cell

  • return_polygon (bool) – if True, return the polygons of the cells; the polygons are computed using the alphashape library

  • alpha (float) – alpha parameter used to compute the alphashape polygon: https://pypi.org/project/alphashape/. alpha is between 0 and 1; 1 corresponds to the convex hull of the cell

  • allow_disconnected_polygon – if True, allow disconnected polygons
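The "expression vector of each cell" can be thought of as a per-cell gene count built from the RNA-to-cell assignments, with small cells filtered out. The sketch below shows that counting step only; the function name, the input layout, and the reading of min_rna_per_cell as an inclusive threshold are assumptions, and no AnnData object or polygon is produced here.

```python
from collections import Counter, defaultdict


def expression_vectors(rna_genes, rna_cells, gene_list, min_rna_per_cell=5):
    """Count RNAs per gene for each cell; keep cells with enough RNAs."""
    counts = defaultdict(Counter)
    for gene, cell in zip(rna_genes, rna_cells):
        if cell is not None:  # skip RNAs that were not assigned to any cell
            counts[cell][gene] += 1
    return {
        cell: [c[g] for g in gene_list]
        for cell, c in counts.items()
        if sum(c.values()) >= min_rna_per_cell
    }


# Toy assignments (hypothetical data): gene name and cell id for each RNA.
genes = ["g1", "g1", "g2", "g1", "g2", "g2"]
cell_ids = [1, 1, 1, 2, None, 2]
vectors = expression_vectors(genes, cell_ids, ["g1", "g2"], min_rna_per_cell=3)
```

Cell 1 keeps its vector because it holds three RNAs, while cell 2 is dropped with only two.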

Returns:

run_all(config: dict = None, k_nearest_neighbors: int = 10, max_cell_radius: float = 15, size_commu_min: int = 3, norm_vector: bool = False, n_pcs: int = 3, clustering_method: str = 'leiden', n_neighbors: int = 20, resolution: float = 1, n_clusters_kmeans=4, nb_min_cluster: int = 0, min_merge_correlation: float = 0.8, path_dataset_folder_centroid: str = None, file_extension: str = '.csv', disable_tqdm=False)

Run all the ComSeg steps in sequence: compute_community_vector(), compute_insitu_clustering(), add_cluster_id_to_graph(), classify_centroid(), associate_rna2landmark().

Parameters:
  • config (dict) – dictionary of parameters to overwrite the default parameters, default is None

  • k_nearest_neighbors (int) – number of nearest neighbors to consider for the KNN graph creation; reduce K to speed up computation

  • max_cell_radius (float) – maximum distance between a cell centroid and an RNA for them to be associated

  • size_commu_min (int) – minimum number of RNAs in a community for it to be considered in the clustering (default 3)

  • norm_vector (bool) – if True, the expression vectors will be normalized using the scTransform normalization parameters. The normalization requires the following R packages: sctransform, feather, arrow. Normalization is important for datasets with a high number of genes

  • n_pcs (int) – number of principal components to compute for the clustering of the RNA community expression vectors; set to 0 to skip PCA

  • clustering_method (str) – one of [“leiden”, “kmeans”, “louvain”]

  • n_neighbors (int) – number of neighbors in the similarity graph used to cluster the RNA community expression vectors

  • resolution (float) – resolution parameter for the in situ clustering step if louvain or leiden is used

  • n_clusters_kmeans (int) – number of clusters for the kmeans clustering, used when clustering_method = “kmeans”

  • nb_min_cluster (int) – minimum number of clusters to keep after merging the clusters

  • min_merge_correlation (float) – minimum correlation required to merge clusters in the in situ clustering

  • path_dataset_folder_centroid (str) – path to the folder containing the centroids, as a CSV file or a dictionary {cell : {z:,y:,x:}} for each image; use the same scale as the input CSV

  • file_extension (str) – file extension of the centroid dictionary (.npy) or CSV file (.csv)

  • disable_tqdm – if True, disable the tqdm progress bar

Returns:
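The order of the steps listed above can be sketched as a plain function that calls each method in turn on an object exposing them. The method names and keyword names are taken from the signatures on this page; the wrapper name `run_steps`, the subset of arguments it forwards, and the assumption that cluster merging is enabled (so the prediction key is `{clustering_method}_merged`) are illustrative choices, not a description of how run_all is implemented.

```python
def run_steps(comseg_dict, k_nearest_neighbors=10, clustering_method="leiden",
              max_cell_radius=15, path_centroid=None, file_extension=".csv"):
    """Call the ComSeg steps in the documented order on `comseg_dict`.

    `comseg_dict` is any object exposing the five methods used below.
    """
    # Assumes cluster merging is on, so labels live under "<method>_merged".
    key = f"{clustering_method}_merged"

    comseg_dict.compute_community_vector(k_nearest_neighbors=k_nearest_neighbors)
    comseg_dict.compute_insitu_clustering(clustering_method=clustering_method)
    comseg_dict.add_cluster_id_to_graph(clustering_method=key)
    comseg_dict.classify_centroid(
        path_cell_centroid=path_centroid,
        key_pred=key,
        file_extension=file_extension,
    )
    comseg_dict.associate_rna2landmark(key_pred=key, max_cell_radius=max_cell_radius)
    return comseg_dict
```

Calling the steps one by one like this is useful when an intermediate result, such as the clustering, needs to be inspected before the RNAs are associated with cells.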