Minimal example: run ComSeg with a config dictionary
[1]:
import sys
[2]:
import pandas as pd
import matplotlib
import comseg
import numpy as np
import random
import tifffile
import importlib
from comseg import dataset as ds
from comseg import dictionary
import scanpy
%matplotlib inline
from pathlib import Path
Parameters for ComSeg can be gathered in a single configuration dictionary. Below is a minimal example of such a dictionary, named config. A comprehensive, documented version is given at the end of this tutorial.
Except for class instantiation, ComSeg functions accept the configuration dictionary as their sole argument. The values in config override the default values and any other arguments provided.
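The override rule stated above can be pictured with a small stand-alone sketch. The function resolve_parameters and its default values are hypothetical and are not part of the ComSeg API; they only illustrate a config dictionary taking precedence over both defaults and explicit keyword arguments.

```python
def resolve_parameters(config=None, **kwargs):
    """Hypothetical helper (not ComSeg API): merge defaults, kwargs and config."""
    # illustrative defaults, chosen for this sketch only
    defaults = {"alpha": 0.5, "min_rna_per_cell": 5}
    # explicit keyword arguments override the defaults
    params = {**defaults, **kwargs}
    if config is not None:
        # keep only the keys this function knows about; config wins over everything
        params.update({k: v for k, v in config.items() if k in params})
    return params


config = {"alpha": 0.3, "disable_tqdm": True}
resolved = resolve_parameters(config=config, alpha=0.9, min_rna_per_cell=10)
print(resolved)
```

In this sketch alpha ends up with the value from config, while min_rna_per_cell keeps the explicitly passed value because config does not mention it.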
[11]:
#### HYPERPARAMETER ####
MEAN_CELL_DIAMETER = 15 # in micrometer
MAX_CELL_RADIUS = 50 # in micrometer
#########################
## paths to the dataset folder (csv files) and to the prior segmentation masks (.tiff files)
path_dataset_folder = "/home/tom/Bureau/test_set_tutorial_comseg/small_df"
path_to_mask_prior = "/home/tom/Bureau/test_set_tutorial_comseg/mask"
config = {
### dataset initialisation
"dict_scale" : {"x": 0.103, 'y': 0.103, "z": 0.3},
"mean_cell_diameter" : MEAN_CELL_DIAMETER,
"gene_column" : "gene",
### prior computation (if not already available in an external csv file)
"prior_name":'in_nucleus',
"overwrite":True,
"compute_centroid":True,
"max_cell_radius": MAX_CELL_RADIUS,
### final result
"alpha" : 0.5,
"min_rna_per_cell" : 5,
"allow_disconnected_polygon":True,
"disable_tqdm" : True,
'norm_vector': True,
}
Run ComSeg with the configuration dictionary.
[13]:
dataset = ds.ComSegDataset(
path_dataset_folder=path_dataset_folder,
dict_scale=config["dict_scale"],
mean_cell_diameter=config["mean_cell_diameter"],
gene_column=config["gene_column"],
path_to_mask_prior=path_to_mask_prior,
prior_name=config["prior_name"],
disable_tqdm=config["disable_tqdm"]
)
## compute the prior from the masks if it is not already in the csv files
dataset.add_prior_from_mask(config=config)
dict_proba_edge = dataset.compute_edge_weight(config=config)
Comsegdict = dictionary.ComSegDict(
dataset=dataset,
mean_cell_diameter=config["mean_cell_diameter"],
disable_tqdm=config["disable_tqdm"]
)
Comsegdict.run_all(config=config)
anndata_comseg, json_dict = Comsegdict.anndata_from_comseg_result(
config=config
)
config dict overwritting default parameters
prior added to 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004 and saved in csv file
dict_centroid added for 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004
prior added to 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006 and saved in csv file
dict_centroid added for 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:352: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
for node in self.community_anndata.obs["node_index"][comm_index]:
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:354: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
nn_expression_vector = nn_expression_vector / len(self.community_anndata.obs["node_index"][comm_index])
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:352: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
for node in self.community_anndata.obs["node_index"][comm_index]:
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:354: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
nn_expression_vector = nn_expression_vector / len(self.community_anndata.obs["node_index"][comm_index])
/home/tom/anaconda3/envs/ComSeg_env_py10/lib/python3.10/site-packages/anndata/_core/anndata.py:1818: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/clustering.py:298: UserWarning: param_sctransform is none, expression vector are not normalized
warnings.warn('param_sctransform is none, expression vector are not normalized')
number of cluster 9
number of cluster after merging 9
07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004
07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006
config dict overwritting the default parameters
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:790: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
dict_polygon[anndata.obs["CellID"][cell_index]] = alpha_shape
/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:790: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
dict_polygon[anndata.obs["CellID"][cell_index]] = alpha_shape
Plot result
[14]:
from comseg.utils import plot
img_name = "07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006"
G = Comsegdict[img_name].G
nuclei = tifffile.imread(
path_to_mask_prior + f'/{img_name}.tiff')
plot.plot_result(G=G,
nuclei = nuclei,
key_node = 'cell_index_pred',
title = None,
dico_cell_color = None,
figsize=(15, 15),
spots_size = 10,
plot_outlier = False)
[14]:
(<Figure size 1500x1500 with 1 Axes>,
<Axes: title={'center': 'cell_index_pred'}>)
Comprehensive description of the configuration dictionary
[10]:
config = {
### dataset initialisation
"dict_scale" : {"x": 0.103, 'y': 0.103, "z": 0.3},
"mean_cell_diameter" : MEAN_CELL_DIAMETER,
"gene_column" : "gene",
### prior computation
"prior_name":'in_nucleus',
"overwrite":True,
"compute_centroid":True,
### CO-EXPRESSION COMPUTATION
"n_neighbors" : 40,
"sampling" : True,
"sampling_size": 10000,
### KNN GRAPH
'k_nearest_neighbors': 10,
'prior_name' : 'in_nucleus', # duplicate of the key set above (same value)
### IN SITU CLUSTERING
'size_commu_min': 3,
'norm_vector': True,
'n_pcs': 4,
'clustering_method': 'leiden',
'n_neighbors': 20, # same key as in CO-EXPRESSION COMPUTATION: in a dict literal this last value (20) replaces 40
'resolution': 1,
'n_clusters_kmeans': 5,
'nb_min_cluster': 1,
'min_merge_correlation': 0.9,
### RNA ASSIGNMENT
"max_cell_radius": MAX_CELL_RADIUS,
### final result
"return_polygon":False,
"allow_disconnected_polygon" : False,
"alpha" : 0.5,
"min_rna_per_cell" : 5,
"disable_tqdm" : False
}
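The dictionary above lists n_neighbors and prior_name twice. Python accepts repeated keys in a dict literal without raising an error and keeps only the last value, so it can be worth checking a config for duplicates. The sketch below uses only the standard library; the helper duplicated_keys is hypothetical and not part of ComSeg.

```python
import ast
from collections import Counter

source = """
config = {
    "n_neighbors": 40,   # co-expression computation
    "n_neighbors": 20,   # in situ clustering
    "prior_name": "in_nucleus",
    "prior_name": "in_nucleus",
}
"""


def duplicated_keys(code):
    """Return the constant keys that appear more than once in any dict literal."""
    duplicates = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Dict):
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            duplicates.update(k for k, n in Counter(keys).items() if n > 1)
    return sorted(duplicates)


namespace = {}
exec(source, namespace)
print(namespace["config"]["n_neighbors"])  # the last value wins
print(duplicated_keys(source))
```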
dataset initialisation
dict_scale: dictionary containing the pixel/voxel size of the images in µm; default is {"x": 0.103, "y": 0.103, "z": 0.3}. Used to convert the detected spot coordinates to µm.
mean_cell_diameter: the expected mean cell diameter in µm; default is 15 µm.
gene_column: name of the column containing the gene name in the csv files.
computation of prior from segmentation
prior_name: name of the column to add to the csv files; it contains the prior label of each spot when the prior is computed from a .tiff segmentation file.
overwrite
compute_centroid: if True, compute the centroid of each cell/nucleus in the .tiff segmentation mask and use it for RNA-cell association.
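As an illustration of what dict_scale is used for, the sketch below converts a spot from pixel/voxel indices to µm and expresses the mean cell diameter in pixels. The function spot_to_micrometer and the spot coordinates are made up for this example; ComSeg performs its own conversion internally.

```python
# µm per pixel/voxel, same values as in the config above
dict_scale = {"x": 0.103, "y": 0.103, "z": 0.3}


def spot_to_micrometer(spot, scale):
    """Convert a spot given in pixel/voxel indices to µm, axis by axis."""
    return {axis: spot[axis] * scale[axis] for axis in ("z", "y", "x")}


spot_px = {"z": 10, "y": 100, "x": 200}  # made-up spot coordinates
spot_um = spot_to_micrometer(spot_px, dict_scale)

# the expected mean cell diameter (15 µm) expressed in xy pixels
mean_cell_diameter_px = 15 / dict_scale["x"]
print(spot_um, round(mean_cell_diameter_px, 1))
```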
Co-expression computation
n_neighbors: maximum number of neighbors; default is 40.
sampling: if True, sample the dataset to compute the co-expression weights.
sampling_size: if sampling is True, number of proximity-weighted expression vectors to sample.
knn graph generation
k_nearest_neighbors: number of nearest neighbors to consider when building the KNN graph. Reduce it to speed up computation. It is of type int.
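To make the role of k_nearest_neighbors concrete, here is a brute-force KNN graph on a few made-up 2D points. It is only a sketch of the general technique, not the construction ComSeg uses; the function knn_graph is hypothetical. The graph holds k edges per point, which is why a smaller k reduces the work in later steps.

```python
import math


def knn_graph(points, k):
    """Brute-force KNN: map each point index to its k closest other indices."""
    graph = {}
    for i, p in enumerate(points):
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph


# made-up 2D spot coordinates in µm
spots = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
graph = knn_graph(spots, k=2)
print(graph)
```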
in-situ clustering
size_commu_min: minimum number of RNAs in a community for it to be considered in the clustering (default is 3). It is of type int.
norm_vector: if set to True, the expression vectors are normalized using the scTRANSFORM normalization parameters. The normalization requires the following R packages: sctransform, feather, arrow. Normalization is important for datasets with a high number of genes. It is of type bool.
n_pcs: number of principal components to compute for the clustering of the RNA community expression vectors. Set to 0 if no PCA is required. It is of type int.
clustering_method: clustering method to use. Options include "leiden", "kmeans", "louvain". It is of type str.
n_neighbors: number of neighbors used to build the similarity graph for clustering the RNA community expression vectors. It is of type int.
resolution: resolution parameter for the in situ clustering step if louvain or leiden is used. It is of type float.
n_clusters_kmeans: number of clusters for the kmeans clustering, used when clustering_method = "kmeans". It is of type int.
nb_min_cluster: minimum number of clusters to keep after the clusters are merged. It is of type int.
min_merge_correlation: minimum correlation required to merge clusters in the in situ clustering. It is of type float.
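The idea behind min_merge_correlation can be sketched as follows: clusters whose mean expression vectors correlate above the threshold are grouped together. This is an assumption-laden illustration, not ComSeg's implementation; the functions pearson and merge_clusters and the expression vectors are made up, nb_min_cluster is ignored, and constant vectors (zero variance) are not handled.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length, non-constant vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return cov / norm


def merge_clusters(centroids, min_merge_correlation):
    """Map each cluster label to the representative label of its merged group."""
    labels = list(centroids)
    parent = {label: label for label in labels}

    def find(label):
        # follow parent links up to the representative of the group
        while parent[label] != label:
            label = parent[label]
        return label

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if pearson(centroids[a], centroids[b]) > min_merge_correlation:
                parent[find(b)] = find(a)
    return {label: find(label) for label in labels}


# made-up mean expression vectors (one value per gene) for three clusters
centroids = {"0": [10, 1, 0, 2], "1": [9, 1, 1, 2], "2": [0, 8, 9, 1]}
merged = merge_clusters(centroids, min_merge_correlation=0.9)
print(merged)
```

Here clusters "0" and "1" have nearly identical profiles and are merged, while cluster "2" stays on its own.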
rna association to centroid
path_dataset_folder_centroid: path to the folder containing, for each image, the centroids as a csv file or a dictionary {cell : {z:,y:,x:}}. Use the same scale as the input csv. It is of type str.
file_extension: file extension of the centroid dictionary (.npy) or csv file (.csv). It is of type str.
max_cell_radius: maximum distance between a cell centroid and an RNA for them to be associated. It is of type float.
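The effect of max_cell_radius can be illustrated with a simplified nearest-centroid assignment. The sketch assumes plain Euclidean distance, which may differ from how ComSeg measures the distance between an RNA and a centroid; the function assign_to_centroid and all coordinates are made up.

```python
import math


def assign_to_centroid(rna_positions, centroids, max_cell_radius):
    """Assign each RNA to its nearest centroid, or None if it is too far away."""
    assignment = []
    for rna in rna_positions:
        cell, distance = min(
            ((cell, math.dist(rna, xyz)) for cell, xyz in centroids.items()),
            key=lambda pair: pair[1],
        )
        assignment.append(cell if distance <= max_cell_radius else None)
    return assignment


# made-up centroids and RNA positions, as (z, y, x) in µm
centroids = {1: (0.0, 10.0, 10.0), 2: (0.0, 100.0, 100.0)}
rnas = [(0.0, 12.0, 11.0), (0.0, 95.0, 90.0), (0.0, 300.0, 300.0)]
result = assign_to_centroid(rnas, centroids, max_cell_radius=50)
print(result)
```

The third RNA is more than 50 µm from every centroid, so it stays unassigned.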
final result
min_rna_per_cell: minimum number of RNAs required for a cell to be kept. It is of type int.
return_polygon: if set to True, the function returns the polygons of the cells. The polygons are computed using the alphashape library. It is of type bool.
alpha: parameter used to compute the alphashape polygon. Alpha is between 0 and 1, where 1 corresponds to the convex hull of the cell. More details can be found in the alphashape documentation. It is of type float.
allow_disconnected_polygon: if set to True, the function allows disconnected polygons. It is of type bool.
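The min_rna_per_cell threshold amounts to a simple count-based filter. The sketch below is illustrative only: the function filter_small_cells and the assignment list are made up, and None stands for an RNA that belongs to no cell.

```python
from collections import Counter


def filter_small_cells(cell_of_rna, min_rna_per_cell):
    """Keep cells with at least `min_rna_per_cell` RNAs; None marks unassigned RNAs."""
    counts = Counter(cell for cell in cell_of_rna if cell is not None)
    kept = {cell for cell, n in counts.items() if n >= min_rna_per_cell}
    return [cell if cell in kept else None for cell in cell_of_rna]


# made-up assignment: cell 1 has 6 RNAs, cell 2 has only 2
cell_of_rna = [1] * 6 + [2] * 2 + [None]
filtered = filter_small_cells(cell_of_rna, min_rna_per_cell=5)
print(filtered)
```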
disable_tqdm: if True, disable the tqdm progress bar output.