{ "cells": [ { "cell_type": "markdown", "id": "cc293965", "metadata": {}, "source": [ "# Minimal example: run ComSeg with a config dictionary" ] }, { "cell_type": "markdown", "id": "a68e804b", "metadata": {}, "source": [ "In this tutorial we present a simplified way to use ComSeg.\n", "\n", "Download the test data for this tutorial at https://cloud.minesparis.psl.eu/index.php/s/HtYucchv9OGg6JN" ] }, { "cell_type": "code", "execution_count": 1, "id": "1907843e", "metadata": {}, "outputs": [], "source": [ "import sys" ] }, { "cell_type": "code", "execution_count": 2, "id": "66d19bec", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib\n", "import comseg\n", "import numpy as np\n", "import random\n", "import tifffile\n", "import importlib\n", "from comseg import dataset as ds\n", "from comseg import dictionary\n", "import scanpy\n", "%matplotlib inline\n", "from pathlib import Path" ] }, { "cell_type": "markdown", "id": "c2d11826", "metadata": {}, "source": [ "Parameters for ComSeg can be gathered in a single configuration dictionary. Below we give a minimal example of a configuration dictionary, ```config```. A comprehensive and documented version of this config dictionary is detailed at the end of this tutorial.\n", "\n", "Except for class instantiation, ComSeg functions accept a configuration dictionary as their sole argument. The values in ```config``` will override default values or any other provided arguments. 
\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "7ebc6898", "metadata": {}, "outputs": [], "source": [ "#### HYPERPARAMETER ####\n", "MEAN_CELL_DIAMETER = 15 # in micrometers\n", "MAX_CELL_RADIUS = 50 # in micrometers\n", "#########################\n", "\n", "## minimal example of how to run ComSeg with a config dictionary\n", "\n", "\n", "path_dataset_folder = \"/home/tom/Bureau/test_set_tutorial_comseg/small_df\"\n", "path_to_mask_prior = \"/home/tom/Bureau/test_set_tutorial_comseg/mask\"\n", "\n", "\n", "config = {\n", " ### dataset initialisation\n", " \"dict_scale\" : {\"x\": 0.103, 'y': 0.103, \"z\": 0.3},\n", " \"mean_cell_diameter\" : MEAN_CELL_DIAMETER,\n", " \"gene_column\" : \"gene\",\n", " ### prior computation (if not already available in an external csv file)\n", " \"prior_name\":'in_nucleus',\n", " \"overwrite\":True,\n", " \"compute_centroid\":True,\n", " \"max_cell_radius\": MAX_CELL_RADIUS,\n", " ### final result \n", " \"alpha\" : 0.5,\n", " \"min_rna_per_cell\" : 5,\n", " \"allow_disconnected_polygon\":True,\n", " \"disable_tqdm\" : True,\n", " 'norm_vector': True,\n", " }" ] }, { "cell_type": "markdown", "id": "23f05930", "metadata": {}, "source": [ "### Run ComSeg with configuration dictionary." ] }, { "cell_type": "code", "execution_count": 13, "id": "57be879a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "config dict overwritting default parameters\n", "prior added to 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004 and saved in csv file\n", "dict_centroid added for 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004 \n", "prior added to 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006 and saved in csv file\n", "dict_centroid added for 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006 \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:352: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. 
In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n", " for node in self.community_anndata.obs[\"node_index\"][comm_index]:\n", "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:354: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n", " nn_expression_vector = nn_expression_vector / len(self.community_anndata.obs[\"node_index\"][comm_index])\n", "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:352: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n", " for node in self.community_anndata.obs[\"node_index\"][comm_index]:\n", "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:354: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n", " nn_expression_vector = nn_expression_vector / len(self.community_anndata.obs[\"node_index\"][comm_index])\n", "/home/tom/anaconda3/envs/ComSeg_env_py10/lib/python3.10/site-packages/anndata/_core/anndata.py:1818: UserWarning: Observation names are not unique. 
To make them unique, call `.obs_names_make_unique`.\n", " utils.warn_names_duplicates(\"obs\")\n", "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/clustering.py:298: UserWarning: param_sctransform is none, expression vector are not normalized\n", " warnings.warn('param_sctransform is none, expression vector are not normalized')\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "number of cluster 9\n", "number of cluster after merging 9\n", "07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004\n", "07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006\n", "config dict overwritting the default parameters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:790: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n", " dict_polygon[anndata.obs[\"CellID\"][cell_index]] = alpha_shape\n", "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:790: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). 
To access a value by position, use `ser.iloc[pos]`\n", " dict_polygon[anndata.obs[\"CellID\"][cell_index]] = alpha_shape\n" ] } ], "source": [ "dataset = ds.ComSegDataset(\n", " path_dataset_folder=path_dataset_folder,\n", " dict_scale=config[\"dict_scale\"],\n", " mean_cell_diameter=config[\"mean_cell_diameter\"],\n", " gene_column=config[\"gene_column\"],\n", " path_to_mask_prior=path_to_mask_prior,\n", " prior_name=config[\"prior_name\"],\n", " disable_tqdm=config[\"disable_tqdm\"]\n", ")\n", "\n", "## if not already in the csv file \n", "dataset.add_prior_from_mask(config=config)\n", "dict_proba_edge = dataset.compute_edge_weight(config=config)\n", "\n", "Comsegdict = dictionary.ComSegDict(\n", " dataset=dataset,\n", " mean_cell_diameter=config[\"mean_cell_diameter\"],\n", " disable_tqdm=config[\"disable_tqdm\"]\n", "\n", ")\n", "\n", "Comsegdict.run_all(config=config)\n", "anndata_comseg, json_dict = Comsegdict.anndata_from_comseg_result(\n", "config=config\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "17c96fe0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b42bafee", "metadata": {}, "source": [ "### Plot result" ] }, { "cell_type": "code", "execution_count": 14, "id": "77ae3753", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(
,\n", " )" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from comseg.utils import plot \n", "img_name = \"07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006\"\n", "G = Comsegdict[img_name].G\n", "nuclei = tifffile.imread(\n", " path_to_mask_prior + f'/{img_name}.tiff')\n", "\n", "plot.plot_result(G=G,\n", " nuclei = nuclei,\n", " key_node = 'cell_index_pred',\n", " title = None,\n", " dico_cell_color = None,\n", " figsize=(15, 15),\n", " spots_size = 10,\n", " plot_outlier = False)" ] }, { "cell_type": "markdown", "id": "3ac4433f", "metadata": {}, "source": [ "### Comprehensive description of the configuration dictionary" ] }, { "cell_type": "code", "execution_count": 10, "id": "fce9736c", "metadata": {}, "outputs": [], "source": [ "config = {\n", " ### dataset initialisation\n", " \"dict_scale\" : {\"x\": 0.103, 'y': 0.103, \"z\": 0.3},\n", " \"mean_cell_diameter\" : MEAN_CELL_DIAMETER,\n", " \"gene_column\" : \"gene\",\n", " ### prior computation \n", " \"prior_name\":'in_nucleus',\n", " \"overwrite\":True,\n", " \"compute_centroid\":True,\n", " ### CO-EXPRESSION COMPUTATION\n", " \"n_neighbors\" : 40,\n", " \"sampling\" : True,\n", " \"sampling_size\": 10000, \n", " ### KNN GRAPH \n", " 'k_nearest_neighbors': 10,\n", " 'prior_name' : 'in_nucleus',\n", " ### IN SITU CLUSTERING\n", " 'size_commu_min': 3,\n", " 'norm_vector': True,\n", " 'n_pcs': 4,\n", " 'clustering_method': 'leiden',\n", " # note: 'n_neighbors' is already set above; in a Python dict literal the later value (20) replaces the earlier one (40)\n", " 'n_neighbors': 20,\n", " 'resolution': 1,\n", " 'n_clusters_kmeans': 5,\n", " 'nb_min_cluster': 1,\n", " 'min_merge_correlation': 0.9,\n", " ### RNA ASSIGNMENT\n", " \"max_cell_radius\": MAX_CELL_RADIUS,\n", " ### final result \n", " \"return_polygon\":False,\n", " \"allow_disconnected_polygon\" : False,\n", " \"alpha\" : 0.5,\n", " \"min_rna_per_cell\" : 5,\n", " \"disable_tqdm\" : False\n", "\n", " }" ] }, { "cell_type": "markdown", "id": "ba850d2d", "metadata": {}, "source": [ "#### dataset initialisation\n", "- ```dict_scale``` : dictionary 
containing the pixel/voxel size of the images in µm, default is {\"x\": 0.103, 'y': 0.103, \"z\": 0.3}. Used to convert the detected spot coordinates to µm. \n", "- ```mean_cell_diameter``` : the expected mean cell diameter in µm (default is 15 µm)\n", "- ```gene_column``` : name of the column containing the gene name in the csv files\n", "\n", "#### computation of prior from segmentation\n", "- ```prior_name``` : name of the column added to the csv files to store the prior label of each spot when the prior is computed from a .tiff segmentation file\n", "- ```overwrite``` \n", "- ```compute_centroid``` : if True, compute the centroid of each cell/nucleus in the .tiff segmentation mask and use it for RNA-cell association\n", "\n", "#### Co-expression computation\n", "- ```n_neighbors``` : maximum number of neighbors (default is 40)\n", "- ```sampling``` : if True, sample the dataset to compute the co-expression weight\n", "- ```sampling_size``` : if ```sampling``` is True, number of proximity-weighted expression vectors to sample\n", "\n", "#### knn graph generation \n", "- `k_nearest_neighbors`: This parameter is the number of nearest neighbors to consider when building the KNN graph; reduce it to speed up computation. It is of type `int`.\n", "\n", "#### in-situ clustering \n", "- `size_commu_min`: This parameter is the minimum number of RNAs a community must contain to be considered for the clustering (default is 3). It is of type `int`.\n", "- `norm_vector`: If set to True, the expression vectors will be normalized using the scTRANSFORM normalization parameters. The normalization requires the following R packages: sctransform, feather, arrow. Normalization is important for datasets with a high number of genes. It is of type `bool`.\n", "- `n_pcs`: This parameter is the number of principal components to compute for the clustering of the RNA community expression vectors. Set to 0 if no PCA is required. It is of type `int`.\n", "- `clustering_method`: This parameter is used to choose the clustering method. 
Options include \"leiden\", \"kmeans\", \"louvain\". It is of type `str`.\n", "- `n_neighbors`: This parameter is the number of neighbors used to build the similarity graph for clustering the RNA community expression vectors. It is of type `int`.\n", "- `resolution`: This parameter is the resolution parameter for the in-situ clustering step if louvain or leiden is used. It is of type `float`.\n", "- `n_clusters_kmeans`: This parameter is the number of clusters for the kmeans clustering when `clustering_method` = \"kmeans\". It is of type `int`.\n", "- `nb_min_cluster`: This parameter is the minimum number of clusters to keep after the clusters are merged. It is of type `int`.\n", "- `min_merge_correlation`: This parameter is the minimum correlation required to merge clusters in the in-situ clustering. It is of type `float`.\n", "\n", "#### rna association to centroid\n", "- `path_dataset_folder_centroid`: This parameter is the path to the folder containing the centroids, as a csv or a dictionary {cell : {z:,y:,x:}} for each image; use the same scale as the input csv. It is of type `str`.\n", "- `file_extension`: This parameter is the file extension of the centroid dictionary (.npy) or csv file (.csv). It is of type `str`. \n", "- `max_cell_radius`: This parameter is the maximum distance between a cell centroid and an RNA for them to be associated. It is of type `float`.\n", "\n", "#### final result \n", "- `min_rna_per_cell`: This parameter is the minimum number of RNAs required for a cell to be considered. It is of type `int`.\n", "- `return_polygon`: If set to True, the function will return the polygon of the cells. The polygons are computed using the alphashape library. It is of type `bool`.\n", "- `alpha`: This parameter is used to compute the alphashape polygon. Alpha is between 0 and 1, where 1 corresponds to the convex hull of the cell. More details can be found at [alphashape](https://pypi.org/project/alphashape/). 
It is of type `float`.\n", "- `allow_disconnected_polygon`: If set to True, the function will allow disconnected polygons. It is of type `bool`.\n", "- `disable_tqdm`: If set to True, the tqdm progress bars are disabled. It is of type `bool`." ] }, { "cell_type": "code", "execution_count": null, "id": "35b24ce8", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "ComSeg_env_py10", "language": "python", "name": "comseg_env_py10" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }