{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cc293965",
   "metadata": {},
   "source": [
    "# Minimal example : run ComSeg with a config dictionary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a68e804b",
   "metadata": {},
   "source": [
    "In this tutorial we present a simplify way to use ComSeg  \n",
    "Download the test data for this tutorail at https://cloud.minesparis.psl.eu/index.php/s/HtYucchv9OGg6JN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1907843e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "66d19bec",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib\n",
    "import comseg\n",
    "import numpy as np\n",
    "import random\n",
    "import tifffile\n",
    "import importlib\n",
    "from comseg import dataset as ds\n",
    "from comseg import dictionary\n",
    "import scanpy\n",
    "%matplotlib inline\n",
    "import importlib\n",
    "from pathlib import Path"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2d11826",
   "metadata": {},
   "source": [
    "Parameters for ComSeg can be gathered in a single configuration dictionary. Below we give a minimal example of configuration dictionary ```config```. A comprehensive and documented version of this config dictionary is detailed at the end of this tutorial.\n",
    "\n",
    "Except for class instantiation, ComSeg functions accept a configuration dictionary as their sole argument. The values in ```config``` will override default values or any other provided arguments.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7ebc6898",
   "metadata": {},
   "outputs": [],
   "source": [
    "#### HYPERPARAMETER ####\n",
    "MEAN_CELL_DIAMETER = 15  # in micrometer\n",
    "MAX_CELL_RADIUS = 50  # in micrometer\n",
    "#########################\n",
    "\n",
    "## here we present an extensive example of how to run ComSeg with \n",
    "\n",
    "\n",
    "path_dataset_folder = \"/home/tom/Bureau/test_set_tutorial_comseg/small_df\"\n",
    "path_to_mask_prior = \"/home/tom/Bureau/test_set_tutorial_comseg/mask\"\n",
    "\n",
    "\n",
    "config = {\n",
    "    ### dataset initialisation\n",
    "    \"dict_scale\" : {\"x\": 0.103, 'y': 0.103, \"z\": 0.3},\n",
    "    \"mean_cell_diameter\" : MEAN_CELL_DIAMETER,\n",
    "    \"gene_column\" : \"gene\",\n",
    "    ### prior computation (if not already availble in external csv file)\n",
    "    \"prior_name\":'in_nucleus',\n",
    "    \"overwrite\":True,\n",
    "    \"compute_centroid\":True,\n",
    "    \"max_cell_radius\": MAX_CELL_RADIUS,\n",
    "    ### final result \n",
    "    \"alpha\" :  0.5,\n",
    "    \"min_rna_per_cell\" : 5,\n",
    "    \"allow_disconnected_polygon\":True,\n",
    "    \"disable_tqdm\" : True,\n",
    "    'norm_vector': True,\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23f05930",
   "metadata": {},
   "source": [
    "### Run ComSeg with configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "57be879a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "config dict overwritting default parameters\n",
      "prior added to 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004 and saved in csv file\n",
      "dict_centroid added for 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004 \n",
      "prior added to 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006 and saved in csv file\n",
      "dict_centroid added for 07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006 \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:352: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  for node in self.community_anndata.obs[\"node_index\"][comm_index]:\n",
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:354: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  nn_expression_vector = nn_expression_vector / len(self.community_anndata.obs[\"node_index\"][comm_index])\n",
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:352: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  for node in self.community_anndata.obs[\"node_index\"][comm_index]:\n",
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:354: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  nn_expression_vector = nn_expression_vector / len(self.community_anndata.obs[\"node_index\"][comm_index])\n",
      "/home/tom/anaconda3/envs/ComSeg_env_py10/lib/python3.10/site-packages/anndata/_core/anndata.py:1818: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.\n",
      "  utils.warn_names_duplicates(\"obs\")\n",
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/clustering.py:298: UserWarning: param_sctransform is none, expression vector are not normalized\n",
      "  warnings.warn('param_sctransform is none, expression vector are not normalized')\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of cluster 9\n",
      "number of cluster after merging 9\n",
      "07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_004\n",
      "07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006\n",
      "config dict overwritting the default parameters\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:790: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  dict_polygon[anndata.obs[\"CellID\"][cell_index]] = alpha_shape\n",
      "/home/tom/Bureau/phd/simulation/ComSeg_pkg/src/comseg/model.py:790: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  dict_polygon[anndata.obs[\"CellID\"][cell_index]] = alpha_shape\n"
     ]
    }
   ],
   "source": [
    "dataset = ds.ComSegDataset(\n",
    "    path_dataset_folder=path_dataset_folder,\n",
    "    dict_scale=config[\"dict_scale\"],\n",
    "    mean_cell_diameter=config[\"mean_cell_diameter\"],\n",
    "    gene_column=config[\"gene_column\"],\n",
    "    path_to_mask_prior=path_to_mask_prior,\n",
    "    prior_name=config[\"prior_name\"],\n",
    "    disable_tqdm=config[\"disable_tqdm\"]\n",
    ")\n",
    "\n",
    "## if not already in the csv file \n",
    "dataset.add_prior_from_mask(config=config)\n",
    "dict_proba_edge = dataset.compute_edge_weight(config=config)\n",
    "\n",
    "Comsegdict = dictionary.ComSegDict(\n",
    "    dataset=dataset,\n",
    "    mean_cell_diameter=config[\"mean_cell_diameter\"],\n",
    "        disable_tqdm=config[\"disable_tqdm\"]\n",
    "\n",
    ")\n",
    "\n",
    "Comsegdict.run_all(config=config)\n",
    "anndata_comseg, json_dict = Comsegdict.anndata_from_comseg_result(\n",
    "config=config\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17c96fe0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "b42bafee",
   "metadata": {},
   "source": [
    "### Plot result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "77ae3753",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<Figure size 1500x1500 with 1 Axes>,\n",
       " <Axes: title={'center': 'cell_index_pred'}>)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from comseg.utils import plot \n",
    "img_name = \"07_CtrlNI_Pdgfra-Cy3_Serpine1-Cy5_006\"\n",
    "G = Comsegdict[img_name].G\n",
    "nuclei = tifffile.imread(\n",
    "    path_to_mask_prior + f'/{img_name}.tiff')\n",
    "\n",
    "plot.plot_result(G=G,\n",
    "            nuclei = nuclei,\n",
    "               key_node = 'cell_index_pred',\n",
    "                title = None,\n",
    "                dico_cell_color = None,\n",
    "                figsize=(15, 15),\n",
    "                spots_size = 10,\n",
    "                plot_outlier = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ac4433f",
   "metadata": {},
   "source": [
    "### Comprensive description of the configuration dictionnary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "fce9736c",
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "    ### dataset initialisation\n",
    "    \"dict_scale\" : {\"x\": 0.103, 'y': 0.103, \"z\": 0.3},\n",
    "    \"mean_cell_diameter\" : MEAN_CELL_DIAMETER,\n",
    "    \"gene_column\" : \"gene\",\n",
    "    ### prior computation \n",
    "    \"prior_name\":'in_nucleus',\n",
    "    \"overwrite\":True,\n",
    "    \"compute_centroid\":True,\n",
    "    ### CO-EXPRESSION COMPUTATION\n",
    "    \"n_neighbors\" : 40,\n",
    "    \"sampling\" : True,\n",
    "    \"sampling_size\": 10000,    \n",
    "    ### KNN GRPAH \n",
    "    'k_nearest_neighbors': 10,\n",
    "    'prior_name' : 'in_nucleus',\n",
    "    ### IN SITU CLUSTERING\n",
    "    'size_commu_min': 3,\n",
    "    'norm_vector': True,\n",
    "    'n_pcs': 4,\n",
    "    'clustering_method': 'leiden',\n",
    "    'n_neighbors': 20,\n",
    "    'resolution': 1,\n",
    "    'n_clusters_kmeans': 5,\n",
    "    'nb_min_cluster': 1,\n",
    "    'min_merge_correlation': 0.9,\n",
    "    # RNA ASSIGMENT\n",
    "    \"max_cell_radius\": MAX_CELL_RADIUS,\n",
    "    ### final result \n",
    "    \"return_polygon\":False,\n",
    "    \"allow_disconnected_polygon\" : False,\n",
    "    \"alpha\" :  0.5,\n",
    "    \"min_rna_per_cell\" : 5,\n",
    "    \"disable_tqdm\" : False\n",
    "\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba850d2d",
   "metadata": {},
   "source": [
    "#### dataset initialisation\n",
    "- ```dict_scale``` : dictionary containing the pixel/voxel size of the images in µm, default is {\"x\": 0.103, 'y': 0.103, \"z\": 0.3}. Use to convert the detected spots coordinates in µm. \n",
    "- ```mean_cell_diameter```: the expected mean cell diameter in µm default is 15µm\n",
    "- ```gene_column``` : name of the column containing the gene name in the csv files\n",
    "#### computation of prior from segmentation\n",
    "- ```prior_name``` : name of the column to add in the csv files containing the prior label of each spot when computing prior from .tiff segmentation file\n",
    "- ```overwrite``` \n",
    "- ```compute_centroid```: if True, compute the centroid of each cell/nucleus in .tiff segmentation mask to use it for RNA-cell association\n",
    "\n",
    "#### Co-expression computation\n",
    "- ```n_neighbors``` : maximum number of neighbors default is 40\n",
    "- ```sampling```  : if True, sample the dataset to compute the co-expression weigth\n",
    "- ```sampling_size```:if sampling is True : number of proximity weighted expression vector to sample\n",
    "\n",
    "#### knn graph generation \n",
    "- ``k_nearest_neighbors``: number of nearest neighbors to consider for the KNN graph creation, reduce K to speed computation\n",
    "        :type k_nearest_neighbors: int\n",
    "\n",
    "#### in-situ clustering \n",
    "- `size_commu_min`: This parameter is the minimum number of RNA in a community to be considered for the clustering (default is 3). It is of type `int`.\n",
    "- `norm_vector`: If set to True, the expression vector will be normalized using the scTRANSFORM normalization parameters. The normalization requires the following R packages: sctransform, feather, arrow. The normalization is important to do on a dataset with a high number of genes. It is of type `bool`.\n",
    "- `n_pcs`: This parameter is the number of principal components to compute for the clustering of the RNA communities expression vector. Set to 0 if no PCA is required. It is of type `int`.\n",
    "- `clustering_method`: This parameter is used to choose the clustering method. Options include \"leiden\", \"kmeans\", \"louvain\". It is of type `str`.\n",
    "- `n_neighbors`: This parameter is the number of neighbors similarity graph of the RNA communities expression vector clustering. It is of type `int`.\n",
    "- `resolution`: This parameter is the resolution parameter for the in-situ-clustering step if louvain or leiden are used. It is of type `float`.\n",
    "- `n_clusters_kmeans`: This parameter is the number of clusters for the kmeans clustering for `clustering_method` = \"kmeans\". It is of type `int`.\n",
    "- `nb_min_cluster`: This parameter is the minimum number of clusters to keep after the merge of the clusters. It is of type `int`.\n",
    "- `min_merge_correlation`: This parameter is the minimum correlation to merge clusters in the in situ clustering. It is of type `float`.\n",
    "\n",
    "#### rna association to centroid\n",
    "- `path_dataset_folder_centroid`: This parameter is the path to the folder containing the centroid in a csv or dictionary {cell : {z:,y:,x:}} for each image, use the same scale than then input csv. It is of type `str`.\n",
    "- `file_extension`: This parameter is the file extension of the centroid dictionary (.npy) or csv file (.csv). It is of type `str`.        \n",
    "- `max_cell_radius`: This parameter is the maximum distance between a cell centroid and an RNA to be associated. It is of type `float`.\n",
    "\n",
    "#### final result \n",
    "- `min_rna_per_cell`: This parameter represents the minimum number of RNA to consider a cell. It is of type `int`.\n",
    "- `return_polygon`: If set to True, the function will return the polygon of the cells. The polygons are computed using the alphashape library. It is of type `bool`.\n",
    "- `alpha`: This parameter is used to compute the alphashape polygon. Alpha is between 0 and 1, where 1 corresponds to the convex hull of the cell. More details can be found at [alphashape](https://pypi.org/project/alphashape/). It is of type `float`.\n",
    "- `allow_disconnected_polygon`: If set to True, the function will allow disconnected polygons. It is of type `bool`.\n",
    "\n",
    "#### final result \n",
    "- `disable_tqdm` disable tqdm print  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35b24ce8",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ComSeg_env_py10",
   "language": "python",
   "name": "comseg_env_py10"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}