Dataset

class comseg.dataset.ComSegDataset(path_dataset_folder, prior_name, path_to_mask_prior=None, mask_file_extension='.tiff', dict_scale={'x': 0.103, 'y': 0.103, 'z': 0.3}, mean_cell_diameter=15, gene_column='gene', image_csv_files: list = None, centroid_csv_files: list = None, path_cell_centroid=None, min_nb_rna_patch=None, disable_tqdm=False, config=None)

Bases: object

This class is in charge of:

  1. loading the CSV input

  2. computing the co-expression matrix at the dataset scale

  3. adding prior knowledge if available

The dataset class can be used like a dictionary where the keys are the CSV file names and the values are the corresponding CSV contents.

__init__(path_dataset_folder, prior_name, path_to_mask_prior=None, mask_file_extension='.tiff', dict_scale={'x': 0.103, 'y': 0.103, 'z': 0.3}, mean_cell_diameter=15, gene_column='gene', image_csv_files: list = None, centroid_csv_files: list = None, path_cell_centroid=None, min_nb_rna_patch=None, disable_tqdm=False, config=None)
Parameters:
  • path_dataset_folder (str) – path to the folder containing the CSV files

  • path_to_mask_prior (str) – path to the folder containing the mask priors. They must have the same names as the corresponding CSV files

  • mask_file_extension – file extension of the mask priors, default is “.tiff”

  • dict_scale (dict) – dictionary containing the pixel/voxel size of the images in µm, default is {“x”: 0.103, “y”: 0.103, “z”: 0.3}. It is used to convert the coordinates of the spots in the CSV files from pixels to µm. If your CSV files are already in µm, set dict_scale to {“x”: 1, “y”: 1, “z”: 1}

  • mean_cell_diameter (float) – the expected mean cell diameter in µm, default is 15 µm

  • gene_column (str) – name of the column containing the gene name in the CSV files

  • image_csv_files (list) – list of image CSV file names to consider in the dataset. If None, all the CSV files in the folder are considered

  • centroid_csv_files (list) – list of centroid CSV file names. It has to be in the same order as image_csv_files

  • min_nb_rna_patch – minimum number of RNA in a patch for the patch to be considered in the dataset

  • disable_tqdm – if True, disable the tqdm progress bar
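The pixel-to-µm conversion controlled by dict_scale can be sketched in plain Python as below. This is only an illustration of the arithmetic, not ComSeg's internal code: the helper name pixels_to_um and the example spot are hypothetical, and the scale values are the documented defaults.

```python
# Illustrative sketch only; not ComSeg's implementation.
# Documented default voxel size, in µm per pixel along each axis.
DICT_SCALE = {"x": 0.103, "y": 0.103, "z": 0.3}


def pixels_to_um(spot, dict_scale=DICT_SCALE):
    """Return a copy of `spot` with x, y, z rescaled from pixels to µm.

    `pixels_to_um` is a hypothetical helper used for illustration.
    """
    scaled = dict(spot)  # leave the input untouched
    for axis, voxel_size in dict_scale.items():
        scaled[axis] = spot[axis] * voxel_size
    return scaled


# A made-up spot, with coordinates in pixels.
spot_px = {"x": 100, "y": 200, "z": 10, "gene": "Gene_A"}
spot_um = pixels_to_um(spot_px)

# Coordinates already in µm: a scale of 1 on every axis leaves them unchanged.
identity = pixels_to_um(spot_px, {"x": 1, "y": 1, "z": 1})
```

With an identity scale the coordinates pass through unchanged, which is why the documentation suggests {“x”: 1, “y”: 1, “z”: 1} for CSV files that are already in µm.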

convert_spots_coord_in_pixel(overwrite=True)

Convert the coordinates of the spots in the CSV files from µm to pixels using the dict_scale attribute of the dataset object.

Parameters:

overwrite (bool) – if True, overwrite the CSV files with the new coordinates in pixels; otherwise save the new CSV files with the suffix “_pixel”

add_prior_from_mask(config=None, overwrite=False, compute_centroid=True, regex_df='*.csv')

This function adds prior knowledge to the dataset. It adds a column to the CSV files indicating the prior label of each spot: it takes the position of each spot and assigns the value of the segmentation mask prior (.tiff) at that position.

Parameters:
  • overwrite (bool) – if True, overwrite the prior_name column if it already exists

  • compute_centroid (bool) – if True, compute the centroid of each cell/nucleus in the segmentation mask to use it for RNA-cell association

Returns:

None
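The idea behind the mask lookup and the centroid computation can be sketched in plain Python as below. This is a minimal illustration, not ComSeg's implementation: the tiny 2D mask, the spots, the helper names prior_label and label_centroids, and the convention that rows index y and columns index x are all assumptions.

```python
# Illustrative sketch only; not ComSeg's implementation.
# Hypothetical 2D label mask: 0 is background, other integers are cell/nucleus labels.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]


def prior_label(mask, x, y):
    """Return the mask value at pixel (x, y); rows index y, columns index x (assumed)."""
    if 0 <= y < len(mask) and 0 <= x < len(mask[0]):
        return mask[y][x]
    return 0  # spots falling outside the mask get no prior


def label_centroids(mask):
    """Return {label: (mean_x, mean_y)} for every non-zero label in the mask."""
    sums = {}
    for y, row in enumerate(mask):
        for x, label in enumerate(row):
            if label != 0:
                sx, sy, n = sums.get(label, (0, 0, 0))
                sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


# Made-up spots in pixel coordinates; "prior" plays the role of the prior_name column.
spots = [
    {"x": 2, "y": 0, "gene": "A"},
    {"x": 0, "y": 3, "gene": "B"},
    {"x": 3, "y": 3, "gene": "A"},
]
for spot in spots:
    spot["prior"] = prior_label(mask, spot["x"], spot["y"])

centroids = label_centroids(mask)
```

Spots that land on background keep the label 0, so only spots inside a segmented cell or nucleus carry prior information.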

count_matrix_in_situ_from_knn(df_spots_label, n_neighbors=5, radius=None, remove_self_node=False, sampling=True, sampling_size=10000)

Compute the co-expression score matrix for the RNA spatial distribution

Parameters:
  • df_spots_label (pd.DataFrame) – dataframe with the columns x, y, z, gene. The coordinates are rescaled to µm using the dict_scale attribute of the dataset object

  • n_neighbors (int) – maximum number of neighbors, default is 5

  • radius – maximum radius of neighbors. It should be set proportionally to the expected cell size, default is radius = mean_cell_diameter / 2

  • sampling (bool) – if True, sample the dataset to compute the correlation

  • sampling_size (int) – if sampling is True, number of proximity-weighted expression vectors to sample

Returns:

count_matrix of shape (N_rna, n_genes), where n_genes is the number of unique genes in df_spots_label. Each row is an “RNA expression vector” summarizing the local expression neighborhood of a molecule

Return type:

np.array
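To make the notion of an “RNA expression vector” concrete, here is a brute-force sketch in plain Python. It is not ComSeg's implementation: it uses unweighted neighbor counts rather than proximity weights, skips the sampling step, and the function name neighborhood_count_matrix and the example spots are hypothetical.

```python
import math


def neighborhood_count_matrix(spots, genes, n_neighbors=5, radius=7.5,
                              remove_self_node=False):
    """Build one row per spot counting the genes of its nearest neighbors.

    Illustrative sketch only: plain counts, brute-force search, no sampling.
    `spots` is a list of (x, y, z, gene) tuples with coordinates in µm.
    """
    gene_index = {gene: i for i, gene in enumerate(genes)}
    matrix = []
    for i, (xi, yi, zi, _) in enumerate(spots):
        candidates = []
        for j, (xj, yj, zj, gene_j) in enumerate(spots):
            if remove_self_node and i == j:
                continue
            distance = math.dist((xi, yi, zi), (xj, yj, zj))
            if distance <= radius:
                candidates.append((distance, j, gene_j))
        candidates.sort()  # closest first, ties broken by spot index
        row = [0] * len(genes)
        for _, _, gene_j in candidates[:n_neighbors]:
            row[gene_index[gene_j]] += 1
        matrix.append(row)
    return matrix


# Made-up spots: three close together and one isolated.
genes = ["A", "B"]
spots = [(0, 0, 0, "A"), (1, 0, 0, "B"), (2, 0, 0, "A"), (50, 50, 0, "B")]
matrix = neighborhood_count_matrix(spots, genes, n_neighbors=2, radius=7.5,
                                   remove_self_node=True)
```

The radius of 7.5 µm corresponds to the documented default mean_cell_diameter / 2 with mean_cell_diameter = 15. The isolated spot has no neighbor within that radius, so its row stays empty.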

compute_edge_weight(config=None, images_subset=None, n_neighbors=40, radius=None, distance='pearson', sampling=True, sampling_size=10000, remove_self_node=False)

Compute the gene co-expression correlation at the dataset scale and save it in self.dict_co_expression.

Parameters:
  • config (dict) – dictionary of parameters to overwrite the default parameters, default is None

  • images_subset (list) – default None. If not None, only compute the correlation on the images in the list

  • n_neighbors (int) – number of neighbors to consider in the knn graph, default is 40

  • radius (float) – radius of the knn graph in µm, default is None. If None, radius = mean_cell_diameter / 2

  • distance (str) – one of [“pearson”, “spearman”], default is “pearson”

  • sampling (bool) – default True. If True, sample the dataset to compute the correlation

  • sampling_size – if sampling is True, number of proximity-weighted expression vectors to sample

Returns:

  • dico_proba_edge - a dictionary of dictionaries holding the correlation between genes: dict[gene_source][gene_target] = correlation

  • count_matrix - the count matrix used to compute the correlation

Return type:

dict, np.array
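The nested dictionary layout dict[gene_source][gene_target] = correlation can be illustrated with a small pure-Python Pearson computation over the columns of a count matrix. This is a sketch under assumptions, not ComSeg's code: the helper names pearson and co_expression, the toy count matrix, and the choice to return 0.0 for constant columns are all hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat undefined correlations as 0
    return cov / (norm_u * norm_v)


def co_expression(count_matrix, genes):
    """Return {gene_source: {gene_target: correlation}} from the matrix columns."""
    columns = {gene: [row[i] for row in count_matrix] for i, gene in enumerate(genes)}
    return {
        source: {target: pearson(columns[source], columns[target]) for target in genes}
        for source in genes
    }


# Toy count matrix: one row per RNA expression vector, one column per gene.
genes = ["A", "B", "C"]
count_matrix = [
    [1, 2, 3],
    [2, 4, 2],
    [3, 6, 1],
]
dico_proba_edge = co_expression(count_matrix, genes)
```

In this toy matrix, genes A and B rise together while C falls, so dico_proba_edge["A"]["B"] is close to 1 and dico_proba_edge["A"]["C"] is close to -1.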