palmettobug.Analysis_functions.SpatialAnalysis

The SpatialAnalysis class serves as a unifying class for the non-GUI API, coordinating all the spatial functions/methods in PalmettoBUG. It is not used in the GUI at present. Instead, the individual spatial classes are called by the GUI. These were made first, before the coordinating class, which is why there is a disconnect between the implementation in the GUI and outside it.

Classes

SpatialAnalysis

This class serves as a coordinating class for the three spatial analysis sub-classes. In the GUI, these subclasses are currently called directly (for historical reasons).

Module Contents

class palmettobug.Analysis_functions.SpatialAnalysis.SpatialAnalysis

This class serves as a coordinating class for the three spatial analysis sub-classes. In the GUI, these subclasses are currently called directly (for historical reasons and because there is no real reason to update that). However, for use of PalmettoBUG in scripting outside the GUI, it is convenient to have a unified class where all the spatial methods can be accessed.

The methods of this class are wrappers on methods of the 3 subclasses. Because of this, it can be divided into a number of groupings:

>>> add data, cell maps, neighbors, neighborhoods, spaceANOVA, and edt    (in order)

Args:

(None) – the key set-up steps in this class are called in the methods of this class, not when it is initialized

Key Attributes:

exp:: This is the connected Analysis object, containing the anndata (exp.data) which holds most of the data used for calculations
SpaceANOVA, neighbors, edt:: These are subclasses that contain the actual methods used by this higher-level class. The methods of this class are all wrappers on methods contained in one of these sub-classes.

edt

SpaceANOVA

neighbors

add_Analysis(Analysis) → None: Connects the Spatial methods here to a palmettobug.Analysis object. Edits to the Analysis object will affect the Spatial methods here

plot_cell_maps(plot_type: str, id: str | int | None = None, clustering: str = 'merging') → matplotlib.pyplot.figure

This plots cell maps either as “points” or as “masks”.

Args:

plot_type (string):: either “points” or “masks”. If “points” then the cells are represented as dots of various sizes on a white background. If “masks” then the cells are represented as their mask shapes on a black background (“masks” uses a squidpy plotting function).
id (string, integer, or None):: If None, will make cell maps for every sample_id / image in the dataset. If not None, will only make a plot for the specified image. id’s should either be the sample_id (for which a string or integer form will work) or the file_name of the desired image.
clustering (string):: the name of the column in self.exp.data.obs to use for coloring the cells. Usually one of the standard clustering column names (“merging”, “metaclustering”, etc.)

Returns:

matplotlib.figure or None

do_neighbors(radius_or_neighbors: str, number: int) → None

Creates the neighbor-graph between cells in the dataset (using their centroids). This step is necessary before performing any of the other neighbor-based methods. It uses the squidpy.gr.spatial_neighbors function to generate the neighborhood graph.

Args:

radius_or_neighbors (string):: whether to create the neighbor graph using a fixed radius (“Radius”) or to create the neighbor graph using a fixed number of nearest neighbors (“Neighbor”)
number (integer):: either the length of the search radius in pixels, or the number of nearest neighbors per cell (depending on radius_or_neighbors parameter)

Returns:

(None)
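
The two graph-building modes can be illustrated with a minimal pure-Python sketch. This is not the squidpy implementation that do_neighbors wraps; the helper build_neighbor_graph and the toy centroids are hypothetical, and the nearest-neighbor graph here is left directed (not symmetrized), which is an assumption.

```python
import math

def build_neighbor_graph(centroids, radius_or_neighbors, number):
    """Return {cell_index: set of neighbor indices} from (x, y) centroids.

    Hypothetical helper for illustration only; not part of PalmettoBUG.
    """
    n = len(centroids)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        # Distances from cell i to every other cell, nearest first
        dists = sorted(
            (math.dist(centroids[i], centroids[j]), j)
            for j in range(n) if j != i
        )
        if radius_or_neighbors == "Radius":
            # Keep every cell whose centroid lies within `number` pixels
            chosen = [j for d, j in dists if d <= number]
        elif radius_or_neighbors == "Neighbor":
            # Keep the `number` closest cells, however far away they are
            chosen = [j for _, j in dists[:number]]
        else:
            raise ValueError("expected 'Radius' or 'Neighbor'")
        graph[i].update(chosen)
    return graph

centroids = [(0, 0), (3, 0), (0, 4), (10, 10)]
radius_graph = build_neighbor_graph(centroids, "Radius", 5)
knn_graph = build_neighbor_graph(centroids, "Neighbor", 1)
```

In this toy layout the isolated cell at (10, 10) has no neighbors in the radius graph, but still gets one neighbor in the nearest-neighbor graph. This is the practical difference between the two modes.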

plot_neighbor_interactions(clustering: str = 'merging', facet_by: str = 'None', col_num: int = 1, filename: str | None = None) → matplotlib.pyplot.figure

This method wraps squidpy’s gr/pl.interaction_matrix functions. It plots a heatmap representing the number of interactions between cell types in the dataset. Note that this is an absolute number, and is affected by the abundance of celltypes (more abundant celltypes will have more interactions).

Args:

clustering (string):: the name of the column in self.exp.data.obs to use for grouping cells into cell types. Usually one of the standard clustering column names (“merging”, “metaclustering”, etc.). There should be >1 unique cluster; with only one cluster, the heatmap’s dimensions would be 1x1
facet_by (string):: a name of a column in self.exp.data.obs or “None”. If not “None”, used to plot interaction matrices for subsets of the data, for example to compare interaction matrices between conditions. The first panel is always the interaction matrix for the entire dataset (not subsetted).
col_num (integer):: If faceting, how many columns to have in the figure.
filename (string or None):: if not None, will write the plot as /{filename}.png to the spatial save folder of the palmettobug analysis directory.

Returns:

matplotlib.figure

plot_neighbor_enrichment(clustering: str = 'merging', facet_by: str = 'None', col_num: int = 1, seed: int = 42, n_perms: int = 1000, filename: None | str = None) → matplotlib.pyplot.figure

This method wraps squidpy’s gr/pl.neighborhood_enrichment functions. It plots a heatmap representing the enrichment of interactions over the random expectation between cell types in the dataset. This is calculated by permutation test, and the values of the heatmap are z-scores between the interactions found in the permutation test and the empirical number of interactions.

Args:

clustering (string):: the name of the column in self.exp.data.obs to use for grouping cells into cell types. Usually one of the standard clustering column names (“merging”, “metaclustering”, etc.). There should be >1 unique cluster; with only one cluster, the heatmap’s dimensions would be 1x1
facet_by (string):: a name of a column in self.exp.data.obs or “None”. If not “None”, used to plot enrichment matrices for subsets of the data, for example to compare them between conditions. The first panel is always the neighborhood enrichment for the entire dataset (not subsetted).
col_num (integer):: If faceting, how many columns to have in the figure.
seed (integer):: random seed for the permutation test
n_perms (integer):: how many permutations to perform in the permutation test
filename (string or None):: if not None, will write the plot as /{filename}.png to the spatial save folder of the palmettobug analysis directory.

Returns:

matplotlib.figure
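
The permutation z-score described for plot_neighbor_enrichment can be sketched in plain Python. This is a simplified illustration, not squidpy’s implementation: the helpers count_interactions and enrichment_zscore, the toy chain graph, and the labels are all hypothetical.

```python
import random
import statistics

def count_interactions(edges, labels, type_a, type_b):
    """Count graph edges that join a type_a cell to a type_b cell."""
    wanted = {type_a, type_b}
    return sum(1 for i, j in edges if {labels[i], labels[j]} == wanted)

def enrichment_zscore(edges, labels, type_a, type_b, n_perms=1000, seed=42):
    """z-score of the observed interaction count against shuffled labels."""
    rng = random.Random(seed)
    observed = count_interactions(edges, labels, type_a, type_b)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perms):
        # Shuffle cell-type labels while keeping the graph fixed
        rng.shuffle(shuffled)
        null_counts.append(count_interactions(edges, shuffled, type_a, type_b))
    mean = statistics.fmean(null_counts)
    sd = statistics.pstdev(null_counts)
    return (observed - mean) / sd if sd > 0 else float("nan")

# Toy tissue: a chain of 8 cells, fully segregated into two cell types
edges = [(i, i + 1) for i in range(7)]
labels = ["A"] * 4 + ["B"] * 4
z_ab = enrichment_zscore(edges, labels, "A", "B")
z_aa = enrichment_zscore(edges, labels, "A", "A")
```

Because the two types are segregated, A-B contacts are rarer than under random labelling (negative z-score) while A-A contacts are more common (positive z-score). Unlike the raw counts of the interaction matrix, these values are not driven purely by cell-type abundance.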

plot_neighbor_centrality(clustering: str = 'merging', score: str = 'closeness_centrality', filename: str | None = None) → matplotlib.pyplot.figure

Wraps squidpy’s gr/pl.centrality_scores functions. clustering corresponds to “cluster_key” in squidpy’s API, and score corresponds to “score”.

Args:

clustering (string):

The cell type grouping (‘merging’, ‘metaclustering’, etc.) to plot centrality scores for

score (string):: The type of centrality score to plot: [‘degree_centrality’, ‘closeness_centrality’, ‘average_clustering’]
filename (string or None):: If not None, specifies the filename to save the plot under (as a PNG) in the standard /Spatial_plots folder of the analysis folder.

do_neighborhood_CNs(clustering: str = 'merging', leiden_or_flowsom: str = 'FlowSOM', seed: int = 42, resolution: float = 1.0, min_dist: float = 0.1, n_neighbors: int = 15, **kwargs) → matplotlib.pyplot.figure

This method uses a previously constructed neighbor graph and a cell clustering to identify the proportions of each cell type among the neighbors of every cell, then runs an unsupervised clustering algorithm (FlowSOM or Leiden) to group the cells into “cellular neighborhoods” (CNs). This neighborhood grouping is appended to self.exp.data.obs as a “CN” column, and can be used in all the same ways as any other annotation / clustering to generate plots, etc. Additionally, a figure is returned that is unique to the type of clustering performed – if FlowSOM, a minimum spanning tree is returned, while if Leiden, a UMAP is returned.

Args:

clustering (string):: the name of the column in self.exp.data.obs to use for grouping cells into cell types. Usually one of the standard clustering column names (“merging”, “metaclustering”, etc.).
leiden_or_flowsom (string):: “Leiden” or “FlowSOM” – determines which of the unsupervised clustering algorithms will be used to group the cells.
seed (integer):: The random seed for the clustering algorithm
resolution (float):: ONLY for Leiden clustering – corresponds to the same parameter (resolution) in scanpy’s tl.leiden function
min_dist (float):: ONLY for Leiden clustering – corresponds to the same parameter (min_dist) in scanpy’s tl.umap function, which is necessary for leiden clustering
n_neighbors (integer):: ONLY for Leiden clustering – corresponds to the same parameter in scanpy’s pp.neighbors function, which is necessary for leiden clustering
**kwargs:: ONLY for FlowSOM – these are passed into the FlowSOM class (copied from the saeyslab FlowSOM_Python repository). This allows key parameters like the number of training cycles (rlen), number of output clusters (n_clusters), and x/y dimensions (xdim and ydim) to be passed to the FlowSOM instance.

Returns:

matplotlib.figure
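
The first step of do_neighborhood_CNs (the proportion of each cell type among every cell’s neighbors) can be sketched as below. The helper neighbor_composition and the toy graph are hypothetical, and whether PalmettoBUG counts a cell as part of its own neighborhood is not stated here; this sketch excludes it.

```python
from collections import Counter

def neighbor_composition(graph, labels, cell_types):
    """For each cell, the fraction of its neighbors in each cell type.

    graph: {cell_index: set of neighbor indices}
    labels: list mapping cell_index -> cell type
    """
    rows = {}
    for cell, neighbors in graph.items():
        counts = Counter(labels[n] for n in neighbors)
        total = sum(counts.values())
        # Cells without neighbors get an all-zero row
        rows[cell] = [counts[t] / total if total else 0.0 for t in cell_types]
    return rows

graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
labels = ["T cell", "T cell", "B cell", "B cell"]
cell_types = ["B cell", "T cell"]
composition = neighbor_composition(graph, labels, cell_types)
```

Each row of composition is the feature vector that a clustering algorithm (FlowSOM or Leiden in this method) would then group into cellular neighborhoods.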

plot_CN_graph(filename: str | None = None) → matplotlib.pyplot.figure: UMAP or star-plot – note that this figure is already returned by the method above

plot_CN_heatmap(clustering: str = 'merging', **kwargs) → matplotlib.pyplot.figure: Plots a heatmap of the proportions of the cell types in each of the CN clusters

plot_CN_abundance(clustering: str, cols: int = 3) → matplotlib.pyplot.figure: Plots a facetted barplot of the proportion of each cell type in each of the CN clusters

estimate_SpaceANOVA_min_radii(with_empty_space: bool = True) → int

Estimates a minimum radius for the SpaceANOVA analysis using information about the cell masks & images (such as perimeter, area, cell-occupied bounding-box areas, etc.).

If with_empty_space is True, will further adjust the estimated minimum radius upward using the proportion of empty space in the cell-occupied regions of the images

do_SpaceANOVA_ripleys_stats(clustering: str, max: int = 100, min: int = 10, step: int = 1, condition1: str | None = None, condition2: str | None = None, threshold: int = 10, permutations: int = 0, seed: int = 42, center_on_zero: bool = False, silence_zero_warnings: bool = True, suppress_threshold_warnings: bool = False) → None

Calculates Ripley’s spatial statistics for every celltype-celltype pair in clustering in every image. The necessary first step in the SpaceANOVA analysis pipeline.

Args:

clustering (string):

the name of the column in self.exp.data.obs to use for grouping cells into cell types. Usually one of the standard clustering column names (“merging”, “metaclustering”, etc.).

max / min / step (integers):

these are the integers that determine at which radii statistics will be calculated. An easy way to see what those radii will be is:

>>> list(range(min, max, step))

The default looks at the range 10-100 with a step size of 1. The min is set to be > 0 because the first few radii should essentially always have zero cell interactions in them. This is because we calculate using cell centroids, so even if two cells directly touch, the first few radii will still not have an interaction. For example, if a perfectly circular cell has a diameter of 10, then only radii > 5 will even have a chance of encountering another cell, since radii < 5 will only search space INSIDE the cell. And even once the search radius is outside the cell, it does not count as an interaction when it touches another cell – it only is counted when it touches the centroid of the other cell. Because of this, usually radii < 10 or so have almost no interactions and therefore very unusual behaviour, and might be best dropped from the calculations.
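
As a small worked example of the points above, assuming the radii are generated exactly by the range expression shown (the variable names are hypothetical and renamed to avoid shadowing Python’s built-in min and max):

```python
min_radius, max_radius, step = 10, 100, 1  # the documented defaults
radii = list(range(min_radius, max_radius, step))
print(radii[0], radii[-1], len(radii))  # 10 99 90

# Two touching, perfectly circular cells of diameter 10 have centroids
# 5 + 5 = 10 pixels apart, so no smaller radius can register an interaction.
cell_diameter = 10
centroid_gap = cell_diameter / 2 + cell_diameter / 2
first_radius_reaching = next(r for r in radii if r >= centroid_gap)
```

Note that range excludes its upper bound, so with this expression the largest radius evaluated under the defaults is 99, not 100.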

condition1 / condition2 (string or None):

If both are None (default), then every condition in the dataset is used (and the fANOVAs are multi-comparison). Else if both condition1/2 are specified, they should be unique values in the ‘condition’ column of self.exp.data.obs, and the SpaceANOVA analysis will only look at those conditions, using pairwise comparisons / ANOVAs.

threshold (integer):

default = 10. If at least one of the celltypes in a given celltype-celltype pair has fewer cells than this threshold in a given image, that image will be skipped & no Ripley statistics will be calculated from that image for that celltype-celltype pair. Note that if a given celltype-celltype pair never passes this threshold for any of the images for a given condition, then it will be ignored for that condition. Further, if only one condition (or no conditions) has images that pass the threshold, then an ANOVA for that celltype-celltype pair is impossible and will be skipped. However, even then, the Ripley’s statistics that were successfully calculated for that single condition can still be plotted.
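
The threshold rule can be sketched as follows. This is an illustration of the logic described above, not PalmettoBUG’s code; the helper images_passing_threshold, the image names, and the condition labels are hypothetical.

```python
from collections import Counter

def images_passing_threshold(cells, type_a, type_b, threshold=10):
    """Return image ids where BOTH cell types have at least `threshold` cells.

    cells: iterable of (image_id, cell_type) tuples, one per cell.
    """
    counts = Counter(cells)
    images = sorted({image for image, _ in counts})
    return [
        image for image in images
        if counts[(image, type_a)] >= threshold
        and counts[(image, type_b)] >= threshold
    ]

cells = (
    [("img1", "A")] * 12 + [("img1", "B")] * 10   # both types pass
    + [("img2", "A")] * 30 + [("img2", "B")] * 4  # too few B cells
)
passing = images_passing_threshold(cells, "A", "B")

# An ANOVA needs usable images from at least two conditions
condition_of = {"img1": "control", "img2": "treated"}
conditions_with_data = {condition_of[image] for image in passing}
anova_possible = len(conditions_with_data) >= 2
```

Here only img1 passes, so only one condition has data: the Ripley’s statistics for that condition could still be plotted, but no ANOVA could be run for the A-B pair.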

permutations (integer):

If greater than zero, then a permutation correction will be applied to the data. This is done by randomizing the celltype labels in an image and calculating the average Ripley’s K for those randomizations. The average random K for the celltype is then subtracted from the calculated Ripley’s K for that celltype in that image. This corrected K is then used to calculate Ripley’s L / g as normal. Permutation correction can slow the calculation substantially, but is almost always recommended as it uses the actual structure of the cells in the images to correct the values of the Ripley’s statistics. This is a powerful and simple way to correct for holes / inhomogeneities in the tissue.

seed (integer):

This is the random seed used for the SpaceANOVA methods. This includes the random permutations for the permutation correction, but also the seeds used for plotting error regions and for the fANOVA. The seeds for plotting & fANOVA can be set separately when calling those functions, but by default whatever you use here for the seed will be used for those steps as well.

center_on_zero (boolean):

ONLY with permutation correction (permutations > 0). This determines whether to ‘center’ Ripley’s g on 0 or on 1. Ripley’s g is unique among the Ripley’s statistics in that it is particularly easy to interpret, as its theoretical value in a random point pattern is equal to a straight line at 1, with values above 1 indicating more association between points than expected, and values less than 1 indicating fewer interactions than expected. However, the permutation correction shifts this centerpoint to 0 when the permutation K is subtracted from the calculated K. Additionally, when this subtraction is done the shape of K / L will deviate strongly from the theoretical shape of those statistics’ curves. So: If this parameter is True, then this shift is allowed to occur, and g will need to be interpreted as centered on 0.

>>> Permutation correction is: K = K_data - K_permutation

If this parameter is False, then after subtracting the permutation K, the theoretical K { pi*(r^2) } is added back, which shifts the center of g to 1 without changing its shape. This change also restores the shape of the K / L statistics to better match their more usual, monotonically increasing shape

>>> Permutation correction is: K = K_data + (K_theoretical - K_permutation)
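
The two correction formulas can be compared numerically with a minimal sketch. The function corrected_K and the example K values are hypothetical; only the two formulas themselves come from the text above.

```python
import math

def corrected_K(K_data, K_permutation, radius, center_on_zero):
    """Apply the permutation correction to one Ripley's K value."""
    if center_on_zero:
        # Shifted form: a random pattern gives K = 0
        return K_data - K_permutation
    # Add back the theoretical K of a random pattern, pi * r^2
    K_theoretical = math.pi * radius ** 2
    return K_data + (K_theoretical - K_permutation)

radius = 10
K_data, K_permutation = 400.0, 350.0  # hypothetical values at this radius

shifted = corrected_K(K_data, K_permutation, radius, center_on_zero=True)
restored = corrected_K(K_data, K_permutation, radius, center_on_zero=False)

# If the data look exactly like the permutations, the restored form falls
# back to the theoretical value and the shifted form falls back to zero.
null_shifted = corrected_K(350.0, 350.0, radius, center_on_zero=True)
null_restored = corrected_K(350.0, 350.0, radius, center_on_zero=False)
```

The two forms differ only by the constant pi*(r^2) at each radius, so they carry the same information about deviation from the permuted pattern. The choice only changes the baseline against which the curves are read.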

silence_zero_warnings (boolean):

this method generates a large number of zero division errors, even in a normal run. By default these are silenced.

suppress_threshold_warnings (boolean):

If True, will not print warnings about images failing to meet cell number thresholds

plot_spaceANOVA_function(stat: str, comparison: str = None, seed: int | None = None, f_stat: str | None = None, hline: int | None = None, output_directory: str | None = None)

This function plots a selected Ripley’s statistic for a celltype-celltype pair, and optionally also the single-radius f-values from ANOVA tests conducted at each point along the Ripley stat graph.

Args:

stat (string):: either “K”, “L”, or “g”. Determines which Ripley’s statistic to plot
comparison (string or None):: a string with the form {celltype1}___{celltype2}. The triple underscore in the middle is how this string is split into the two cell types of interest (don’t have a triple underscore inside your cell type labels!). A full list of the available comparisons of this form can also be easily accessed with the self.SpaceANOVA._all_comparison_list attribute
seed (integer or None):: the random seed for the fANOVA function. If None, then the seed previously selected in self.do_SpaceANOVA_ripleys_stats will be used.
f_stat (string or None):: if not None, should be “f” (typical), “padj”, or “p”. If not None, adds a panel to the final plot showing the results of (standard, not functional) ANOVA tests comparing conditions at every individual radius. This is useful for visualizing at what distance the difference between conditions is most significant.
hline (int or None):: if not None, draws a horizontal line on the Ripley’s statistics plot at the value (usually ONLY when plotting the ‘g’ statistic, and set to 0 or 1, depending on where the graph is centered)
output_directory (string or None):: If None, the plots are exported to the automatic / standard directory in the PalmettoBUG project. If not None, should be the path to a folder where the plots can be exported. (ONLY used if comparison is None)
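
The {celltype1}___{celltype2} format of the comparison argument can be illustrated with a short sketch. The helper split_comparison and the cell type labels are hypothetical; the guard shows why a triple underscore inside a label is a problem.

```python
def split_comparison(comparison):
    """Split a 'celltype1___celltype2' string into its two cell types."""
    parts = comparison.split("___")
    if len(parts) != 2:
        # A label that itself contains '___' makes the split ambiguous
        raise ValueError(f"cannot split {comparison!r} into exactly two cell types")
    return tuple(parts)

celltype1, celltype2 = split_comparison("CD4 T cell___Macrophage")
```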

run_SpaceANOVA_statistics(stat: str = 'g', seed: int | None = None): This runs the functional ANOVA on the available Ripley’s statistics, returning 3 data tables: the p-values, the adjusted p-values, and the fANOVA statistics

plot_spaceANOVA_heatmap(stat: str, filename: None | str = None) → matplotlib.pyplot.figure

Plots a heatmap from one of the dataframes returned / created by self.run_SpaceANOVA_statistics. If plotting an (adjusted) p-value, as is typical, the statistic is transformed by the negative log first so that higher numbers indicate higher significance.

stat = ‘p’, ‘padj’, or ‘f’
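
A minimal sketch of the negative log transform follows. The base of the logarithm is not stated above, so base 10 is an assumption here, and the helper negative_log10 and the p-values are hypothetical.

```python
import math

def negative_log10(table):
    """Apply -log10 to every p-value in a table given as nested lists."""
    return [[-math.log10(p) for p in row] for row in table]

# Hypothetical adjusted p-values for the pairs of two cell types
padj = [
    [0.5, 0.001],
    [0.001, 0.04],
]
transformed = negative_log10(padj)
```

After the transform, the smallest p-value (0.001) becomes the largest heatmap value, so the most significant celltype-celltype pairs stand out.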

do_edt(pixel_classifier_folder: str, masks_folder: str, maps: str = '/classification_maps', smoothing: int = 10, stat: str = 'mean', normalized: bool = True, background: bool = False, marker_class: str | None = 'spatial_edt', auto_panel: bool = True, output_edt_folder: None | str | pathlib.Path = None, save_path: None | str | pathlib.Path = None)

Calculates the euclidean distance between cell masks (provided in masks_folder) and matching pixel classifications (pixel_classifier_folder). This appends the calculated edt for each cell to self.exp.data, as a new ‘channel’ / ‘antigen’, where it can then be used for plotting & calculations.
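
To make the idea concrete, here is a brute-force sketch of a distance map for one pixel class, followed by reading a statistic off one cell mask. It is not the transform PalmettoBUG uses; the helper distance_to_class and the tiny arrays are hypothetical, and treating pixels inside the class as distance 0 is an assumption.

```python
import math

def distance_to_class(class_map, target_class):
    """Euclidean distance from every pixel to the nearest target_class pixel."""
    targets = [
        (r, c)
        for r, row in enumerate(class_map)
        for c, value in enumerate(row)
        if value == target_class
    ]
    if not targets:
        raise ValueError("target_class is absent from the classification map")
    return [
        [min(math.dist((r, c), t) for t in targets) for c in range(len(row))]
        for r, row in enumerate(class_map)
    ]

# Pixel classification map (0 = background, 1 and 2 = pixel classes)
class_map = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
]
# Segmentation mask: cell number 7 occupies the top-right 2x2 block
masks = [
    [0, 0, 7, 7],
    [0, 0, 7, 7],
    [0, 0, 0, 0],
]
edt_class1 = distance_to_class(class_map, 1)
cell_values = [
    edt_class1[r][c]
    for r in range(len(masks))
    for c in range(len(masks[0]))
    if masks[r][c] == 7
]
cell_mean = sum(cell_values) / len(cell_values)
```

The per-cell value (here the mean over the cell’s footprint) is what would be appended to the data as a new channel for that pixel class.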

Args:

pixel_classifier_folder (str):

the path to a PalmettoBUG-generated pixel classifier folder. This folder needs to contain a subfolder of .tiffs containing the pixel class predictions (see maps argument), and contain a biological_labels.csv indicating what the biological names of each class in the classifier are. When a pixel classifier with > 1 predicted pixel class is used, an edt statistic will be calculated separately for each and added to self.exp.data

masks_folder (str):

the path to a folder containing a set of .tiff files of cell segmentation masks. These .tiffs should match those in f’{pixel_classifier_folder}/{maps}’.

maps (str):

should be either “/classification_maps” or “/merged_classification_maps”. This determines which subfolder of pixel_classifier_folder contains the .tiff files of the pixel classification maps to use for calculating the distance from. The filenames of these .tiffs should match those in masks_folder

smoothing (int):

If == 0, no smoothing is performed. Otherwise this indicates the size of isolated pixel class regions to smooth out before calculating distances. As in, if smoothing == 10 (default) regions of a pixel class smaller than 10 pixels will be dropped and “smoothed” into the surrounding pixel classes using mode-based fill-in (the mode of the remaining, closest neighbor pixels will be used to assign the replacement value for dropped pixels). Why smooth? When calculating the distance from a pixel class using the Euclidean Distance Transform (EDT), very small pixel regions can have an outsized impact on the final EDT map, so removing spurious / small regions can help clean up the final calculation.

stat (str):

One of “mean”, “median”, or “minimum”. This determines what statistic is read off of each segmentation region when calculating the edt value for each cell. The default, “mean”, is the most common use, and represents the average distance from the pixel class across the whole cell’s spatial footprint. “min” means that the calculated value for each cell will just be its minimum distance to the class of interest (its closest point). Of note, when “min” is the selected statistic, normalization cannot be performed (see following argument).

normalized (bool):

Whether to normalize (True) or not (False). In this case, normalization means dividing each cell’s determined edt value by the average of the image that cell is in. As in, if “median” statistic is selected with normalized = True, each cell’s edt value will be:

cell_stat = median(cell_edt) / median(image_edt)

instead of the non-normalized value:

cell_stat = median(cell_edt)

When “mean” is used as a statistic, then the normalization factor is the mean of the image’s edt values. Normalization cannot be calculated this way when the statistic is “min”, and so is ignored (there is no normalization for “min”).

Normalization is useful as it is a way to help correct for the abundance of the pixel class. As in, if one image is 70% within a pixel class (let’s say the pixel class is for fibrotic regions) its edt distances will be much lower across the image than an image where only 30% of the image is the fibrotic class. This would make all cells - regardless of cell type - in image 1 have much lower edt values than the cells in image 2, which would not be an inaccurate conclusion (the cells in image 1 are genuinely closer to fibrotic regions), but introduces a confounding factor: have the cells moved towards the fibrosis, or have the fibrotic regions expanded? Normalization helps address this, in part, as it takes into account the total quantity & positioning of the pixel class in each image when calculating the edt value for each cell.
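
The normalization formulas above can be sketched as a small function. The helper cell_edt_stat and the example values are hypothetical, and the key “min” follows the wording of the normalization paragraphs (the exact accepted string is not confirmed here).

```python
import statistics

def cell_edt_stat(cell_edt, image_edt, stat="mean", normalized=True):
    """Aggregate a cell's edt values, optionally dividing by the image-wide value.

    cell_edt: distances at the pixels inside one cell mask
    image_edt: distances at every pixel of the image containing that cell
    """
    aggregate = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "min": min,
    }[stat]
    value = aggregate(cell_edt)
    if normalized and stat != "min":
        # Same aggregation over the whole image is the normalization factor
        value = value / aggregate(image_edt)
    return value

image_edt = [0, 1, 2, 3, 4, 5, 6]  # hypothetical whole-image distances
cell_edt = [4, 5, 6]               # hypothetical distances inside one cell

normalized_median = cell_edt_stat(cell_edt, image_edt, "median", True)
raw_median = cell_edt_stat(cell_edt, image_edt, "median", False)
minimum = cell_edt_stat(cell_edt, image_edt, "min", True)
```

A normalized value above 1 means the cell is farther from the pixel class than is typical for its image, regardless of how abundant the class is in that image. Note that this sketch does not guard against an image-wide value of zero.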

background (bool):

whether to include the ‘background’ class in the edt calculations (True) or not (False, default). Usually, the background class is ignored, because it is not biologically relevant, but in some situations there is effectively no background class / even the background is biologically meaningful. For example, if a classifier is trained to identify broad tissue regions in a sample (such as intestinal crypt lumen / epithelia / lamina propria in a colon section), it might be the case that every part of the image falls into one of the pixel classes and there is no ‘background’. Because of how supervised classifiers are trained in PalmettoBUG, a background class is always created, so this option is useful if that ‘background’ is actually a relevant grouping.

marker_class (str or None):

what marker_class (self.exp.data.var[‘antigen’]) to assign the edt columns when they are added to self.exp.data.X. By default, this is “spatial_edt”, as it helps the subsequent plotting functions easily find the edt channels while ignoring the other marker_classes. However, setting this to “type” / “state” / “none” is also allowed, if you want to perform plots / calculations that combine both the spatial edt data and other channels.

auto_panel (bool):

Whether to automatically add the channels to self.exp.data (True, default), or only to return a panel dataframe (for manually editing marker_class, say if you want to assign different edt statistics from a single classifier to different marker_class-es). If False, you will need to manually add the edt data to self.exp.data with self.edt.append_distance_transform(distances_panel = {your edited marker_class panel}).

output_edt_folder (None, string, or Path):

Default is None (no edt map export). If not None, then should be the path where folders of .tiff files can be exported. Specifically, the folders used will be f’{output_edt_folder}_{class_biological_label}’ for each class in the classifier. The saved .tiff files will be the intermediate Euclidean distance transforms for that class, from which the stats for each cell mask were calculated.

save_path (None, string, or Path):

Default is None (edt values are not saved). If not None, then should be a file path where a csv file can be written. This csv will contain the information for all the edt’s calculated for the provided pixel classifier / masks pairing. Note that marker_class information will not be saved (that will need to be set again on re-load of the saved edt values)

Returns:

a pandas dataframe (panel) which can be used to see or set the marker_class information

a pandas dataframe (self.edt.results) which contains the edt calculations for every cell and pixel class

do_reload_edt(dataframe: pandas.DataFrame, marker_class: str) → None

Loads a column of data into the anndata of the experiment (meant for saved edt information, but could be used for any type of channel)

Args:

dataframe (pandas DataFrame, or Path/string):: If not a pandas DataFrame, will attempt to pandas.read_csv(dataframe) first. This dataframe should be as long as the number of cells in the data and have columns representing spatial_edt data for each cell. Its format should match the format of the table exported by do_edt

plot_edt_heatmap(groupby_col: str, marker_class: str = 'spatial_edt', filename: None | str = None) → matplotlib.pyplot.figure

Plots a heatmap for spatial edt – default marker_class is spatial_edt, and export folder (if filename is provided) is in /Spatial_plots

groupby_col specifies a clustering (such as ‘merging’, ‘metaclustering’, etc.) to group the heatmap by

plot_edt_boxplot(var_column: str, groupby_col: str = 'merging', facet_col: str = 'condition', col_num: int = 3, filename: str = '') → matplotlib.pyplot.figure

Plots a channel on a horizontal boxplot. Could be used for non-spatial_edt data, but export folder (if a filename is provided) is in /Spatial_plots.

Args:

var_column (str):: the channel to use for the plots. Usually a spatial edt channel (like ‘distance to Vimentin’), but could be any of the channels in self.exp.data.var[‘antigen’]
groupby_col (str):: the column in self.exp.data.obs that will be used to group the box plot (one box per unique value in this column, per facet)
facet_col (str):: the column in self.exp.data.obs that will be used to split the data into separate boxplots. Usually (and by default) faceted on the condition column, for comparison of treatment vs. control
col_num (int):: the number of columns of facets before they begin to wrap. As in, with the default of col_num = 3, then the fourth facet will be on the second row of the facet grid.
filename (str):: the name under which to save a .png file of the plot in /Spatial_plots. If not provided (default value), the plot will not be written to the disk.

Returns:

matplotlib.figure (the boxplot)
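
The wrapping behaviour of col_num can be checked with a one-line calculation. The helper facet_position is hypothetical and uses zero-based indices.

```python
def facet_position(facet_index, col_num=3):
    """Return the zero-based (row, column) of a facet in a wrapped grid."""
    return divmod(facet_index, col_num)

# The fourth facet (index 3) wraps onto the second row (row index 1)
fourth_facet = facet_position(3)
positions = [facet_position(i) for i in range(5)]
```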

run_edt_statistics(groupby_column: str, marker_class: str = 'spatial_edt', N_column: str = 'sample_id', statistic: str = 'mean', test: str = 'anova', filename: str | None = None) → pandas.DataFrame

A wrapper on do_state_exprs from the palmettobug.Analysis class, but with the default marker_class of ‘spatial_edt’, and an output folder in /Spatial_plots.

Args:

groupby_column (string):: a clustering of the data (such as ‘leiden’, ‘merging’, etc.)
marker_class (string):: ‘spatial_edt’ (default), ‘type’, ‘state’, ‘none’, ‘All’ – if specifying any marker_class except ‘spatial_edt’, then there is little reason to use this function, as you could just use palmettobug.Analysis.do_state_exprs
statistic (string):: ‘mean’ or ‘median’ – which aggregation method to use when calculating the average value in each ROI / sample_id
test (string):: ‘anova’ or ‘kruskal’ – whether to use an ANOVA or a Kruskal-Wallis test to do the stats
filename (string or None):: If not None, specifies the filename to save the statistics table under (as a CSV) in the /Spatial_plots folder.

Returns:

a pandas dataframe, containing the statistics