palmettobug.Analysis_functions.Analysis
This module contains a single Analysis Class which handles the back-end of the main CATALYST-style Analysis pipeline of PalmettoBUG, and is available in the public (non-GUI) API of PalmettoBUG.
This is used in the GUI by the fourth tab of the program.
Classes
Analysis: This class is essentially a Python port of CATALYST, but with certain differences, including slightly different calculations / normalizations, additional functions, and missing functions.
Module Contents
- class palmettobug.Analysis_functions.Analysis.Analysis(in_gui=False)
This class is essentially a Python port of CATALYST, but with certain differences, including slightly different calculations / normalizations, additional functions, and missing functions.
There are a few broad types of methods: "load_", "do_", and "plot_". Methods starting with "load_" read data into the class. Methods starting with "do_" tend to execute a transformation or a calculation on the data (such as statistics, UMAP / PCA, or scaling). Those starting with "plot_" always generate a plot, usually returning a matplotlib figure.
- Args:
- in_gui (bool):
Whether this class is inside the GUI (True) or not (False). Used primarily for determining whether to have tkinter pop-up warnings (True) or print-to-console warnings (False)
Most of the critical steps in setting up an Analysis occur in the data loading methods, not in the initialization of the class.
- Input / Output:
If a method contains a “filename” keyword argument (with default = None), then supplying that argument will trigger the export of the method’s return data to the directory. For a plotting method, supplying a filename means that it will not only return a matplotlib figure as usual, but will ALSO export the figure to the directory as a PNG file, at:
self.save_dir/{filename}.png
Methods that return data tables are similar, but export to self.data_table_dir (not self.save_dir)
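The plot-export convention above can be sketched with pathlib. This is only an illustration of the path rule stated in this section, not PalmettoBUG's own code; the helper name plot_export_path and the example folder are hypothetical.

```python
from pathlib import Path

def plot_export_path(save_dir, filename):
    """Hypothetical helper: the path a plot would be exported to, or None."""
    if filename is None:
        return None  # no filename supplied: the figure is only returned, not exported
    return Path(save_dir) / f"{filename}.png"

print(plot_export_path("my_analysis/Plots", "cell_counts"))
print(plot_export_path("my_analysis/Plots", None))
```

Note that some methods documented further down (plot_facetted_DR, for example) state that the file extension must be included in filename, so the per-method notes take precedence over this general rule.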
- Key Attributes:
- data (anndata.AnnData): This is an anndata object containing the numerical values of the channels in data.X, the event
annotation in data.obs and the antigen annotations in data.var. Pre-arcsinh transformed data lives in data.uns, but is not used for any function in this pipeline. data.obs starts out with the same information as the metadata, except that each unique entry in metadata (representing a unique sample_id) is replicated across all the events of that sample_id (there is usually more than one cell per image!). Likewise, data.var starts out the same as panel (truly identical). As clusterings are performed, new columns can be added to data.obs that did not initially exist in the metadata
- metadata & panel (pandas dataframes): these are the metadata and panel pandas dataframes that get loaded into data.obs and
data.var & represent the metadata.csv and Analysis_panel.csv files in the directory of the Analysis.
- UMAP_embedding & PCA_embedding (anndata.AnnData): usually downsampled from data, these are anndata objects with UMAP or
PCA values for plotting in 2 dimensions
- directory (str): the path to the folder where the Analysis is initialized / performed. Used to find the input data and
set up the directories for outputs.
- save_dir (string): the path to the folder where plots generated by this class are saved
- data_table_dir (string): the path to the folder where data tables (such as exports or statistics) are saved
- clusterings_dir (string): the folder where clustering .csv files are saved, and where they are expected to be for reloading
- _in_gui = False
- directory = None
- data = None
- back_up_data = None
- back_up_regions = None
- logger = None
- clusterings_dir = None
- _scaling = 'unscale'
- unscaled_data = None
- _quantile_choice = None
- input_mask_folder = None
- _distance_edt_data = None
- is_batched = 0
- load_data(directory: pathlib.Path | str, arcsinh_cofactor: int | float = 5, save_dir: str = 'Plots', data_table_dir: str = 'Data_tables', csv: str | pathlib.Path | None = None, csv_additional_columns: list = [], load_regionprops=True) None
Load the data for an analysis
- Args:
- directory (string or Path): the path to the directory where the Analysis is to be performed. If csv is None, then the expectation
is that there should be .fcs files inside a subfolder of this directory (specifically inside a /Analysis_fcs subfolder)
- arcsinh_cofactor (integer): Default is 5. If > 0, will transform data according to the following equation
>>> data = arcsinh(data / arcsinh_cofactor)
- save_dir & data_table_dir (str): these allow you to specify what self.save_dir and self.data_table_dir will be WITHIN the main
directory. By default save_dir == “Plots” and data_table_dir == “Data_tables”. If you want to export outside the main directory, set these attributes later using a full file path string.
- csv (string/Path or None): the path to a csv file containing data ready to import into PalmettoBUG (the format for this kind of data
matches what PalmettoBUG exports in an Analysis). If None (default) then presumes .fcs files are available in the appropriate folder (directory/Analysis_fcs) and will load from those files.
- csv_additional_columns (list): ONLY used if loading from csv – this is a list of non-standard column names in csv that are to be treated
as metadata (will end up in self.data.obs) and not as numerical data (destined for self.data.X). The “standard” metadata column names are those commonly encountered in PalmettoBUG operation, such as “sample_id” or “leiden”. This is mainly intended to increase flexibility in cases where PalmettoBUG is being used outside the GUI & a novel metadata category is created.
- load_regionprops (boolean): whether to load the regionprops as well. This is important if you plan on doing any spatial analysis
as this loads the centroids, etc. It does not APPEND the regionprops to the anndata object (self.data), and you must call append_regionprops in order to do that.
- Input/Output:
Input: expects either a .csv file at the path defined by the [csv] argument, or expects a folder of only .fcs files located at [directory]/Analysis_fcs.
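The arcsinh transformation described for arcsinh_cofactor can be illustrated in plain Python. This is a minimal sketch of the arithmetic only, assuming (as the loader helpers below state) that a cofactor of 0 or less means no transformation; the function name arcsinh_transform and the example values are hypothetical.

```python
import math

def arcsinh_transform(values, cofactor=5):
    """Apply arcsinh(value / cofactor); return the values unchanged if cofactor <= 0."""
    if cofactor <= 0:
        return list(values)
    return [math.asinh(v / cofactor) for v in values]

raw = [0.0, 5.0, 50.0, 500.0]          # made-up raw intensities
print(arcsinh_transform(raw))            # compressed, roughly log-like at high values
print(arcsinh_transform(raw, cofactor=0))  # untouched copy of the input
```

The cofactor sets where the transform switches from roughly linear (values well below the cofactor) to roughly logarithmic (values well above it).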
- _load_fcs(arcsinh_cofactor: int | float = 5) None
Loads and processes .fcs files from the ‘Analysis_fcs’ directory, aligns them with metadata, applies arcsinh transformation, and stores the result in an AnnData object for downstream analysis.
- Args:
- arcsinh_cofactor (int | float): The cofactor used for arcsinh transformation of intensity values.
If set to 0 or less, no transformation is applied.
- _load_csv(csv_path: pathlib.Path | str, additional_columns: list = [], arcsinh_cofactor: int | float = 5) None
Helper for load_data that handles the loading of a csv file (this csv is usually exported from PalmettoBUG as well, and expects a particular format that PalmettoBUG can export)
- Args:
- csv_path (str or Path): Full path to the CSV file containing single-cell data (can be from outside the Analysis directory).
- additional_columns (list): List of custom metadata columns to treat as metadata
(i.e., to include in obs rather than X). These must not conflict with antigen names. NOTE: Ignored if the csv contains information about type / state/ etc. in the final row (additional metadata columns are automatically identified)
- arcsinh_cofactor (int or float): If > 0, applies arcsinh transformation to expression data
using: arcsinh(data / cofactor). If 0 or less, no transformation is applied.
- load_regionprops(regionprops_directory: pathlib.Path | str | None = None, auto_panel: bool = True) pandas.DataFrame
This method handles the loading of regionprops data (only from FCS directories – directories from exported CSVs depend on regionprops data already in the CSV, if present).
- Args:
- regionprops_directory (Path, string, None):
The path to a folder containing the regionprops .csv files exported during region measurements. If None, then assumes this regionprops folder exists in the usual location of an analysis – i.e., in a /regionprops folder one folder above this class’s self.directory
- auto_panel (bool):
If True, uses the automatic type / state / none assignments for each region property and proceeds immediately into appending the regionproperties to the dataset. If False, then you can edit the returned dataframe to reflect your desired marker_class assignments, and feed that into self.append_regionprops
- Returns:
- (pandas dataframe):
an automatic Regionprops_panel.csv file (mimics an Analysis_panel.csv file, treating each region property like an antigen). Centroid-0 / centroid-1 are set to marker_class ‘none’, while all other regionprops are left as ‘type’ markers
- Input/Output:
- Input:
reads from the provided regionprops_directory. Expects only .csv files representing regionproperties – all with the same columns of data to allow concatenation – inside this folder.
- Output:
writes a file to self.directory/Regionprops_panel.csv, which has the same format as the Analysis_panel.csv: a row for each “marker”, with 3 columns for its name(s) and marker_class (type/state/none). In this case, however, the “markers” are not antigens, but region properties like eccentricity, area, etc.
- append_regionprops(regionprops_panel: pandas.DataFrame | str | pathlib.Path | None = None) None
Continuation of load_regionprops. Useful if you don’t like the automatic type / state / none assignments for each region property.
This adds the regionprops data & panel to the main anndata object in this class
NOTE: don’t call more than once!! You can duplicate data columns that way.
If regionprops_panel is left as None, will read in the Regionprops_panel from self.directory/Regionprops_panel.csv
- filter_data(to_drop: str, column: str = 'sample_id') None
This function drops all rows matching to_drop in the provided column from self.data.
- Args:
- to_drop (str):
The unique value in [column] to drop all cells with that value
- column (str):
The column in self.data.obs to use in dropping data from the analysis.
- do_COMBAT(batch_column: str, covariates=None) None
Performs scanpy’s combat implementation on self.data. See their documentation for more details
batch_column specifies a column in self.data.obs to use as the batch grouping for the correction (usually ‘patient_id’)
- do_scaling(scaling_algorithm: str = '%quantile', upper_quantile: float = 99.9, split_by_column: str = '') None
This method allows the easy scaling / unscaling of the numerical data in self.data.X. The scaling is always performed down / within columns such that different antigens end up on the same / more similar scale.
- Args:
- scaling_algorithm (string):
one of [“%quantile”, “min_max”, “standard”, “robust”, “qnorm”, and “unscale”]. If “unscale”, will undo any previous scaling. Unscaled data is saved before any other scaling method is performed, allowing easy reversion and switching between scaling methods. If a scaling is ever applied after another scaling, the unscaled data is used in the calculations (it is as if the first scaling never happened). Comparison of scaling methods:
>> %quantile: This is perhaps the most common method for this kind of data. In it, each column is divided by the value of its quantile % provided in the upper_quantile argument (this would be the same as dividing by the maximum of each column if upper_quantile == 100). Then all values > 1 are reduced to 1 so that the scale of the data is constrained. This process is somewhat reminiscent of thresholding the brightness of an image by choosing a maximum threshold.
>> min_max: This scales each channel / antigen between 0 and 1 by this equation: (values - min) / (max - min). It is performed by scikit-learn’s preprocessing min_max function.
>> standard: This performs standard scaling (each column is scaled to a mean of 0 and a variance of 1). It is performed by scikit-learn’s preprocessing scale function.
>> robust: This performs robust scaling using scikit-learn’s preprocessing robust_scale function. It is more resistant to outliers & does not try to scale to normality, unlike standard scaling.
>> qnorm: This method is known for its use in large genomics studies, and uses a particular quantile-based scaling method, implemented by: https://github.com/Maarten-vd-Sande/qnorm
- upper_quantile (float):
ONLY USED with scaling_algorithm == “%quantile”. Determines the upper quantile percentage used in that scaling method
- split_by_column (string):
If not == “”, then will attempt to find a column in self.data.obs matching the provided value, then will split the dataset by the unique groups in that column and perform the selected scaling WITHIN those groups individually, rather than on the entire dataset at once.
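The “%quantile” and “min_max” descriptions above can be sketched for a single column in plain Python. This is an illustration of the arithmetic, not the PalmettoBUG implementation: the function names are hypothetical, and the linear-interpolation percentile is an assumption (the source does not say how the quantile is computed).

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0-100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def quantile_scale(column, upper_quantile=99.9):
    """Divide by the upper quantile, then reduce every value above 1 to 1."""
    cap = percentile(sorted(column), upper_quantile)  # assumed to be > 0
    return [min(v / cap, 1.0) for v in column]

def min_max_scale(column):
    """(values - min) / (max - min), assuming the column is not constant."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

channel = [0.0, 1.0, 2.0, 3.0, 100.0]  # one made-up antigen with an outlier
print(quantile_scale(channel, upper_quantile=75))
print(min_max_scale(channel))
```

In this toy column the outlier (100.0) is clipped to 1 by the quantile approach, while min-max scaling lets it compress every other value toward 0, which is the practical difference between the two methods.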
- do_leiden_clustering(seed: int = 1234, marker_class: str = 'type', min_dist: float = 0.1, n_neighbors: int = 15, resolution: int = 1, flavor: str = 'leidenalg', try_from_umap_embedding: bool = False) None
Creates a UMAP from all the cells in the dataset and then performs leiden clustering. An alternative to FlowSOM for clustering cells.
- Args:
- seed (int):
The random seed for all non-deterministic steps in the clustering pipeline.
- marker_class (string):
what channels/antigens to use in the clustering (“type”, “state”, “none”, or all)
- min_dist (float):
used in constructing the umap on which the leiden clustering will be performed.
- n_neighbors (integer):
used in constructing the neighbors graph on which the umap is constructed.
- resolution (integer):
used in the leiden clustering itself. Higher numbers favor the finding of more clusters.
- try_from_umap_embedding (boolean):
if a UMAP of the entire dataset has been previously performed, set this to True to skip the time-consuming steps required for UMAP, and simply use the previously calculated dimensionality reduction. Will not filter for marker_class (assumes that was already done in the creation of the UMAP)
- Returns:
True or False, depending on whether the marker_class chosen exists in the panel
- do_flowsom(marker_class: str = 'type', n_clusters: int = 20, XY_dim: int = 10, rlen: int = 15, scale_within_cells: bool = True, seed: int = 1234) flowsom.FlowSOM
Executes FlowSOM clustering on the data.
- Args:
- marker_class (string):
what antigens / channels to use in clustering (“type”, “state”, “none”, or “All”).
- n_clusters (integer):
The final number of metaclusters that cells will be classified into in the “metaclustering” column. This is achieved by merging the over-clustering produced by the SOM (the values in the “clustering” column) down to this number.
- XY_dim (integer):
This determines the dimensions of the initial grid of the self-organizing map, and thereby the initial number of clusters before merging into metaclusters. Specifically, XY_dim*XY_dim will equal the number of initial points in the grid. (The X & Y dimensions are often allowed to be specified separately; that ability may be restored, but there are few circumstances where different X / Y dimensions would be desirable.)
- rlen (integer):
The number of training iterations. Higher numbers tend to fit the FlowSOM closer to the data / create a more stable FlowSOM output (less variation by random seed). However more training iterations takes more time to run.
- seed (integer):
the random seed for reproducibility of FlowSOM (which is a non-deterministic algorithm)
- Returns:
(FlowSOM) The trained FlowSOM object, useful for accessing the various techniques & visualizations available in the FlowSOM package such as minimum spanning trees, etc.
- _plot_stars_CNs(fs: flowsom.FlowSOM, filename: str | None = None) matplotlib.pyplot.figure
Plots the minimum spanning tree / star plot from the FlowSOM package
- Args:
- fs (flowsom.FlowSOM):
Returned by the self.do_flowsom method.
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- Returns:
a matplotlib.pyplot figure
- do_regions(region_folder: pathlib.Path | str) pandas.DataFrame
(Modified from mode_classify_folder) function to classify cells by the region and sample_id they are in. As in, for every matching image in the mask and region folders, looks at the cells in the mask image – for every cell it will check if that cell lies within a region of the region image (the mode of its pixels lies within a region with value > 0). Then will assign a label to that cell: 0 if outside a region, or {region#}_{image#} if it does lie within a region. These labels are accumulated into a list which is appended to the Analysis Object
- _assign_regions(mask: numpy.ndarray[float | int], region_map: numpy.ndarray[int], image_number: str | int) tuple[numpy.ndarray[float], numpy.ndarray[int], pandas.DataFrame]
This function iterates through two matching-sized numpy arrays (one representing cell masks & one representing region of the image [these regions are also masks, with background pixels of value == 0]), and returns a list of assigned regions to each of the cells (‘0’ if not within a region, and {region#}_{image#} if within a region, such as ‘2_3’ for region 2 of the third image). The image number must be passed into the function.
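The mode-based assignment described above can be sketched in plain Python on nested lists. This is an illustrative stand-in rather than the actual method (which works on numpy arrays and returns a tuple): the function name assign_regions_sketch, the dictionary return type, and the treatment of 0 as background in the cell mask are assumptions.

```python
from collections import Counter

def assign_regions_sketch(mask, region_map, image_number):
    """Label each cell by the most common region value under its pixels."""
    pixels_by_cell = {}
    for mask_row, region_row in zip(mask, region_map):
        for cell, region in zip(mask_row, region_row):
            if cell != 0:  # assumption: 0 is background in the cell mask
                pixels_by_cell.setdefault(cell, []).append(region)
    labels = {}
    for cell, regions in sorted(pixels_by_cell.items()):
        mode = Counter(regions).most_common(1)[0][0]  # ties go to the first value seen
        labels[cell] = "0" if mode == 0 else f"{mode}_{image_number}"
    return labels

mask = [[1, 1, 0],
        [1, 2, 2],
        [0, 2, 2]]
region_map = [[2, 2, 0],
              [0, 0, 0],
              [0, 0, 2]]
print(assign_regions_sketch(mask, region_map, image_number=3))
# {1: '2_3', 2: '0'}
```

Cell 1 has two of its three pixels in region 2, so it is labelled '2_3'; cell 2 has only one of four pixels in a region, so its mode is background and it is labelled '0'.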
- _do_spatial_leiden(n_neighbors: int = 15, resolution: int = 1, random_state: int = 42) None
This function takes the centroid information from regionprops (centroid-0 and centroid-1) and calculates a neighborhood graph / leiden clustering for that. This is similar to the use of leiden on UMAPs, just in this case the input to the UMAP is only the physical X / Y coordinates of the centroids.
Appends the resulting spatial clustering – which is calculated per image – to self.data.obs in the format f”{image number}_{cluster number}”
Uncertain how useful this is, but it is available
- plot_cell_counts(group_by: str = 'sample_id', color_by: str = 'condition', filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots cell counts as a bar plot
- Args:
- group_by (str):
The column in self.data.obs to use to group / divide the bars of the plot.
- color_by (str):
The column in self.data.obs used to color the bars of the plot
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- Returns:
a matplotlib.pyplot figure
- plot_MDS(marker_class: str = 'type', color_by: str = 'condition', print_stat: bool = False, seed: int = 42, filename: str | None = None, **kwargs) tuple[matplotlib.pyplot.figure, pandas.DataFrame]
Plots an MDS embedding of the sample_ids in the dataset as a scatterplot, only using the antigens with marker_class [marker_class] in the panel and colored by [color_by]
- Args:
- marker_class (str):
Either “type”, “state”, “none”, or “All”. Which antigens (see self.data.var) to use to calculate & create the MDS plot
- color_by (str):
which column in self.data.obs to use to color the samples.
- print_stat (bool):
whether to export the MDS embedding (True) to self.data_table_dir or not (False, default)
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- kwargs:
are passed to seaborn.scatterplot()
- Returns:
a matplotlib.pyplot figure and a pandas dataframe
- plot_NRS(marker_class: str = 'type', filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots the non-redundancy scores of each antigen in the category specified in [marker_class] as boxplots, with the distribution deriving from the NRS scores from each sample_id
- Args:
- marker_class (str):
Either “type”, “state”, “none”, or “All”. Which antigens (see self.data.var) to use to calculate the NRS and plot
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- kwargs:
are passed to seaborn.boxplot()
- Returns:
a matplotlib.pyplot figure
- plot_ROI_histograms(color_by: str = 'condition', marker_class: str = 'All', suptitle: bool = True, filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plot kde-smoothed histograms of each antigen / channel’s expression, with separate lines for separate ROIs, colored by [color_by]
- Args:
- color_by (str):
which column in self.data.obs to color the histogram tracings by
- marker_class (str):
Either “type”, “state”, “none”, or “All”. Which antigens (see self.data.var) to display in the plot
- suptitle (bool):
whether to attempt to add a title to the plot automatically.
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- kwargs:
are passed to matplotlib.pyplot.axis.plot()
- Returns:
a matplotlib.pyplot figure
- do_UMAP(marker_class: str = 'type', cell_number: int = 1000, seed: int = 0, n_neighbors: int = 15, min_dist: float = 0.1, **kwargs) None
Perform the calculations for a UMAP embedding.
- Args:
- marker_class (string):
none, type, state, or ALL >> what markers/antigens to use in the UMAP algorithm
- cell_number (integer):
The downsampling number. No more than this number of cells will be randomly taken from each sample_id in the process of downsampling
- seed (integer):
The random seed used for reproducibility in downsampling and running the UMAP
- kwargs:
passed as kwargs into scanpy.tl.umap() call
- do_PCA(marker_class: str = 'type', cell_number: int = 1000, seed: int = 0) None
Perform the calculations for a PCA embedding.
- Args:
- marker_class (string):
none, type, state, or ALL >> what markers/antigens to use in the PCA algorithm
- cell_number (integer):
The downsampling number. No more than this number of cells will be randomly taken from each sample_id in the process of downsampling
- seed (integer):
The random seed used for reproducibility in downsampling and running the PCA
- _downsample_for_UMAP(anndata_in: anndata.AnnData, max_number: int = 1000, seed: int = 42) anndata.AnnData
Helper for do_UMAP and do_PCA methods, performs the downsampling of the data, where no more than the supplied max_number of cells will be randomly sampled from each sample_id of the anndata_in object. Returns the downsampled data as an anndata object.
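The per-sample downsampling described above can be sketched with the standard library. This is a minimal illustration working on a plain list of sample_id labels rather than an anndata object; the function name downsample_indices is hypothetical and the real helper may sample differently.

```python
import random
from collections import defaultdict

def downsample_indices(sample_ids, max_number=1000, seed=42):
    """Keep at most max_number randomly chosen cell indices per sample_id."""
    rng = random.Random(seed)  # seeded generator, so the result is reproducible
    by_sample = defaultdict(list)
    for index, sample_id in enumerate(sample_ids):
        by_sample[sample_id].append(index)
    kept = []
    for indices in by_sample.values():
        if len(indices) > max_number:
            indices = rng.sample(indices, max_number)
        kept.extend(indices)
    return sorted(kept)

sample_ids = ["0"] * 5 + ["1"] * 2   # 5 cells in sample "0", 2 cells in sample "1"
kept = downsample_indices(sample_ids, max_number=3, seed=42)
print(kept)  # 3 indices from sample "0", plus both cells of sample "1"
```

Samples with fewer cells than the limit are kept whole, so small ROIs are not penalized while large ROIs no longer dominate the embedding.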
- plot_scatter(antigen1: str, antigen2: str, hue: str | None = None, filename: str | None = None, size: int | float = 1, alpha: int | float = 0.5, **kwargs) matplotlib.pyplot.figure
Makes a scatterplot of [antigen1] vs. [antigen2], colored by [hue]. Will write a png file from the plot to self.save_dir if filename is not None.
- Args:
- antigen1 (str):
The antigen (in self.data.var[‘antigen’]) to plot along the x-axis of the plot
- antigen2 (str):
The antigen (in self.data.var[‘antigen’]) to plot along the y-axis of the plot
- hue (str):
If not None, either in self.data.var[‘antigen’], self.data.obs.columns, or “Density”. If None, then no color applied to points in the scatter. If in self.data.var[‘antigen’], points will be colored by the expression of the provided antigen. If in self.data.obs.columns, points will be colored by category in that column. If ‘Density’, will attempt to color the plot based on the density of points at that location on the plot
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- size (numeric):
the size of the points in the plot
- alpha (numeric between 0-1):
the transparency of points in the plot. Numbers closer to 1 mean less transparent points, and vice versa
- kwargs:
are passed to seaborn.scatterplot()
- Returns:
a matplotlib.pyplot figure
- plot_UMAP(color_by: None | str = 'metaclustering', palette=None, filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a UMAP embedding as a scatterplot, colored by [color_by]. Primarily a wrapper on the scanpy.pl.umap() method. See that method’s information for more details: https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pl.umap.html
- Args:
- color_by (str or None):
Either: 1). what column in self.data.obs to color the UMAP cells by 2). what antigen in self.data.var[‘antigen’] to color the UMAP by, or 3). None to have no coloring of points
- palette:
how to color the points. See matplotlib colormaps, or the scanpy link above for more details. Example: ‘tab20’ is a colormap that can be good for plots using a categorical variable (one of self.data.obs columns) to color the cells, while ‘viridis’ or ‘coolwarm’ can be good for continuous variable (one of self.data.var[‘antigen’].unique())
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- kwargs:
are passed to scanpy.pl.umap()
- Returns:
a matplotlib.pyplot figure
- plot_PCA(color_by: str = 'metaclustering', palette=None, filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a PCA embedding as a scatterplot, colored by [color_by]. Primarily a wrapper on the scanpy.pl.umap() method. Even though PCA does not use a scanpy function, self.PCA_embedding is set up in such a way that scanpy.pl.umap() can be used to plot it. See that method’s information for more details: https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pl.umap.html
- Args:
- color_by (str or None):
Either: 1). what column in self.data.obs to color the PCA cells by 2). what antigen in self.data.var[‘antigen’] to color the PCA by, or 3). None to have no coloring of points
- palette:
how to color the points. See matplotlib colormaps, or the scanpy link above for more details. Example: ‘tab20’ is a colormap that can be good for plots using a categorical variable (one of self.data.obs columns) to color the cells, while ‘viridis’ or ‘coolwarm’ can be good for continuous variable (one of self.data.var[‘antigen’].unique())
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- kwargs:
are passed to scanpy.pl.umap()
- Returns:
a matplotlib.pyplot figure
- plot_facetted_DR_by_antigen(marker_class: list = ['type', 'state'], kind: str = 'UMAP', suptitle: bool = True, number_of_columns: int = 3, filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Like the plot_facetted_DR method below, but specific to when you want to facet by the antigens & color each facet by the respective antigen. Notably, this method does not take a color / hue parameter, nor does it need a facetting column, as the assumption of this function being called is that the antigens are being used for both.
- Args:
- marker_class (list of str):
A list of the valid marker_class values in the analysis (the values in self.data.var[‘marker_class’], or “All”: “All”, “none”, “type”, “state”, “spatial_edt”, …). For each of the marker_classes listed, the antigens of that class will be included in the final UMAP. This is inclusive, so if “All” is in the list it doesn’t matter what other classes are listed: every antigen will be used. Default is [‘type’, ‘state’] so that all except ‘none’ antigens will be displayed in most cases
- kind (str):
“umap” or “pca” – which type of dimensionality reduction is to be used.
- suptitle (bool):
whether to attempt to automatically place a title on the whole plot (instead of only each facet getting a title). Note that this suptitle can frequently be oddly placed, since the number of facets in the plot changes where the title would most comfortably sit.
- number_of_columns (integer):
How many columns to have in the grid of the plot. The number of rows is automatically determined from this number and the number of facets required to plot every antigen.
- filename (str or None):
if not None, then the filename to save the plot under (as a png) in the self.save_dir folder
- kwargs:
are passed to matplotlib.pyplot.axis.scatter()
- Returns:
a matplotlib.pyplot figure. Note that, unlike the subsequent facetted DR method, this plot will contain EVERY cell in the embedding in EVERY facet, just the color applied to the points in each plot facet will be different.
- plot_facetted_DR(color_by: str, subsetting_column: str, kind: str = 'UMAP', suptitle: bool = True, number_of_columns: int = 3, color_bank: list[str] | None = None, filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a dimensionality reduction embedding (kind = ‘PCA’ or ‘UMAP’), facetted by the supplied [subsetting_column], with each facet colored by the supplied [color_by].
- Args:
- color_by (str):
what column in self.data.obs, or which antigen name (values in self.data.var[‘antigen’], used to select the expression data in matching column) to color the scatter plots in each facet with.
- subsetting_column (str):
the column in self.data.obs on which to facet the plot. For every unique value in this column, a separate UMAP plot will be created, containing only the cells with that unique value displayed. Additionally, the first plot in the facet grid will always be a dimensionality reduction plot containing all the cells. NOTE: the plots will all have the same DR embedding, as dimensionality reduction IS NOT RUN for each subset of cells; instead, the last DR (on the whole / downsampled data) of the proper kind will be the embedding used for every cell
- kind (str):
“UMAP” or “PCA” – which kind of dimensionality reduction to attempt to plot
- suptitle (bool):
whether or not to include an automatically generated title. Default = True, but you may want to set this to False if the suptitle is being placed in the wrong location on the plot (as happens when there are a very large number of subplots)
- number_of_columns (int):
how many columns to use in the figure’s grid. The number of rows will be automatically determined from this. If the number of total panels or rows is too low, this number may be reduced automatically.
- color_bank (list of strings or None):
a list of strings representing colors that can be recognized by matplotlib.Patch, used to determine the colors on the plot for each group in the color_by column, ONLY if color_by is in self.data.obs.columns and NOT if color_by is in self.data.var[‘antigen’]
- filename (str or None):
If not None, then will attempt to write the figure produced to self.save_dir/{filename} INCLUDE the file extension in this string! (usually .png)
- kwargs:
keyword arguments passed on to each matplotlib.axis.scatter() call.
- Returns:
a matplotlib figure, the facetted plot of UMAPs. The first UMAP is always the un-facetted (all data together) UMAP
- Inputs / Outputs:
- Outputs:
if filename is provided, then the matplotlib figure will also be written as a file to – self.save_dir/{filename}
- plot_medians_heatmap(marker_class: str = 'type', groupby: str = 'metaclustering', scale_axis: None | int = 0, subset_df: pandas.DataFrame = None, subset_obs: pandas.DataFrame = None, colormap='coolwarm', figsize: tuple[int | float, int | float] = (10, 10), filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a heatmap in a manner similar to CATALYST by first taking the median of each channel in each category of [groupby] column, then %quantile normalizing the medians from 1%-99% across the antigens.
- Args:
- filename (string):
the filename for exported heatmap
- marker_class (string):
none, type, state, or All >>> what markers / antigens to use in the heatmap
- groupby (string):
a name of a column in self.data.obs to group the data by (usually ‘metaclustering’, ’clustering’, ’merging’, ‘classification’, ’leiden’). groupby can be:
“metaclustering” –> cluster heatmap; “sample_id” –> heatmap by ROI; “merging” / etc. –> heatmap by arbitrary column in self.data.obs
- scale_axis (integer or None):
Either None, 0 or 1 -> Which axis of the final median array to scale along before plotting. Default is 0, to scale within antigens. (0 –> scale within antigen, 1 –> scale within groupby categories, None –> scale medians across the entire array)
- subset_df (pandas DataFrame or None):
a dataframe equivalent to self.data.X with column names = self.data.var.index. This allows custom / transformed / subsetted data to be introduced into this plotting method without needing to edit / transform self.data directly. If None, then self.data will be used to create the plot and the subset_obs argument will be ignored. Requires a paired subset_obs dataframe.
- subset_obs (pandas dataframe or None):
an equivalent to self.data.obs, paired with subset_df argument
- figsize (tuple of numerics):
X / Y dimension sizes of the plot
- kwargs:
passed in seaborn.clustermap() call
- Returns:
a matplotlib.pyplot figure
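The aggregation described for plot_medians_heatmap can be approximated outside the class. The following is a minimal sketch, not PalmettoBUG's actual implementation: the helper name `scaled_medians`, the toy data, and the exact scaling rule (clip to the 1% and 99% quantiles, then rescale to the 0-1 range) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def scaled_medians(exprs: pd.DataFrame, groups: pd.Series, scale_axis=0) -> pd.DataFrame:
    """Hypothetical helper: per-group medians, rescaled between the 1% and 99% quantiles."""
    # rows = categories of the grouping column, columns = antigens
    medians = exprs.groupby(groups.to_numpy()).median()
    values = medians.to_numpy(dtype=float)
    # scale_axis=0 scales within each antigen, 1 within each group, None across the whole array
    low = np.quantile(values, 0.01, axis=scale_axis, keepdims=True)
    high = np.quantile(values, 0.99, axis=scale_axis, keepdims=True)
    span = np.where(high > low, high - low, 1.0)  # guard against division by zero
    scaled = np.clip((values - low) / span, 0.0, 1.0)
    return pd.DataFrame(scaled, index=medians.index, columns=medians.columns)


# toy stand-ins for self.data.X (as a dataframe) and self.data.obs['metaclustering']
exprs = pd.DataFrame({
    "CD3": [0.1, 0.3, 2.0, 2.2, 4.0, 4.2],
    "CD20": [3.0, 3.2, 0.2, 0.4, 1.0, 1.2],
})
clusters = pd.Series(["1", "1", "2", "2", "3", "3"])
heatmap_values = scaled_medians(exprs, clusters, scale_axis=0)
```

A table like `heatmap_values` is the kind of input that a call such as seaborn.clustermap() would then draw.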
- _plot_facetted_heatmap(filename: str, subsetting_column: str, groupby_column: str = 'metaclustering', marker_class: str = 'type', number_of_columns: int = 3, suptitle: bool = True, **kwargs) str
Calls plot_medians_heatmap iteratively to plot a facetted heatmap, facetted on the unique categories in [subsetting_column].
Unique in that this function only exports an .SVG file to the disk and returns only the path to that file (it does not return the plot like the other functions)
This function is old and not well-tested / supported, so it may have errors! It also depends on svg_stack, which is no longer a mandatory dependency of PalmettoBUG
- do_cluster_merging(file_path: str | pathlib.Path, groupby_column: str = 'metaclustering', output_column: str = 'merging') None
Creates a “merging” column inside self.data.obs by merging & annotating an existing column in self.data.obs [groupby_column]
- Args:
- file_path (str):
The full file path to a .csv file. This csv file will be read-in as a pandas dataframe. This dataframe is expected to have at least two columns:
– “original_cluster”
– “new_cluster”
- groupby_column (str):
the name of the column in self.data.obs whose values are being merged / annotated. The unique values in this column should correspond to the values in the ‘original_cluster’ column of the read-in dataframe described above. Usually, this is either “metaclustering” or “leiden” but it does not have to be
- output_column (str):
the name of a new column that will be inserted into self.data.obs. This column will contain the annotated / assigned values from the read-in dataframe. As in, the “original_cluster” values in groupby_column will be replaced with their corresponding “new_cluster” values and the new column added as self.data.obs[output_column]
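The replacement described above amounts to a lookup from “original_cluster” to “new_cluster”. A minimal sketch with pandas follows; the in-memory csv, the toy `obs` dataframe, and the casting of labels to strings are assumptions made for illustration, not a copy of the method's internals.

```python
import io

import pandas as pd

# stand-in for the merging .csv that would normally be read from file_path
merging_csv = io.StringIO("original_cluster,new_cluster\n1,T cell\n2,B cell\n3,T cell\n")
merging_table = pd.read_csv(merging_csv)

# stand-in for self.data.obs with an existing 'metaclustering' column
obs = pd.DataFrame({"metaclustering": ["1", "2", "3", "1"]})

# cast both sides to strings so integer-like cluster labels match (an assumption)
lookup = dict(zip(merging_table["original_cluster"].astype(str), merging_table["new_cluster"]))
obs["merging"] = obs["metaclustering"].astype(str).map(lookup)
```

Clusters 1 and 3 both map to the same label here, which is how several clusters get merged into one annotated cell type.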
- plot_cluster_distributions(groupby_column: str = 'metaclustering', marker_class: str = 'type', plot_type: str = 'violin', comp_type: str = 'raw', filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plot the distribution of marker expression within groups of cells. Violin or bar plots.
- Args:
- groupby_column (string):
The column in self.data.obs to group the cells by (usually a way of identifying cell types, like metaclustering or merging, but can be a different grouping like sample_id)
- marker_class (string, “All”,”type”,”state”, or “none”):
what type of antigen to include in the plot
- plot_type (string, “violin” or “bar”):
whether to plot a violin or bar plot
- comp_type (string, “vs” or “raw”):
whether to display the raw values of marker expression or to display the difference between each cluster and the rest of the dataset. As in, if == “vs”, then the data for each cluster will have the mean expression of the rest of the clusters subtracted from it before plotting.
- filename (string, or None):
If not None, the name of the .png file to be saved in experiment.save_dir
- kwargs:
passed into seaborn.catplot() call
- Returns:
a matplotlib figure
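The “vs” comparison can be illustrated with a few lines of pandas. This is a sketch under assumptions: the helper name `versus_rest` and the toy data are hypothetical, and the method's own calculation may differ in detail.

```python
import pandas as pd


def versus_rest(exprs: pd.DataFrame, clusters: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: subtract the mean of all other clusters from each cluster's cells."""
    pieces = []
    for cluster in clusters.unique():
        in_cluster = (clusters == cluster).to_numpy()
        rest_mean = exprs[~in_cluster].mean()  # per-antigen mean of every other cluster
        pieces.append(exprs[in_cluster] - rest_mean)
    return pd.concat(pieces).sort_index()


exprs = pd.DataFrame({"CD3": [1.0, 3.0, 5.0, 7.0]})
clusters = pd.Series(["A", "A", "B", "B"])
vs_rest = versus_rest(exprs, clusters)
```

In this toy case cluster A is compared against the mean of cluster B (6.0), and cluster B against the mean of cluster A (2.0).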
- plot_cluster_histograms(antigen: str, groupby_column: str = 'metaclustering', filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a KDE-smoothed histogram of a particular marker / antigen’s expression across all the clusters in the supplied [groupby_column] column
- Args:
- antigen (str):
one of the values in self.data.var[‘antigen’]. Determines which antigen in the dataset to plot
- groupby_column (string):
The column in self.data.obs to group the cells by (usually a way of identifying cell types, like metaclustering or merging, but can be a different grouping like sample_id). Creates facets of the plot
- filename (string, or None):
If not None, the name of the .png file to be saved in experiment.save_dir
- kwargs:
passed into matplotlib.pyplot.axis.plot() for each facet of the plot
- Returns:
a matplotlib figure
- plot_cluster_abundance_1(groupby_column: str = 'metaclustering', bars_by: str = 'sample_id', number_of_columns: int = 3, filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a stacked barplot (where the stacks all add up to 1) of the ratios of each cell type from the supplied [groupby_column] column in each sample_id, facetted by condition.
- Args:
- groupby_column (str):
The name of a column in self.data.obs to divide the stacks of the barplot by. NOTE: the bars of the barplot are ALWAYS separated by self.data.obs[‘sample_id’]
and the plot is ALWAYS facetted into multiple panels on self.data.obs[‘condition’]
- number_of_columns (integer):
How many columns in the plot / when to wrap the facets of the plot. For example, if your dataset has 5 conditions, and you supply a value == 3 here, then the first three conditions will be plotted in the first row and the remaining two conditions will be plotted in the second row.
- filename (string, or None):
If not None, the name of the .png file to be saved in experiment.save_dir
- kwargs:
passed into seaborn.objects.Plot()
- Returns:
a matplotlib figure
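The stacks that “add up to 1” are per-sample proportions of each cell type. A minimal sketch of that calculation, using a toy `obs` dataframe as a stand-in for self.data.obs (the data are made up for illustration):

```python
import pandas as pd

# toy stand-in for self.data.obs
obs = pd.DataFrame({
    "sample_id": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "metaclustering": ["1", "1", "2", "3", "2", "2"],
})

# rows = sample_id, columns = cell type; each row is normalized to sum to 1
proportions = pd.crosstab(obs["sample_id"], obs["metaclustering"], normalize="index")
```

Each row of `proportions` corresponds to one stacked bar of the plot.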
- plot_cluster_abundance_2(groupby_column: str = 'metaclustering', N_column: str = 'sample_id', hue: str = 'condition', plot_type: str = 'barplot', filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots the abundance of each cell type (from the supplied [groupby_column] column in self.data.obs) in each sample_id as a bar, box, or strip plot (with plot_type == “barplot”, “boxplot”, or “stripplot”).
Separate bars / boxes / strips are made for each condition in the supplied [hue] column to allow comparisons.
- Args:
- groupby_column (str):
The name of a column in self.data.obs to facet the bar / box / strip plot into multiple panels
- N_column (str):
The name of the column in self.data.obs that determines what individual units compose the distribution of the boxplot. This function does not do a statistical test, but the groups of this column would correspond to the N used to determine variance / degrees of freedom in a t-test. NOTE: a key assumption is that the categories in this column are NEVER shared between hue categories. This holds for the defaults (each unique ROI / sample_id can only have one condition assigned to it) but must also be true for any alternate column used.
- hue (str):
The name of a column in self.data.obs to separate & color columns of the plots by
- plot_type (str):
either “barplot”, “boxplot”, or “stripplot”. Determines which type of plot to use on each sub-panel.
- filename (string, or None):
If not None, the name of the .png file to be saved in experiment.save_dir
- kwargs:
passed into seaborn.{bar/box/strip}plot()
- Returns:
a matplotlib figure
- do_cluster_stats(groupby_column: str = 'metaclustering', N_column: str = 'sample_id', marker_class: str = 'type') dict[str | int, pandas.DataFrame]
Calculates statistics by pairwise ANOVAs (effectively a t-test) between each cluster’s marker expression and the marker expression of the rest of the dataset. Instead of using all the cells individually, an average is taken of each sample_id
- Args:
- groupby_column (string):
The column in the self.data.obs dataframe to group the cells by, for making comparisons between the unique values in this column (usually a cell type column, like “metaclustering”, but could be something else, like condition or sample_id)
- N_column (string):
The column in self.data.obs that determines the “N” for the statistical test (data is aggregated by this before the test and it helps determine what the degrees of freedom are in the test.) NOTE: unlike other instances of N_column in palmettobug functions, it is possible for groups within this column to be shared within the conditions, as the comparison of interest is usually on the cell type level, not between conditions.
- marker_class (string == “All”, “type”, “state”, or “none” ):
what markers to include in the comparison. Usually “type”, should typically match the markers used to generate the cell clustering / groupby being compared.
- Returns:
a dictionary with keys = unique values of the groupby_column, and values = pandas dataframes containing the statistics for that groupby_column value. This dictionary is also saved as self.df_out_dict, from which it is accessed by the self.plot_cluster_stats method
- plot_cluster_stats(statistic: str = 'FDR_corrected', filename: str | None = None, **kwargs) matplotlib.pyplot.figure
Plots a heatmap from the cluster statistics calculated with the method self.do_cluster_stats.
- Args:
- statistic (str):
which column of the output of self.do_cluster_stats() (aka, self.df_out_dict) to plot. Can be “F_statistic”, “p_values”, or “FDR_corrected”. p-value stats will be transformed by -log(stat) before plotting, so that higher values correspond with greater significance
- filename (str, or None):
if not None, will determine the filename of the plot saved to self.save_dir
- Returns:
a matplotlib.pyplot figure
- do_abundance_ANOVAs(groupby_column: str = 'merging', variable: str = 'condition', N_column: str = 'sample_id', conditions: list[str] = [], filename: str | None = None) pandas.DataFrame
Performs pairwise ANOVA tests (or effectively, a t-test) between two provided conditions in the self.data.obs[‘condition’] column, looking at the abundance (as % in each sample_id) of the cell types specified in the groupby_column.
- Args:
- groupby_column (str):
The column in self.data.obs where the cell type information is contained
- variable (str):
The column in self.data.obs where the independent variable information is found (default = ‘condition’)
- N_column (str):
The column in self.data.obs that determines the aggregation (and downstream from this, the degrees of freedom) for the statistical test. NOTE: a key assumption is that the categories in this column are NEVER shared between conditions – aggregation on this column is done BEFORE comparison of conditions. This holds for the defaults (each unique ROI / sample_id can only have one condition assigned to it) but must also be true for any alternate column used.
- conditions (list of strings or empty list):
list of unique values in self.data.obs[variable] to be compared by ANOVA. If the list is empty (the default), then will perform an ANOVA test on all the conditions in the dataset.
- filename (str or None):
if not None, determines the filename that the output dataframe will be saved to inside the self.data_table_dir folder.
- Returns:
(pandas dataframe) representing the statistics calculated by this function
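The aggregate-then-test idea can be sketched as follows. This is an illustration under assumptions, not the method's actual code: the toy data are made up, and scipy.stats.f_oneway is used here as one way to run a one-way ANOVA (the source does not state which library the method uses).

```python
import pandas as pd
from scipy.stats import f_oneway

# toy stand-in for self.data.obs: four samples, two conditions, two cell types
obs = pd.DataFrame({
    "sample_id": ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4 + ["s4"] * 4,
    "condition": ["ctrl"] * 8 + ["treated"] * 8,
    "merging": ["T", "T", "B", "B", "T", "T", "T", "B",
                "T", "B", "B", "B", "B", "B", "B", "B"],
})

# abundance of each cell type as a % of the cells in each sample_id
percent = pd.crosstab(obs["sample_id"], obs["merging"], normalize="index") * 100

# each sample_id must belong to exactly one condition (the key assumption noted above)
sample_condition = obs.drop_duplicates("sample_id").set_index("sample_id")["condition"]

# compare the per-sample % of T cells between the two conditions
groups = [
    percent.loc[sample_condition[sample_condition == condition].index, "T"]
    for condition in ["ctrl", "treated"]
]
f_stat, p_value = f_oneway(*groups)
```

Because the test runs on per-sample percentages, the N here is 4 (the number of sample_ids), not 16 (the number of cells).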
- do_count_GLM(conditions: list[str], variable: str = 'condition', groupby_column: str = 'merging', N_column: str = 'sample_id', family: str = 'Poisson', filename: str | None = None) pandas.DataFrame
Performs a statistical test on cell abundance / cell count using generalized linear models.
Cell counts are taken for each sample_id in each condition and then those aggregated per-sample_id numbers are used in the GLM.
- Args:
- conditions (list of strings):
conditions to compare. In GUI, either pairwise or all possible conditions at once are compared.
- variable (string):
the column in self.data.obs that will be treated as the independent variable for the test. Almost always ‘condition’
- groupby_column (string):
the column in self.data.obs that contains the cell type information from which counts / abundance will be calculated
- N_column (string):
the column in self.data.obs that contains the replication N grouping (data is aggregated by this grouping before the statistical test, and relates to the number of degrees of freedom in the test). Usually only sample_id or patient_id. NOTE: a key assumption is that the categories in this column are NEVER shared between conditions – aggregation on this column is done BEFORE comparison of conditions. This holds for the defaults (each unique ROI / sample_id can only have one condition assigned to it) but must also be true for any alternate column used.
- family (string – “Poisson”, “NegativeBinomial”):
The distribution to use in the GLM. Can be “Poisson” or “NegativeBinomial”. Other distributions, such as “Gaussian” and “Binomial” are not recommended or not currently configured properly.
- filename (string or None):
the filename for the csv exported into self.data_table_dir. If None, no such file is exported
- Returns:
pandas dataframe: Summary statistics from the results of the model
- Inputs/Outputs:
- Outputs:
If filename is provided (is not None), then exports the summary statistic table to self.data_table_dir/filename.csv
- plot_state_distributions(marker_class: str = 'state', subset_column: str = 'merging', colorby: str = 'condition', N_column: str = 'sample_id', grouping_stat: str = 'median', wrap_col: int = 3, suptitle: bool = False, figsize: tuple[int | float, int | float] = None, filename: None | str = None) matplotlib.pyplot.figure
Plots a facetted boxplot of the expression of a specified marker_class (usually ‘state’), split into various cell groupings (subset_column, usually ‘merging’) per panel, comparing on colorby (usually ‘condition’). Aggregates within each sub-group first by N_column (usually ‘sample_id’) using the aggregation statistic specified in grouping_stat, so that the boxplots aren’t overwhelmed trying to plot thousands of individual cells.
- Args:
- marker_class (string):
What marker_class of antigens to use in the plot. Either ‘type’,’state’ (default), ‘None’, or ‘All’.
- subset_column (string):
The name of a categorical column in self.data.obs to group the cells by. These groupings will constitute the panels of the final plot.
- colorby (string):
The name of a categorical column in self.data.obs to group the cells by, typically ‘condition’. These groups will define how the boxplots in each panel are colored.
- N_column (string):
The name of a categorical column in self.data.obs to group the cells by, typically ‘sample_id’. It is recommended to not change this as errors / strange looking plots are likely with any other value. It specifies how the data is aggregated before plotting, as plotting every cell for a large dataset is likely to make the boxplot too confusing, as there can be far too many outlier points on the plot. NOTE: a key assumption is that the categories in this column are NEVER shared between conditions – aggregation on this column is done BEFORE comparison of conditions. This holds for the defaults (each unique ROI / sample_id can only have one condition assigned to it) but must also be true for any alternate column used.
- grouping_stat (string):
How to aggregate the data using the N_column parameter – as in, take the ‘mean’ of the sample_id’s or the ‘median’ before plotting?
- wrap_col (integer):
how many panels to place in each row of the facetted plot before wrapping and starting a new row of boxplots
- suptitle (boolean):
whether to include an automatically generated title at the top of the boxplot or not
- figsize (tuple of two numerics):
The size, in inches, of the final plot’s dimensions. Used in the matplotlib.pyplot.subplots() function
- filename (None, or string):
If not None, then this method will write the plot as a .png file to the folder specified by self.save_dir using the provided filename. This filename should not include the file extension (the extension is always .png, and is automatically supplied by this method). If None, then the figure is not written to the hard drive.
- Returns:
matplotlib.pyplot figure
- Inputs/Outputs:
- Outputs:
If filename is provided (is not None), then exports the figure as a .png file
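The aggregation step that precedes the boxplots can be sketched with a pandas groupby. The toy `cells` table and the antigen name `pS6` are hypothetical stand-ins for self.data.obs joined with the expression values.

```python
import pandas as pd

# toy stand-in: one row per cell, with annotations and one state marker
cells = pd.DataFrame({
    "sample_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "condition": ["ctrl"] * 3 + ["treated"] * 3,
    "merging": ["T"] * 6,
    "pS6": [0.1, 0.2, 0.9, 1.0, 1.5, 2.0],
})

grouping_stat = "median"  # or "mean"
# one value per (cell group, condition, sample_id) instead of one value per cell
per_sample = (
    cells.groupby(["merging", "condition", "sample_id"])["pS6"]
    .agg(grouping_stat)
    .reset_index()
)
```

Six cells collapse to two points here, one per sample_id, which is what keeps the boxplots readable for large datasets.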
- plot_state_p_value_heatmap(stats_df: None | pandas.DataFrame = None, top_n: int = 50, heatmap_x: list[str] = ['condition', 'sample_id'], ANOVA_kwargs: dict = {}, include_p: bool = True, figsize: tuple[int | float, int | float] = (10, 10), filename=None) matplotlib.pyplot.figure
Plots a heatmap of the most significant differences found with the self.do_state_exprs_ANOVAs() method
Presumes the supplied stats_df matches the format exported by the self.do_state_exprs_ANOVAs() method, including structure, columns, rank ordering by F-statistic top-to-bottom, etc.
- Args:
- stats_df (None, or a pandas dataframe):
A pandas dataframe with marker expression statistics – the returned output of the self.do_state_exprs_ANOVAs() method. If None, then stats_df = self.do_state_exprs_ANOVAs(**ANOVA_kwargs) will be run to generate the statistics dataframe.
- top_n (integer):
How many of the top (order by F-statistic) antigen expression changes to plot on the heatmap. Default = 50
- ANOVA_kwargs (dictionary):
Only used if stats_df is None. Provides the parameters of self.do_state_exprs_ANOVAs method, as in stats_df = self.do_state_exprs_ANOVAs(**ANOVA_kwargs) will be run first before the heatmap is generated from the stats_df.
- heatmap_x (list of strings):
A list of column names in self.data.obs that determine how the data will be grouped for the x-axis of the heatmap. The values of the heatmap tiles are the median expression of the antigen of interest in these groups. NOTE: The y-axis of the heatmap is already determined by antigen/cellgrouping pairs in stats_df, and if the cell grouping used to calculate statistics was not the entire dataset, then it will also be used in grouping the data to calculate medians for the heatmap, along with the columns specified by this parameter.
As in, let’s say state marker statistics were calculated between cell types defined in a ‘merging’ column, while heatmap_x was set to be [‘condition’,’sample_id’] (the default) to group the data by each ROI, along with its treatment label –> Then, on the heatmap, the data will be grouped by sample_id, condition, and merging – then the median taken of those groups and plotted on the heatmap.
- include_p (boolean):
whether to include an additional column of the heatmap for the p-values associated with each row of the statistics calculated in the stats_df. This column’s values do not come from the groupings explained above, but directly from the adjusted p-values of the statistics table, transformed as follows:
heatmap_value = -Log(adj_p_value)
NOTE that the negative log (base 10) of 0.05 is ~1.3.
- figsize (tuple of 2 numerics):
The dimensions of the final plot, in inches. Used in the matplotlib.pyplot.subplots call
- filename (None, or string):
If not None, then this method will write the plot as a .png file to the folder specified by self.save_dir using the provided filename. This filename should not include the file extension (the extension is always .png, and is automatically supplied by this method). If None, then the figure is not written to the hard drive.
- Returns:
matplotlib.pyplot figure
- Inputs/Outputs:
- Outputs:
If filename is provided (is not None), then exports the figure as a .png file
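The p-value transform used for the include_p column can be checked numerically. This sketch assumes a base-10 logarithm, which is consistent with the note that the negative log of 0.05 is ~1.3; the p-values themselves are made up.

```python
import numpy as np

# hypothetical adjusted p-values from a statistics table
adjusted_p = np.array([0.05, 0.01, 0.001])

# higher values correspond to greater significance
heatmap_values = -np.log10(adjusted_p)
```

With this transform, any tile above roughly 1.3 corresponds to an adjusted p-value below 0.05.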
- do_state_exprs_ANOVAs(marker_class: str = 'state', groupby_column: str = 'merging', variable: str = 'condition', N_column: str = 'sample_id', statistic: str = 'mean', test: str = 'anova', conditions: list[str] = [], filename: str | None = None) pandas.DataFrame
Performs statistical comparison of marker expression within cell types between conditions using ANOVA
Aggregates marker expression using mean or median within each unique [sample_id + groupby] column combination and then compares across conditions
- Args:
- marker_class (str):
one of – “All”, “type”, “state”, “none” – determines what markers are compared. See: self.data.var[‘marker_class’] or the Analysis_panel file
- groupby_column (str):
the column title of the cell type column in self.data.obs. Usually a string, but theoretically could be any allowed in a pandas dataframe column title.
- variable (str):
the column of self.data.obs containing the independent variable / condition. Default = ‘condition’
- N_column (str):
the column of self.data.obs that carries the experimental unit, i.e., the data will be aggregated based on this column to construct the distributions of the final statistical comparison, and the number of degrees of freedom in the test could be described as:
degrees_of_freedom = len(self.data.obs[N_column].unique()) - len(self.data.obs[variable].unique())
As in, N minus the number of conditions being compared. NOTE: a key assumption is that the categories in this column are NEVER shared between conditions – aggregation on this column is done BEFORE comparison of conditions. This holds for the defaults (each unique ROI / sample_id can only have one condition assigned to it) but must also be true for any alternate column used.
- statistic (str):
one of – “mean”, “median” – which aggregation statistic to use
- test (str):
‘anova’ or ‘kruskal’ – The statistical test to perform (ANOVA or Kruskal-Wallis)
- conditions(list of str):
if empty (default), will use all the unique conditions in self.data.obs[variable]. Otherwise, will only compare the conditions in this list – values in this list should be values in self.data.obs[variable], values not in this list will be ignored.
- filename (str or None):
If not None, the name for saving the output datatable as a csv file in self.data_table_dir
- Returns:
(pandas dataframe) the summary statistics of the ANOVA tests
- Inputs/Outputs:
- Outputs:
If filename is provided (is not None), then exports the summary statistic table to self.data_table_dir/filename.csv
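The degrees-of-freedom expression given for N_column, and the assumption that no N_column category is shared between conditions, can both be checked on a toy table. The `obs` dataframe below is a made-up stand-in for self.data.obs.

```python
import pandas as pd

# toy stand-in for self.data.obs (one row per cell)
obs = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s3", "s3", "s4"],
    "condition": ["ctrl", "ctrl", "ctrl", "treated", "treated", "treated"],
})

N_column, variable = "sample_id", "condition"

# N minus the number of conditions, as in the expression above
degrees_of_freedom = len(obs[N_column].unique()) - len(obs[variable].unique())

# the key assumption: every N_column category maps to exactly one condition
conditions_per_unit = obs.groupby(N_column)[variable].nunique()
assumption_holds = bool((conditions_per_unit == 1).all())
```

Running a check like `assumption_holds` before choosing an alternate N_column (such as a patient identifier) is a cheap way to catch a violated assumption.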
- export_data(subset_columns: list[str] | None = None, subset_types: list[list[str]] | None = None, groupby_columns: list[str] | None = None, statistic: str = 'mean', groupby_nan_handling: str = 'zero', include_marker_class_row: bool = False, untransformed: bool = False, filename: str | None = None) pandas.DataFrame
Exports currently loaded data from the Analysis, from self.data.
Preserves any previously performed scaling, dropped categories, & batch correction. The exported values are always arcsinh(data / 5) transformed, unless untransformed is True. Can export the entirety of relevant self.data information, or export subsets of self.data, and/or export aggregate summary statistics for groups within the data.
- Args:
- subset_columns (list[str] or None):
a list of strings denoting the columns to subset the dataframe’s rows on (here and in other arguments, the function attempts to cast non-string input to strings, as well as the corresponding column of the data). If this or subset_types is None, no subsetting occurs.
- subset_types (list[list[str]] or None):
a list containing sub-lists of strings. The length of the outer list must equal the length of the subset_columns list, as each sub-list contains strings corresponding to the rows to keep.
As in: if subset_columns = [‘column1’, ‘column3’] and subset_types = [[‘type2’, ‘type6’],[‘typeB’, ‘typeZ’]], then rows of type2 / type6 in column1 will be kept, and similarly rows of typeB / typeZ in column3.
When > 1 columns / conditions are subsetted on, as in the above example, the rows that are kept are the union of all the subsetting conditions WITHIN a given column, but the intersection BETWEEN what is kept from each column. So in the above example, all rows of column1 == type2/6 that also have column3 == typeB/Z are the rows that are maintained.
- groupby_columns (list[str] or None):
A list of strings indicating what columns of the data to groupby. If None, then grouping is not performed. Used like this: self.data.obs.groupby(groupby_columns) but on a dataframe containing the data.X values as well
- statistic (str):
Possible values: ‘mean’,’median’,’sum’,’std’,’count’. Denotes the pandas groupby method to be used after grouping (ignored if groupby_columns is None). Numeric methods (mean, median, sum, std) are only applied to numeric columns, so only those columns + the groupby columns will be in the final dataframe / csv
- groupby_nan_handling(str):
‘zero’ or ‘drop’ – when grouping the data, whether to drop NaNs (which usually represent non-existent category combinations) or to convert NaNs to zeros. Any other value of this parameter will cause NaNs to be left as-is in the data export. Note that the default (and the only option available in the GUI) is ‘zero’, which converts ALL NaN values to 0, while the ‘drop’ option only drops rows where EVERY numerical value is NaN. By default, all possible groupby_columns combinations are included in the export (even if they are not present in the data, such as cell types not present in every ROI). This is the source of most NaN values. Notably, columns in the metadata (not data.obs!) of the Analysis are given special treatment to try to prevent non-existent experimental categories from having data exported (for example, each ROI / sample_id should have been assigned a single condition, not every possible condition in the dataset).
- include_marker_class_row (bool):
Whether to include the marker_class information as a row at the bottom of the table –> True to include this row – useful for reimport into PalmettoBUG. False to not include this row – this is probably better for import into non-PalmettoBUG software for analysis, or at the least the user will need to remember to remove this row before analyzing! When the marker_class row is included, it is encoded as integers (to prevent mixed dtype issues/warnings on reload)
>>> 0 = 'none', 1 = 'type', 2 = 'state'
metadata columns (which have no marker_class) have this row filled with ‘na’. NOT USED IN COMBINATION WITH GROUPING!
- untransformed (bool):
if True, will export the untransformed (pre-arcsinh, pre-scaling, etc., etc.) data, from self.data.uns[‘count’]. Provided so that the raw data is not difficult to recover, although not expected to be used frequently. Default == False.
- filename: (str, or None):
the name of the csv file to save the exported dataframe inside the self.data_table_dir folder. If None, no export occurs, and the data table is only returned.
- Returns:
(pandas DataFrame) – the pandas dataframe representing the exported data.
- Inputs/Outputs:
- Outputs:
If filename is provided (is not None), then exports the data table to self.data_table_dir/filename.csv
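The subsetting rule for subset_columns / subset_types (union within a column, intersection between columns) can be sketched with a boolean mask. The toy `table` is a hypothetical stand-in for the combined obs + expression dataframe, and the string casting is an assumption mirroring the description of subset_columns.

```python
import pandas as pd

# toy stand-in for the table built from self.data.obs and self.data.X
table = pd.DataFrame({
    "column1": ["type2", "type6", "type2", "type9"],
    "column3": ["typeB", "typeA", "typeZ", "typeB"],
    "CD3": [1.0, 2.0, 3.0, 4.0],
})

subset_columns = ["column1", "column3"]
subset_types = [["type2", "type6"], ["typeB", "typeZ"]]

keep = pd.Series(True, index=table.index)
for column, allowed in zip(subset_columns, subset_types):
    # isin() gives the union WITHIN a column; &= gives the intersection BETWEEN columns
    keep &= table[column].astype(str).isin(allowed)

subset = table[keep]
```

Row 1 is dropped because its column3 value is not allowed, and row 3 because its column1 value is not allowed, even though each passes the other column's filter.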
- export_DR(kind: str = 'umap', filename: str | None = None) tuple[pandas.DataFrame, str]
Exports a dimensionality reduction embedding (PCA or UMAP)
- Args:
- kind (str):
one of – “umap”, “pca” – the type of embedding to export
- filename (str or None):
the filename of the csv file to export to self.data_table_dir. if None, no export occurs, only the dataframe is returned
- Returns:
(pandas dataframe) this contains three columns, the dim1/2 of the embedding + the cell number as in self.data (needed as downsampling is typically used for DR)
- Inputs/Outputs:
- Outputs:
If filename is provided (is not None), then exports the UMAP/PCA table to self.data_table_dir/filename.csv
- export_clustering(groupby_column: str = 'metaclustering', identifier: str = '') tuple[pandas.DataFrame, str]
Saves a clustering to self.clusterings_dir as a csv. ALWAYS exports to the disk
These saved clustering files are used for both reloading a clustering later and for loading info into a Spatial analysis. The filename of a clustering is its (groupby_column + identifier).csv
- Args:
- groupby_column (str):
the title of the column in self.data.obs that represents a particular cell type clustering to save. This string forms the first part of the filename of the csv saved to self.clusterings_dir. For the sake of reloading, expected to be one of:
– “classification”, “leiden”, “merging”, “metaclustering”, “clustering” –
If groupby_column == “”, then an attempt will be made to add all the expected groupby columns, as listed above, to the exported file.
- identifier (str):
a string that forms the second part of the filename of the csv saved to self.clusterings_dir
- Returns:
(pandas dataframe): the pandas dataframe that is written to the csv file at the export path
(str): The path to the exported csv file
- Inputs/Outputs:
- Inputs:
Attempts to read from directory above self.directory / regionprops –> looking for regionproperty data to export with the clustering, which can be used in spatial analysis.
- Outputs:
Writes the clustering data table to self.clusterings_dir/{groupby_column}{identifier}.csv
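The naming convention for exported clusterings can be reproduced with pathlib, for example to locate a file for a later load_clustering call. The helper name `clustering_csv_path` and the directory name used below are hypothetical; only the {groupby_column}{identifier}.csv pattern comes from the description above.

```python
from pathlib import Path


def clustering_csv_path(clusterings_dir, groupby_column="metaclustering", identifier=""):
    """Hypothetical helper: build the csv path following {groupby_column}{identifier}.csv."""
    return Path(clusterings_dir) / f"{groupby_column}{identifier}.csv"


# 'clusterings' stands in for self.clusterings_dir
example_path = clustering_csv_path("clusterings", "merging", "_v2")
default_path = clustering_csv_path("clusterings")
```

Note that no separator is inserted between the two parts, so an identifier such as "_v2" should carry its own leading underscore if one is wanted.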
- export_clustering_classy_masks(clustering='merging', identifier='')
Intent of this function is to write a “classy mask” folder from an annotation
- NOTE:
This method depends on the original mask folder and the analysis being linked properly. If data has been dropped from the analysis, then those dropped cells will either be ignored (if an entire sample_id was dropped), or they will be assigned to a ‘none’ label
Uses: visualization, mainly. Perhaps could be used in extending masks
- Args:
- clustering (str):
A column in self.data.obs to categorize the cells by. Each unique value in this column will receive a unique integer number to classify its cell mask by.
- identifier (str):
if not the empty string ‘’, will be appended to the name of the saved classy mask folder / CSV. This name will follow the convention f’{name of the original cell masks folder}_{identifier}’. Use this to make sure that the resultant classy masks have a memorable / distinct name.
- Returns:
a pandas dataframe containing the clustering assignments of every cell in the style of a classy_mask, including, critically the integer assigned to each cluster in the classy masks
- Inputs / Outputs:
- Inputs:
expects to find cell masks at self.input_mask_folder, whose masks correspond to the cells in the sample_id’s of the analysis
- Outputs:
writes to a new classy mask folder at f’{project}/classy_masks/{name of the original cell masks folder}_{identifier}’, including the classified masks themselves in a sub-folder, and dataframes containing information about the classes of the classy masks & their corresponding labels.
- load_clustering(path: str | pathlib.Path) None
Loads a previously exported clustering from the csv file at [path] (typically a file inside self.clusterings_dir)
Expects to find a csv file with the same format as exported by self.export_clustering. Attempts to confirm that the data is unchanged from when the clustering was originally exported. This includes dropped data, batch correction & scaling – so be sure that these are the same when loading a clustering.
- Args:
- path (str or pathlib.Path):
the full path to the csv file to be loaded, which usually exists in the self.clusterings_dir folder
- Returns:
None (modifies self)
- Inputs/Outputs:
- Inputs:
Reads from path, presuming path is the full filename of a .csv file created by self.export_clustering()
- load_classification(cell_classifications: pandas.DataFrame | pathlib.Path | str, column: str = 'labels') None
Load a cell classification from the output data table of classy mask generation using a pixel classifier.
- Args:
- cell_classifications (pandas dataframe, or string / Path):
either a pandas dataframe, or the path to a csv file from which a pandas dataframe will be read. The dataframe must contain one of two columns: “label” and/or “classification” with cell type labels and be equal in length with self.data.obs
- column (str):
“classification” to attempt to load pixel classification numbers, “labels” to try to load biological labels first
- Returns:
None (modifies self)
- Inputs/Outputs:
- Inputs:
if cell_classifications is not a pandas dataframe, attempts to read a .csv from the path specified by cell_classifications