palmettobug.Analysis_functions.WholeClassAnalysis
=================================================

.. py:module:: palmettobug.Analysis_functions.WholeClassAnalysis

.. autoapi-nested-parse::

   This module contains the back-end class (WholeClassAnalysis) that handles the analysis of pixel classes as a whole. In the GUI, it is accessed
   through the second half of the third tab of the program (use pixel classifier).
   It is also available through the public (non-GUI) API of PalmettoBUG.


Classes
-------

.. autoapisummary::

   palmettobug.Analysis_functions.WholeClassAnalysis.WholeClassAnalysis


Module Contents
---------------

.. py:class:: WholeClassAnalysis(directory: Union[pathlib.Path, str], classifier_df: pandas.DataFrame, metadata: pandas.DataFrame, Analysis_panel: pandas.DataFrame, csv: Union[pandas.DataFrame, None] = None)

   This class handles the whole-class analysis, where pixel regions are treated as if they were cell segmentation masks.

   It has limited options compared to the standard experiment class that handles true single-cell data. This class offers only a few plot options
   and a single statistics option, with no batch correction, dropping of samples, or scaling.

   Args:
       directory (string or Path): the path to a folder containing /intensities and /regionprops subfolders,
               which would have been produced by running region measurements on the pixel classification maps
               generated by a pixel classifier. 

       classifier_df (pandas dataframe): the biological_labels.csv exported from the pixel classifier whose
               output is being used. Contains "labels", "class", and/or "merge" columns,
               which help associate region numbers in the images / regionprops & intensity csvs
               with the biological labels in the classifier. 

       metadata (pandas dataframe): analogous to the metadata csv file in a standard, single-cell analysis.
               Contains the same file_name, sample_id, patient_id, and condition columns.

       Analysis_panel (pandas dataframe): analogous to the Analysis_panel csv file in a standard, single-cell analysis.
               For example, contains columns for antigen and marker_class.

       class_type (string): one of -- "premerge", "merged" -- whether the outputs of the classifier are before or after
               merging (relevant for what column in classifier_df is used as the class, "merging" or "class" )

   Key Attributes:
       data (anndata.AnnData): the data, with data.X being a numpy array containing the channel information per "event" (per class per image),
               data.obs being derived from the inputted metadata, and data.var being derived from the Analysis_panel

       class_labels (pandas DataFrame): this is the inputted classifier_df, which associates the class numbers with biological labels

       directory (str): The path to the folder where the analysis is to be initialized. Used to set up directories 
               (such as save_dir, data_table_dir), to export some files (input_tables_to_csv) and to find the expected
               intensities / regionprops csv files when loading the data. 

       save_dir (str): The path to where plots are saved by this class (when filename is provided in plotting functions)

       data_table_dir (str): The path to where data tables are saved by this class (when filename is provided to methods that produce
               dataframes such as statistics / exports) 
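
   As a rough illustration of the input tables described above, the sketch below builds minimal versions of them with pandas.
   All values, file names, and the particular combination of columns in ``classifier_df`` are assumptions made for illustration.
   The constructor call is left commented out because it requires a real directory of region measurements.

   .. code-block:: python

      import pandas as pd

      # Hypothetical biological labels for a two-class pixel classifier.
      classifier_df = pd.DataFrame({
          "class": [1, 2],
          "labels": ["epithelium", "stroma"],
      })

      # Hypothetical metadata: one row per image.
      metadata = pd.DataFrame({
          "file_name": ["roi_1.tiff", "roi_2.tiff"],
          "sample_id": ["1", "2"],
          "patient_id": ["A", "B"],
          "condition": ["control", "treated"],
      })

      # Hypothetical panel: one row per channel.
      Analysis_panel = pd.DataFrame({
          "antigen": ["CD3", "Ki67"],
          "marker_class": ["type", "state"],
      })

      # With real data on disk, the class would then be created along these lines:
      # analysis = WholeClassAnalysis(directory, classifier_df, metadata, Analysis_panel)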


   .. py:attribute:: directory
      :value: ''


   .. py:attribute:: class_labels


   .. py:attribute:: _metadata


   .. py:attribute:: _panel


   .. py:attribute:: save_dir
      :value: '/python_plots'


   .. py:attribute:: data_table_dir
      :value: '/Data_tables'


   .. py:attribute:: percent_areas
      :value: None


   .. py:method:: _load(csv: Union[pandas.DataFrame, None] = None, arcsinh_cofactor: int = 5) -> None

      Helper to the __init__ method: performs the loading and shaping of data during the initial load.
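
      The ``arcsinh_cofactor`` argument suggests that intensities are arcsinh transformed on load, consistent with the
      ``arcsinh(data / 5)`` transformation mentioned under ``export_data``. A minimal NumPy sketch of that transformation,
      using made-up intensity values (an illustration, not the library's actual loading code):

      .. code-block:: python

         import numpy as np

         arcsinh_cofactor = 5  # default shown in the signature above
         raw = np.array([0.0, 5.0, 50.0])  # made-up raw intensities

         transformed = np.arcsinh(raw / arcsinh_cofactor)

         # The transformation can be inverted to recover the raw values.
         recovered = np.sinh(transformed) * arcsinh_cofactor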


   .. py:method:: input_tables_to_csv() -> None

      Allows the saving of the primary csv files within this class to the disk inside the self.directory folder


   .. py:method:: plot_percent_areas(filename: Union[str, None] = None, N_column: str = 'sample_id', calculate_only: bool = False) -> matplotlib.pyplot.figure

      Plots boxplots of the percent of each class in each image, showing and comparing the distributions between conditions

      Returns the plot as a matplotlib figure

      Args:
          filename (str or None):
              If filename is specified, this will export the plot as a PNG file to self.save_dir/{filename}.png

          N_column (str):
              The aggregating group for the data. As in, the individual dots of the distribution in the boxplot will be determined
              by the unique groups in this column.

          calculate_only (bool):
              Default == False. If True (& self.percent_areas is None), this method will not return anything, 
              but instead will perform the calculation of %pixel class in each ROI. This calculation will be 
              saved to self.percent_areas, where it can easily be plotted by this function later. 
              This saves time, as the calculation only has to be done once.

      Returns:
          matplotlib.pyplot figure or None (returns None only if calculate_only == True, and no 
          prior calculation of the % areas has been done)
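
      A rough sketch of the kind of calculation ``calculate_only`` refers to, the percent of each image occupied by each pixel class.
      The table layout, the column names, and the choice of denominator (the summed area of all classes in an image) are assumptions
      for illustration and may differ from the library's implementation:

      .. code-block:: python

         import pandas as pd

         # Made-up region measurements: one row per class per image.
         regions = pd.DataFrame({
             "sample_id": ["1", "1", "2", "2"],
             "class": ["epithelium", "stroma", "epithelium", "stroma"],
             "area": [300, 100, 50, 150],
         })

         # Total classified area per image, aligned to the rows of `regions`.
         total_area = regions.groupby("sample_id")["area"].transform("sum")
         regions["percent_area"] = 100 * regions["area"] / total_area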


   .. py:method:: plot_distribution_exprs(unique_class: Union[str, int], plot_type: str, N_column: str = 'sample_id', marker_class: str = 'All', filename: Union[str, None] = None) -> seaborn.FacetGrid

      Plots a Bar or Violin plot from the distribution of marker expression / %class in each sample_id, comparing conditions

      Args: 
          unique_class (string or integer):
              Indicates which pixel class to plot antigen expressions for

          N_column (str):
              Indicates which column in the data will serve as the aggregating column for creating the distribution in the final plot

          plot_type (string):
              'Violin' or 'Bar' -- determines what kind of plot is created

          marker_class (string):
              'All', 'type', 'state', or 'none' (or any other marker_class in self.data.var['marker_class']). Determines which antigens are used in the plot
              By default, every antigen, regardless of marker_class is used ('All'). 

          filename (str or None):
              If specified, this function will additionally export the plot as a PNG file to self.save_dir/{filename}.png

      Returns:
          the plot as a seaborn FacetGrid (FacetGrid.figure --> a matplotlib figure)


   .. py:method:: whole_marker_exprs_ANOVA(marker_class: str = 'All', groupby_column: str = 'class', N_column: str = 'sample_id', variable: str = 'condition', statistic: str = 'mean', area: bool = True) -> pandas.DataFrame

      Calculates statistics comparing the conditions in the experiment using ANOVA on the expression of [marker_class] markers 
      and %area of each class

      Args:
          marker_class (string): which markers / antigens to test by ANOVA. one of -- "All", "type","state", "none". 

          groupby_column (string): which column the data will be grouped by for the purpose of running separate ANOVAs for each group 
                  (as this is whole-class analysis, should always be "class")

          N_column (string): The column in the data that defines the groups of the statistical test (i.e., the 'N' groups 
                  that contribute to the degrees of freedom in the test)

          variable (string): which column in self.data.obs will be treated as the column containing condition / group information

          statistic (string): one of -- "ANOVA", "Kruskal" -- which statistical test (ANOVA, Kruskal-Wallis), and what aggregate statistic 
                  (mean/std or median/IQR, respectively) is calculated & displayed in the final dataframe

          area (bool): whether to also calculate an ANOVA comparing the %area of each class between the conditions (default is True)

      Returns:
          (pandas dataframe): the pandas dataframe containing the statistical outputs of this test.
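
      As a hedged illustration of the two tests named above, the sketch below aggregates made-up expression values to one mean per
      ``N_column`` group and then compares conditions with SciPy's ``f_oneway`` (one-way ANOVA) and ``kruskal`` (Kruskal-Wallis).
      The data and column names are hypothetical, and this is not the method's actual implementation:

      .. code-block:: python

         import pandas as pd
         from scipy import stats

         # Made-up data: two events per sample, three samples per condition.
         df = pd.DataFrame({
             "sample_id": ["1", "1", "2", "2", "3", "3", "4", "4", "5", "5", "6", "6"],
             "condition": ["ctrl"] * 6 + ["treated"] * 6,
             "CD3": [1.0, 1.2, 0.9, 1.1, 1.3, 1.1, 2.0, 2.2, 1.9, 2.1, 2.4, 2.2],
         })

         # Aggregate to one value per sample so that each sample counts once.
         per_sample = df.groupby(["condition", "sample_id"])["CD3"].mean().reset_index()

         # One array of per-sample means for each condition.
         groups = [g["CD3"].to_numpy() for _, g in per_sample.groupby("condition")]

         anova = stats.f_oneway(*groups)
         kruskal = stats.kruskal(*groups)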


   .. py:method:: plot_heatmap(type_of_stat: str = 'F statistic', filename: Union[str, None] = None) -> matplotlib.pyplot.figure

      Plots a heatmap of the statistics. If the statistic is a p value rather than an F statistic, the negative log of the p value is plotted.
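
      A small sketch of the p value transformation mentioned above, using made-up p values. The base of the logarithm is not
      stated here, so base 10 is assumed:

      .. code-block:: python

         import numpy as np

         p_values = np.array([0.05, 0.001, 1.0])  # made-up p values

         # Smaller p values map to larger heatmap values; p == 1 maps to 0.
         neg_log_p = -np.log10(p_values)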


   .. py:method:: export_data(subset_columns: Union[list[str], None] = None, subset_types: Union[list[list[str]], None] = None, groupby_columns: Union[list[str], None] = None, statistic: str = 'mean', groupby_nan_handling: str = 'zero', include_marker_class_row: bool = False, untransformed: bool = False, filename: Union[str, None] = None) -> pandas.DataFrame

      Exports currently loaded data from the Analysis, from self.data. 

      Preserves any previously performed scaling, dropped categories, & batch correction. The exported data is arcsinh(data / 5) transformed
      (unless untransformed is True). Can export the entirety of relevant self.data information, or export subsets of self.data, and/or export
      aggregate summary statistics for groups within the data. 

      Args:
          subset_columns (list[str] or None): 
              a list of strings denoting the columns to subset the dataframe's rows on (here and in other arguments, the function attempts 
              to cast non-string input to strings, as well as the corresponding column of the data). If this or subset_types is None, no subsetting occurs. 

          subset_types (list[list[str]] or None): 
              a list containing sub-lists of strings. The length of the outer list must equal the length of
              the subset_columns list, as each sub-list contains strings corresponding to the rows to keep. 

                  As in: if subset_columns = ['column1', 'column2'] and subset_types = [['type2', 'type6'],['typeB', 'typeZ']],
                  then rows of type2 / type6 in column1 will be kept, and similarly rows of typeB / typeZ in column2.

              When > 1 columns / conditions are subsetted on, as in the above example, the rows that are kept are the union of 
              all the subsetting conditions WITHIN a given column, but the intersection BETWEEN what is kept from each column. 
              So in the above example, all rows of column1 == type2/6 that also have column2 == typeB/Z are the rows that are maintained.
                                                      
          groupby_columns (list[str] or None): 
              A list of strings indicating what columns of the data to groupby. If None, then grouping is not performed.
              Used like this:    self.data.obs.groupby(groupby_columns)              but on a dataframe containing the data.X values as well

          statistic (str): 
              Possible values: 'mean','median','sum','std','count'. Denotes the pandas groupby method to be used after grouping (ignored if groupby_columns is None).
              Numeric methods (mean, median, sum, std) are only applied to numeric columns, so only those columns + the groupby columns 
              will be in the final dataframe / csv
          
          groupby_nan_handling (str):
              'zero' or 'drop' -- when grouping the data, whether to drop NaNs (which usually represent non-existent category combinations) or to 
              convert NaNs to zeros. Any other value of this parameter will cause NaNs to be left as-is in the data export.
              Note that the default (and only option available in the GUI) is 'zero', which converts ALL NaN values to 0, while the 'drop' option only drops
              rows where EVERY numerical value is NaN.
              By default, all possible groupby_columns combinations are included in the export (even if they are not present in the data, such as cell types 
              not present in every ROI). This is the source of most NaN values. Notably, columns in the metadata (not data.obs!) of the Analysis are given special 
              treatment to try to prevent non-existent experimental categories from having data exported (for example, each ROI / sample_id should be 
              associated with a single condition, not every possible condition in the dataset). 

          include_marker_class_row (bool): 
              Whether to include the marker_class information as a row at the bottom of the table --> True to 
              include this row -- useful for reimport into PalmettoBUG.
              False to not include this row -- this is probably better for import into non-PalmettoBUG software for analysis,
              or at the least the user will need to remember to remove this row before analyzing!
              When the marker_class row is included, it is encoded as integers (to prevent mixed dtype issues/warnings on reload)
              
                  >>> 0 = 'none', 1 = 'type', 2 = 'state'

              metadata columns (which have no marker_class) have this row filled with 'na'. 
              NOT USED IN COMBINATION WITH GROUPING!

          untransformed (bool):
              if True, will export the untransformed (pre-arcsinh, pre-scaling, etc., etc.) data, from self.data.uns['count'].
              Provided so that the raw data is not difficult to recover, although not expected to be used frequently. Default == False. 

          filename (str or None): 
              the name of the csv file in which to save the exported dataframe, inside the self.data_table_dir folder. If None, no export occurs, and the data table is only returned. 

      Returns:
          (pandas DataFrame) -- the pandas dataframe representing the exported data. 

      Inputs/Outputs:
          Outputs: 
              If filename is provided (is not None), then exports the data table to self.data_table_dir/filename.csv
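
      The subsetting and NaN-handling rules described above can be sketched with plain pandas. The table, its column names, and
      the use of ``isin`` / ``groupby`` / ``reindex`` are illustrative assumptions, not the method's actual implementation:

      .. code-block:: python

         import pandas as pd

         # Made-up table: two metadata-like columns plus one marker column.
         df = pd.DataFrame({
             "column1": ["type2", "type6", "type2", "type9"],
             "column2": ["typeB", "typeB", "typeZ", "typeB"],
             "CD3": [1.0, 2.0, 3.0, 4.0],
         })

         subset_columns = ["column1", "column2"]
         subset_types = [["type2", "type6"], ["typeB", "typeZ"]]

         # Union within a column (isin), intersection between columns (&).
         keep = pd.Series(True, index=df.index)
         for column, types in zip(subset_columns, subset_types):
             keep &= df[column].astype(str).isin(types)
         subset = df[keep]

         # Group, then list every combination of the grouping columns so that
         # combinations absent from the data appear as NaN.
         means = subset.groupby(["column1", "column2"])["CD3"].mean()
         all_combos = pd.MultiIndex.from_product(
             [subset["column1"].unique(), subset["column2"].unique()],
             names=["column1", "column2"],
         )
         grouped = means.reindex(all_combos).reset_index()

         numeric_columns = ["CD3"]
         # Analogous to 'zero': every NaN becomes 0.
         zero_filled = grouped.fillna({column: 0 for column in numeric_columns})
         # Analogous to 'drop': remove rows where every numeric value is NaN.
         dropped = grouped.dropna(how="all", subset=numeric_columns)

      In this sketch the (type6, typeZ) combination does not occur in the data, so it is the one row that is either
      zero-filled or dropped.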