DataSet

class DataSet(loader, metadata_path=None, sample_column=None)[source]

Analysis Object

ancova(protein_id: str, covar: Union[str, list], between: str) DataFrame

Analysis of covariance (ANCOVA) with one or more covariate(s). Wrapper around pingouin.ancova: https://pingouin-stats.org/generated/pingouin.ancova.html

Parameters
  • protein_id (str) – ProteinID/ProteinGroup - dependent variable

  • covar (str or list) – Name(s) of column(s) in metadata with the covariate.

  • between (str) – Name of column in data with the between factor.

Returns

ANCOVA summary:

  • 'Source': Names of the factor considered

  • 'SS': Sums of squares

  • 'DF': Degrees of freedom

  • 'F': F-values

  • 'p-unc': Uncorrected p-values

  • 'np2': Partial eta-squared

Return type

pandas.DataFrame

anova(column: str, protein_ids='all', tukey: bool = True) DataFrame

One-way Analysis of Variance (ANOVA)

Parameters
  • column (str) – A metadata column used to calculate ANOVA

  • protein_ids (str or list, optional) – ProteinIDs to calculate ANOVA for (dependent variable): a single ProteinID as string, several ProteinIDs as list, or “all” to calculate ANOVA for all ProteinIDs. Defaults to “all”.

  • tukey (bool, optional) – Whether to calculate a Tukey-HSD post-hoc test. Defaults to True.

Returns

  • 'Protein ID': ProteinID/ProteinGroup

  • 'ANOVA_pvalue': p-value of ANOVA

  • 'A vs. B Tukey test': Tukey-HSD corrected p-values (each combination represents a column)

Return type

pandas.DataFrame
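
As a reminder of what the 'ANOVA_pvalue' column is based on, the following is a minimal pure-Python sketch of the one-way ANOVA F statistic for a single protein. The group names and intensities are made up for illustration, and the helper `one_way_anova_f` is hypothetical; it is not part of DataSet and does not compute the p-value or the Tukey-HSD test.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical intensities of one protein in three metadata groups.
intensities = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "rescue": [3.0, 4.0, 5.0],
}
f_value = one_way_anova_f(list(intensities.values()))
print(f_value)
```

A larger F value means the group means differ more strongly relative to the spread within the groups.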

batch_correction(batch: str)

Correct for technical bias/batch effects. Reference: Behdenna A, Haziza J, Azencot CA and Nordor A. (2020) pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. bioRxiv. doi: 10.1101/2020.03.17.995431

Parameters

batch (str) – column name in the metadata describing the different batches
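
To illustrate the idea of a batch effect, the sketch below removes a per-batch offset by simple mean centering. This is explicitly not the empirical Bayes method of pyComBat cited above; it is a much simpler stand-in, and the function `center_batches` and the example values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift every batch so that its mean equals the overall mean."""
    grand_mean = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical log2 intensities of one protein measured in two batches.
values = [10.0, 12.0, 20.0, 22.0]
batches = ["b1", "b1", "b2", "b2"]
corrected = center_batches(values, batches)
print(corrected)
```

After centering, both batches share the same mean while the differences within each batch are preserved.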

create_matrix()[source]

Creates a matrix from the output file, with features (proteins) as columns and samples as rows.

diff_expression_analysis(group1: Union[str, list], group2: Union[str, list], column: str = None, method: str = 'ttest', perm: int = 10, fdr: float = 0.05) DataFrame

Perform differential expression analysis using a t-test or a Wald test. The Wald test fits a generalized linear model.

Parameters
  • column (str) – column name in the metadata file with the two groups to compare

  • group1 (str/list) – name of the first group to compare (must be present in column), or a list of sample names to compare

  • group2 (str/list) – name of the second group to compare (must be present in column), or a list of sample names to compare

  • method (str, optional) – statistical method used to calculate differential expression: “ttest” for a t-test, “paired-ttest” for a paired t-test or “wald” for a Wald test. Defaults to “ttest”.

Returns

pandas DataFrame with foldchange, foldchange_log2 and pvalue for each ProteinID/ProteinGroup between group1 and group2.

  • 'Protein ID': ProteinID/ProteinGroup

  • 'pval': p-value of the ProteinID/ProteinGroup

  • 'qval': multiple testing - corrected p-value

  • 'log2fc': log2(foldchange)

  • 'grad': the gradient of the log-likelihood

  • 'coef_mle': the maximum-likelihood estimate of the coefficient in linker-space

  • 'coef_sd': the standard deviation of the coefficient in linker-space

  • 'll': the log-likelihood of the estimation

Return type

pandas.DataFrame
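
The 'log2fc' and 'qval' columns can be illustrated with a small pure-Python sketch. Two assumptions are made here that the documentation above does not confirm: the fold change is taken as mean(group1) / mean(group2) on non-log intensities, and the multiple testing correction is Benjamini-Hochberg. Both helper functions are hypothetical and not part of DataSet.

```python
from math import log2
from statistics import mean


def log2_fold_change(group1, group2):
    """log2 of the ratio of group means (assumed direction: group1 / group2)."""
    return log2(mean(group1) / mean(group2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


fc = log2_fold_change([8.0, 8.0], [2.0, 2.0])
qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(fc, qvals)
```

Adjusted values are never smaller than the raw p-values, which is why a protein can pass a p-value cutoff and still fail the same cutoff on 'qval'.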

go_abundance_correction(bg_sample, fg_sample=None, fg_protein_list=None)

Gene Ontology Enrichment Analysis with abundance correction. Uses the API of a GO tool: https://agotool.org

For the analysis, modified proteins in the foreground sample are compared with the proteins and their intensities in the background sample. If the dataset contains no information about PTMs, a list of enriched proteins in the foreground can be supplied instead. This list of Protein IDs can be obtained by performing a differential expression analysis or an ANOVA.

Parameters
  • fg_sample (str) – name of foreground sample

  • bg_sample (str) – name of background sample

  • fg_protein_list (list, optional) – list of enriched protein ids in the foreground sample. Defaults to None.

Returns

DataFrame with the following columns:

  • 'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.

  • 'term': A unique identifier for a specific functional category.

  • 'description': A short description (or title) of a functional term.

  • 'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or the Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).

  • 'effect size': Proportion of the Foreground and the Background.

  • 'year': Year of the scientific publication.

  • 'over_under': Overrepresented (o) or underrepresented (u).

  • 's_value': The s value is a combination of (minus log) p value and effect size.

  • 'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.

  • 'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.

  • 'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).

  • 'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.

  • 'background_count': The BackGround count is analogous to the “FG count” for the Background.

  • 'background_n': BackGround n is analogous to “FG n”.

  • 'foreground_ids': ForeGround IDentifierS are semicolon-separated protein identifiers of the Foreground that are associated with the given term.

  • 'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.

  • 'etype': Short for “Entity type”, numeric internal identifier for different functional categories.

Return type

pandas.DataFrame
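
The count and ratio columns described above follow directly from the identifier columns. The sketch below recomputes them for one term from semicolon-separated ID strings; the protein IDs, the sample sizes and the helper `term_ratios` are hypothetical and only serve to show how the columns relate to each other.

```python
def count_ids(id_string):
    """Count identifiers in a semicolon-separated string, ignoring empties."""
    return len([p for p in id_string.split(";") if p])


def term_ratios(foreground_ids, foreground_n, background_ids, background_n):
    """Recompute the count and ratio columns for a single functional term."""
    fg_count = count_ids(foreground_ids)
    bg_count = count_ids(background_ids)
    return {
        "foreground_count": fg_count,
        "background_count": bg_count,
        "ratio_in_foreground": fg_count / foreground_n,
        "ratio_in_background": bg_count / background_n,
    }


# Hypothetical term: 3 of 10 foreground and 5 of 100 background proteins.
row = term_ratios("P1;P2;P3", 10, "P1;P2;P3;P4;P5", 100)
print(row)
```

Comparing 'ratio_in_foreground' with 'ratio_in_background' gives a quick impression of how strongly a term is concentrated in the foreground.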

go_characterize_foreground(protein_list, tax_id=9606)

Display existing functional annotations for your protein(s) of interest. No statistical test for enrichment is performed. Uses the API of a GO tool: https://agotool.org

Parameters
  • tax_id (int, optional) – NCBI taxon identifier used as background. Defaults to 9606 (=Homo sapiens).

  • protein_list (list) – list of enriched protein ids in the foreground sample.

Returns

DataFrame with the following columns:

  • 'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.

  • 'term': A unique identifier for a specific functional category.

  • 'description': A short description (or title) of a functional term.

  • 'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or the Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).

  • 'effect size': Proportion of the Foreground and the Background.

  • 'year': Year of the scientific publication.

  • 'over_under': Overrepresented (o) or underrepresented (u).

  • 's_value': The s value is a combination of (minus log) p value and effect size.

  • 'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.

  • 'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.

  • 'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).

  • 'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.

  • 'background_count': The BackGround count is analogous to the “FG count” for the Background.

  • 'background_n': BackGround n is analogous to “FG n”.

  • 'foreground_ids': ForeGround IDentifierS are semicolon-separated protein identifiers of the Foreground that are associated with the given term.

  • 'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.

  • 'etype': Short for “Entity type”, numeric internal identifier for different functional categories.

Return type

pandas.DataFrame

go_compare_samples(fg_sample, bg_sample)

Gene Ontology Enrichment Analysis without abundance correction. Uses the API of a GO tool: https://agotool.org

Parameters
  • fg_sample (str) – name of the foreground sample

  • bg_sample (str) – name of the background sample

Returns

DataFrame with the following columns:

  • 'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.

  • 'term': A unique identifier for a specific functional category.

  • 'description': A short description (or title) of a functional term.

  • 'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or the Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).

  • 'effect size': Proportion of the Foreground and the Background.

  • 'year': Year of the scientific publication.

  • 'over_under': Overrepresented (o) or underrepresented (u).

  • 's_value': The s value is a combination of (minus log) p value and effect size.

  • 'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.

  • 'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.

  • 'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).

  • 'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.

  • 'background_count': The BackGround count is analogous to the “FG count” for the Background.

  • 'background_n': BackGround n is analogous to “FG n”.

  • 'foreground_ids': ForeGround IDentifierS are semicolon-separated protein identifiers of the Foreground that are associated with the given term.

  • 'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.

  • 'etype': Short for “Entity type”, numeric internal identifier for different functional categories.

Return type

pandas.DataFrame

go_genome(tax_id=9606, fg_sample=None, protein_list=None)

Gene Ontology Enrichment Analysis using a Background from UniProt Reference Proteomes. Uses the API of a GO tool: https://agotool.org

Parameters
  • tax_id (int, optional) – NCBI taxon identifier used as background. Defaults to 9606 (=Homo sapiens).

  • fg_sample (str, optional) – name of sample used as foreground. Defaults to None.

  • protein_list (list, optional) – list of enriched protein ids in the foreground sample. Defaults to None.

Returns

DataFrame with the following columns:

  • 'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.

  • 'term': A unique identifier for a specific functional category.

  • 'description': A short description (or title) of a functional term.

  • 'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or the Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).

  • 'effect size': Proportion of the Foreground and the Background.

  • 'year': Year of the scientific publication.

  • 'over_under': Overrepresented (o) or underrepresented (u).

  • 's_value': The s value is a combination of (minus log) p value and effect size.

  • 'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.

  • 'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.

  • 'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).

  • 'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.

  • 'background_count': The BackGround count is analogous to the “FG count” for the Background.

  • 'background_n': BackGround n is analogous to “FG n”.

  • 'foreground_ids': ForeGround IDentifierS are semicolon-separated protein identifiers of the Foreground that are associated with the given term.

  • 'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.

  • 'etype': Short for “Entity type”, numeric internal identifier for different functional categories.

Return type

pandas.DataFrame

load_metadata(file_path)[source]

Load metadata from an xlsx, csv or txt file

Parameters

file_path (str) – path to metadata file

overview()[source]

Print overview of the DataSet

plot_clustermap(**kwargs)

plot_correlation_matrix(method='pearson')

Plot Correlation Matrix

Parameters
  • method (str, optional) – correlation coefficient: “pearson”, “kendall” (Kendall Tau correlation) or “spearman”. Defaults to “pearson”.

Returns

Correlation matrix

Return type

plotly.graph_objects._figure.Figure
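
To show how the choice of method changes the result, here is a minimal pure-Python sketch of the Pearson and Spearman coefficients for two samples. It is an illustration only and not the code used by plot_correlation_matrix; the helper functions and the example intensities are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical intensities of four proteins in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [1.0, 4.0, 9.0, 16.0]
r_pearson = pearson(sample_a, sample_b)
r_spearman = spearman(sample_a, sample_b)
print(r_pearson, r_spearman)
```

The relationship here is monotonic but not linear, so the Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it.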

plot_dendrogram(**kwargs)

plot_imputed_values()

plot_intensity(protein_id, group=None, subgroups=None, method='box', add_significance=False, log_scale=False, compare_preprocessing_modes=False)

Plot Intensity of individual Protein/ProteinGroup

Parameters
  • protein_id (str) – ProteinGroup ID

  • group (str, optional) – A metadata column used for grouping. Defaults to None.

  • subgroups (list, optional) – Select variables from the group column. Defaults to None.

  • method (str, optional) – Violinplot = “violin”, Boxplot = “box”, Scatterplot = “scatter” or “all”. Defaults to “box”.

  • add_significance (bool, optional) – add p-value bar, only possible when two groups are compared. Defaults to False.

  • log_scale (bool, optional) – yaxis in logarithmic scale. Defaults to False.

Returns

Plotly Plot

Return type

plotly.graph_objects._figure.Figure

plot_pca(**kwargs)

plot_sampledistribution(method='violin', color=None, log_scale=False)

Plot the intensity distribution of each sample as a violin plot or a box plot.

Parameters
  • method (str, optional) – Violinplot = “violin”, Boxplot = “box”. Defaults to “violin”.

  • color (str, optional) – A metadata column used to color the boxes. Defaults to None.

  • log_scale (bool, optional) – yaxis in logarithmic scale. Defaults to False.

Returns

Plotly Sample Distribution Plot

Return type

plotly.graph_objects._figure.Figure

plot_samplehistograms()

Plots the density distribution of each sample

Returns

Plotly Graph Object

Return type

plotly

plot_tsne(**kwargs)

plot_umap(**kwargs)

plot_volcano(group1, group2, column: Optional[str] = None, method: str = 'ttest', labels: bool = False, min_fc: float = 1.0, alpha: float = 0.05, draw_line: bool = True, perm: int = 100, fdr: float = 0.05, compare_preprocessing_modes: bool = False, color_list: list = [])

Plot Volcano Plot

Parameters
  • column (str) – column name in the metadata file with the two groups to compare

  • group1 (str/list) – name of the first group to compare (must be present in column), or a list of sample names to compare

  • group2 (str/list) – name of the second group to compare (must be present in column), or a list of sample names to compare

  • method (str) – “anova”, “wald”, “ttest” or “SAM”. Defaults to “ttest”.

  • labels (bool) – Add text labels to significant Proteins, Default False.

  • alpha (float,optional) – p-value cut off.

  • min_fc (float) – Minimum fold change.

  • draw_line (boolean) – whether to draw cut off lines.

  • perm (int, optional) – number of permutations when using SAM as method. Defaults to 100.

  • fdr (float,optional) – FDR cut off when using SAM as method. Defaults to 0.05.

  • color_list (list) – list with ProteinIDs that should be highlighted.

  • compare_preprocessing_modes (bool) – Will iterate through normalization and imputation modes and return a list of VolcanoPlots in different settings, Default False.

Returns

Volcano Plot

Return type

plotly.graph_objects._figure.Figure
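
The role of min_fc and alpha can be sketched as a simple classification of each protein. The sketch assumes that min_fc is applied to the absolute log2 fold change and that alpha is applied to the p-value; the documentation above does not spell this out. The functions `classify` and `volcano_coordinates` and the example rows are hypothetical.

```python
from math import log10


def classify(log2fc, pvalue, min_fc=1.0, alpha=0.05):
    """Label a protein as 'up', 'down' or 'non_sig' (assumed cutoff logic)."""
    if pvalue < alpha and log2fc >= min_fc:
        return "up"
    if pvalue < alpha and log2fc <= -min_fc:
        return "down"
    return "non_sig"


def volcano_coordinates(results, min_fc=1.0, alpha=0.05):
    """Return (x, y, label) per protein: x = log2fc, y = -log10(p-value)."""
    return [
        (r["log2fc"], -log10(r["pval"]), classify(r["log2fc"], r["pval"], min_fc, alpha))
        for r in results
    ]


# Hypothetical differential expression results.
results = [
    {"Protein ID": "P1", "log2fc": 2.5, "pval": 0.001},
    {"Protein ID": "P2", "log2fc": -1.8, "pval": 0.01},
    {"Protein ID": "P3", "log2fc": 3.0, "pval": 0.2},
    {"Protein ID": "P4", "log2fc": 0.4, "pval": 0.001},
]
points = volcano_coordinates(results)
print(points)
```

Proteins that pass only one of the two cutoffs, such as P3 and P4 here, end up in the non-significant class.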

preprocess(log2_transform: bool = True, remove_contaminations: bool = False, subset: bool = False, normalization: str = None, imputation: str = None, remove_samples: list = None)

Preprocess Protein data

Removal of contaminations:

Removes all observations, that were identified as contaminations.

Normalization:

“zscore”, “quantile”, “linear”, “vst”

Normalize data using z-score, quantile, linear (L2 norm) or variance-stabilizing (vst) normalization.

Z-score normalization equals standardization using StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Variance stabilization transformation uses PowerTransformer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html

For more information, see scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html

Imputation:

“mean”, “median”, “knn” or “randomforest”. For more information, see:

SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

k-Nearest Neighbors Imputation: https://scikit-learn.org/stable/modules/impute.html#impute

Random Forest Imputation: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

Parameters
  • remove_contaminations (bool, optional) – remove ProteinGroups that are identified as contamination.

  • log2_transform (bool, optional) – Log2 transform data. Default to True.

  • normalization (str, optional) – method to normalize data: either “zscore”, “quantile”, “linear” or “vst”. Defaults to None.

  • remove_samples (list, optional) – list with sample ids to remove. Defaults to None.

  • imputation (str, optional) – method to impute data: either “mean”, “median”, “knn” or “randomforest”. Defaults to None.

  • subset (bool, optional) – filter the matrix so that only samples described in the metadata are kept. Defaults to False.
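
The effect of three of the options (log2 transformation, “mean” imputation and “zscore” normalization) can be sketched in pure Python on a tiny samples-by-proteins matrix. This is not the library's implementation, which relies on scikit-learn; the order of the steps is chosen here for simplicity and may differ from what preprocess does internally. All function names and values are hypothetical.

```python
from math import log2
from statistics import mean, pstdev


def log2_transform(matrix):
    """Log2 transform every value, keeping missing values (None) as they are."""
    return [[log2(v) if v is not None else None for v in row] for row in matrix]


def impute_mean(matrix):
    """Replace missing values with the mean of their column (protein)."""
    columns = list(zip(*matrix))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)] for row in matrix]


def zscore(matrix):
    """Standardize each column to mean 0 and unit (population) variance.

    Assumes every column has non-zero variance.
    """
    columns = list(zip(*matrix))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [[(v - stats[j][0]) / stats[j][1] for j, v in enumerate(row)] for row in matrix]


# Rows are samples, columns are proteins; None marks a missing intensity.
raw = [
    [4.0, 8.0],
    [16.0, None],
    [64.0, 32.0],
]
logged = log2_transform(raw)
imputed = impute_mean(logged)
scaled = zscore(imputed)
print(scaled)
```

The population standard deviation is used because StandardScaler, which the text above links to, also standardizes with it.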

preprocess_print_info()

Print summary of preprocessing steps

reset_preprocessing()

Reset all preprocessing steps

tukey_test(protein_id: str, group: str, df: DataFrame = None) DataFrame

Calculate pairwise Tukey-HSD post-hoc test. Wrapper around: https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html#pingouin.pairwise_tukey

Parameters
  • protein_id (str) – ProteinID to calculate the pairwise Tukey-HSD post-hoc test for (dependent variable)

  • group (str) – A metadata column used to calculate the pairwise Tukey test

  • df (pandas.DataFrame, optional) – Defaults to None.

Returns

  • 'A': Name of first measurement

  • 'B': Name of second measurement

  • 'mean(A)': Mean of first measurement

  • 'mean(B)': Mean of second measurement

  • 'diff': Mean difference (= mean(A) - mean(B))

  • 'se': Standard error

  • 'T': T-values

  • 'p-tukey': Tukey-HSD corrected p-values

  • 'hedges': Hedges effect size (or any effect size defined in effsize)

  • 'comparison': combination of measurements

  • 'Protein ID': ProteinID/ProteinGroup

Return type

pandas.DataFrame
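
The structure of the returned table, one row per pair of groups, can be sketched as follows. Only the 'A', 'B', 'mean(A)', 'mean(B)', 'diff' and 'comparison' columns are reproduced; the Tukey-HSD statistics themselves are not computed. The “A vs. B” format of the comparison label is an assumption, and the helper `pairwise_mean_differences` and the example data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_mean_differences(groups):
    """Build one row per unordered pair of groups with diff = mean(A) - mean(B)."""
    rows = []
    for a, b in combinations(groups, 2):
        mean_a, mean_b = mean(groups[a]), mean(groups[b])
        rows.append(
            {
                "A": a,
                "B": b,
                "mean(A)": mean_a,
                "mean(B)": mean_b,
                "diff": mean_a - mean_b,
                "comparison": f"{a} vs. {b}",  # assumed label format
            }
        )
    return rows


# Hypothetical intensities of one protein in three groups.
groups = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "rescue": [4.0, 5.0, 6.0],
}
rows = pairwise_mean_differences(groups)
for row in rows:
    print(row["comparison"], row["diff"])
```

With k groups the table has k * (k - 1) / 2 rows, so three groups yield three comparisons.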