DataSet
- class DataSet(loader, metadata_path=None, sample_column=None)[source]
Analysis Object
- ancova(protein_id: str, covar: Union[str, list], between: str) DataFrame
Analysis of covariance (ANCOVA) with one or more covariate(s). Wrapper around https://pingouin-stats.org/generated/pingouin.ancova.html
- Parameters
protein_id (str) – ProteinID/ProteinGroup - dependent variable
covar (str or list) – Name(s) of column(s) in metadata with the covariate.
between (str) – Name of column in data with the between factor.
- Returns
ANCOVA summary:
'Source': Names of the factor considered
'SS': Sums of squares
'DF': Degrees of freedom
'F': F-values
'p-unc': Uncorrected p-values
'np2': Partial eta-squared
- Return type
pandas.DataFrame
- anova(column: str, protein_ids='all', tukey: bool = True) DataFrame
One-way Analysis of Variance (ANOVA)
- Parameters
column (str) – A metadata column used to calculate ANOVA
protein_ids (str or list, optional) – ProteinIDs to calculate ANOVA for (dependent variable): either a single ProteinID as a string, several ProteinIDs as a list, or “all” to calculate ANOVA for all ProteinIDs. Defaults to “all”.
tukey (bool, optional) – Whether to calculate a Tukey-HSD post-hoc test. Defaults to True.
- Returns
'Protein ID': ProteinID/ProteinGroup
'ANOVA_pvalue': p-value of ANOVA
'A vs. B Tukey test': Tukey-HSD corrected p-values (each combination represents a column)
- Return type
pandas.DataFrame
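To illustrate what the ANOVA p-value is derived from, the sketch below computes the one-way F statistic for a single protein across the groups of one metadata column, using only the standard library. The intensities and group names are made up for illustration, and `one_way_f` is a hypothetical helper, not part of DataSet; the library's own implementation is not shown in this reference.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a dict of group name -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical log2 intensities of one protein in three groups.
intensities = {
    "control": [20.0, 21.0, 22.0],
    "treated": [23.0, 24.0, 25.0],
    "recovery": [26.0, 27.0, 28.0],
}
f_value, df_b, df_w = one_way_f(intensities)
print(f_value, df_b, df_w)
```

The p-value reported in 'ANOVA_pvalue' corresponds to the upper tail of the F distribution with these degrees of freedom; that step is left out here because it needs a statistics library.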
- batch_correction(batch: str)
Correct for technical bias/batch effects.
Reference: Behdenna A, Haziza J, Azencot CA and Nordor A. (2020) pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. bioRxiv doi: 10.1101/2020.03.17.995431
- Parameters
batch (str) – column name in the metadata describing the different batches
- create_matrix()[source]
Creates a matrix from the output file, with features (proteins) as columns and samples as rows.
- diff_expression_analysis(group1: Union[str, list], group2: Union[str, list], column: str = None, method: str = 'ttest', perm: int = 10, fdr: float = 0.05) DataFrame
Perform differential expression analysis using a t-test or a Wald test. The Wald test fits a generalized linear model.
- Parameters
column (str) – column name in the metadata file with the two groups to compare
group1 (str/list) – name of the first group to compare (must be present in column), or a list of sample names to compare
group2 (str/list) – name of the second group to compare (must be present in column), or a list of sample names to compare
method (str, optional) – statistical method used to calculate differential expression: “wald” for the Wald test, “paired-ttest” for a paired t-test. Defaults to “ttest”.
- Returns
pandas DataFrame with foldchange, foldchange_log2 and pvalue for each ProteinID/ProteinGroup between group1 and group2.
'Protein ID': ProteinID/ProteinGroup
'pval': p-value of the ProteinID/ProteinGroup
'qval': multiple testing - corrected p-value
'log2fc': log2(foldchange)
'grad': the gradient of the log-likelihood
'coef_mle': the maximum-likelihood estimate of coefficient in liker-space
'coef_sd': the standard deviation of the coefficient in liker-space
'll': the log-likelihood of the estimation
- Return type
pandas.DataFrame
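The relationship between 'log2fc' and a t statistic can be sketched for one protein as follows. This is an illustration under stated assumptions, not the library's code: the intensities are invented, the data are assumed to be log2-transformed already, the sign convention (group1 minus group2) is an assumption, and Welch's unequal-variance form of the t statistic is chosen here for simplicity. `log2fc_and_t` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, variance


def log2fc_and_t(group1, group2):
    """Return (log2 fold change, Welch t statistic) for log2 intensities."""
    # On log2-transformed data the log2 fold change is a difference of means.
    # Assumption: group1 minus group2; the library may use the opposite sign.
    log2fc = mean(group1) - mean(group2)
    # Standard error of the mean difference (Welch, sample variances).
    se = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return log2fc, log2fc / se


# Hypothetical log2 intensities of one protein in two groups.
group1 = [24.0, 25.0, 26.0]
group2 = [21.0, 22.0, 23.0]

log2fc, t_stat = log2fc_and_t(group1, group2)
fold_change = 2 ** log2fc  # back-transform to the linear scale
print(log2fc, fold_change, t_stat)
```

Converting the t statistic into 'pval' and then into 'qval' requires the t distribution and a multiple-testing correction, both of which are omitted from this sketch.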
- go_abundance_correction(bg_sample, fg_sample=None, fg_protein_list=None)
Gene Ontology Enrichment Analysis with abundance correction. Uses the aGOtool API: https://agotool.org
For the analysis, modified proteins in the foreground sample are compared with the proteins of the background sample and their intensities. If the dataset contains no information about PTMs, a list of enriched proteins in the foreground can be loaded instead. This list of Protein IDs can be obtained by performing a differential expression analysis or an ANOVA.
- Parameters
fg_sample (str) – name of foreground sample
bg_sample (str) – name of background sample
fg_protein_list (list, optional) – list of enriched protein ids in the foreground sample. Defaults to None.
- Returns
DataFrame with the following columns:
'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.
'term': A unique identifier for a specific functional category.
'description': A short description (or title) of a functional term.
'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).
'effect size': Proportion of the Foreground and the Background
'year': Year of the scientific publication.
'over_under': Overrepresented (o) or underrepresented (u).
's_value': The s value is a combination of (minus log) p value and effect size.
'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.
'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.
'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).
'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.
'background_count': The BackGround count is analogous to the “FG count” for the Background.
'background_n': BackGround n is analogous to “FG n”.
'foreground_ids': ForeGround IDentifierS are semicolon separated protein identifiers of the Foreground that are associated with the given term.
'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.
'etype': Short for “Entity type”, numeric internal identifier for different functional categories.
- Return type
pandas.DataFrame
- go_characterize_foreground(protein_list, tax_id=9606)
Display existing functional annotations for your protein(s) of interest. No statistical test for enrichment is performed. Uses the aGOtool API: https://agotool.org
- Parameters
tax_id (int, optional) – NCBI taxon identifier used as background. Defaults to 9606 (=Homo sapiens).
protein_list (list) – list of enriched protein ids in the foreground sample.
- Returns
DataFrame with the following columns:
'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.
'term': A unique identifier for a specific functional category.
'description': A short description (or title) of a functional term.
'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).
'effect size': Proportion of the Foreground and the Background
'year': Year of the scientific publication.
'over_under': Overrepresented (o) or underrepresented (u).
's_value': The s value is a combination of (minus log) p value and effect size.
'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.
'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.
'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).
'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.
'background_count': The BackGround count is analogous to the “FG count” for the Background.
'background_n': BackGround n is analogous to “FG n”.
'foreground_ids': ForeGround IDentifierS are semicolon separated protein identifiers of the Foreground that are associated with the given term.
'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.
'etype': Short for “Entity type”, numeric internal identifier for different functional categories.
- Return type
pandas.DataFrame
- go_compare_samples(fg_sample, bg_sample)
Gene Ontology Enrichment Analysis without abundance correction. Uses the aGOtool API: https://agotool.org
- Parameters
fg_sample (str) – name of the foreground sample
bg_sample (str) – name of the background sample
- Returns
DataFrame with the following columns:
'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.
'term': A unique identifier for a specific functional category.
'description': A short description (or title) of a functional term.
'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).
'effect size': Proportion of the Foreground and the Background
'year': Year of the scientific publication.
'over_under': Overrepresented (o) or underrepresented (u).
's_value': The s value is a combination of (minus log) p value and effect size.
'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.
'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.
'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).
'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.
'background_count': The BackGround count is analogous to the “FG count” for the Background.
'background_n': BackGround n is analogous to “FG n”.
'foreground_ids': ForeGround IDentifierS are semicolon separated protein identifiers of the Foreground that are associated with the given term.
'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.
'etype': Short for “Entity type”, numeric internal identifier for different functional categories.
- Return type
pandas.DataFrame
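The count and ratio columns described above, together with a Fisher's exact test, can be sketched for a single term as follows. This is only an illustration: the counts are invented, `enrichment_stats` is a hypothetical helper, and treating the foreground and background as two disjoint protein sets in a one-sided (over-representation) test is an assumption made here. The actual computation happens on the aGOtool server and may differ.

```python
from math import comb


def enrichment_stats(fg_count, fg_n, bg_count, bg_n):
    """Return (ratio_in_foreground, ratio_in_background, one-sided p value)."""
    ratio_fg = fg_count / fg_n
    ratio_bg = bg_count / bg_n

    # 2x2 table: rows = foreground/background, columns = with/without term.
    total = fg_n + bg_n
    annotated = fg_count + bg_count

    # Hypergeometric upper tail: probability of seeing at least fg_count
    # annotated proteins in the foreground if annotation were independent
    # of group membership.
    upper = min(annotated, fg_n)
    p_value = sum(
        comb(annotated, k) * comb(total - annotated, fg_n - k)
        for k in range(fg_count, upper + 1)
    ) / comb(total, fg_n)
    return ratio_fg, ratio_bg, p_value


# Hypothetical counts for one functional term.
ratio_fg, ratio_bg, p_value = enrichment_stats(
    fg_count=3, fg_n=4, bg_count=1, bg_n=4
)
print(ratio_fg, ratio_bg, p_value)
```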
- go_genome(tax_id=9606, fg_sample=None, protein_list=None)
Gene Ontology Enrichment Analysis using a background from UniProt Reference Proteomes. Uses the aGOtool API: https://agotool.org
- Parameters
tax_id (int, optional) – NCBI taxon identifier used as background. Defaults to 9606 (=Homo sapiens).
fg_sample (str, optional) – name of sample used as foreground. Defaults to None.
protein_list (list, optional) – list of enriched protein ids in the foreground sample. Defaults to None.
- Returns
DataFrame with the following columns:
'rank': The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.
'term': A unique identifier for a specific functional category.
'description': A short description (or title) of a functional term.
'p value corrected': p value without multiple testing correction, stemming from either Fisher’s exact test or Kolmogorov-Smirnov test (only for “Gene Ontology Cellular Component TEXTMINING”, “Brenda Tissue Ontology”, and “Disease Ontology” since these are based on a continuous score from text mining rather than a binary classification).
'effect size': Proportion of the Foreground and the Background
'year': Year of the scientific publication.
'over_under': Overrepresented (o) or underrepresented (u).
's_value': The s value is a combination of (minus log) p value and effect size.
'ratio_in_foreground': The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.
'ratio_in_background': The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.
'foreground_count': The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).
'foreground_n': ForeGround n is comprised of the number of input proteins for the Foreground.
'background_count': The BackGround count is analogous to the “FG count” for the Background.
'background_n': BackGround n is analogous to “FG n”.
'foreground_ids': ForeGround IDentifierS are semicolon separated protein identifiers of the Foreground that are associated with the given term.
'background_ids': BackGround IDentifierS are analogous to “FG IDs” for the Background.
'etype': Short for “Entity type”, numeric internal identifier for different functional categories.
- Return type
pandas.DataFrame
- load_metadata(file_path)[source]
Load metadata from an xlsx, csv or txt file
- Parameters
file_path (str) – path to metadata file
- plot_clustermap(**kwargs)
- plot_correlation_matrix(method='pearson')
Plot Correlation Matrix
- Parameters
method (str, optional) – Correlation coefficient: “pearson”, “kendall” (Kendall Tau correlation) or “spearman”. Defaults to “pearson”.
- Returns
Correlation matrix
- Return type
plotly.graph_objects._figure.Figure
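As a rough picture of the values behind such a plot, the sketch below builds a sample-by-sample correlation matrix with the “pearson” and “spearman” methods in plain Python. The sample names and intensities are invented, all helper functions are hypothetical, and “kendall” is left out for brevity; the library itself may compute the matrix differently.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_matrix(samples, method="pearson"):
    """Nested dict of pairwise correlations between samples."""
    corr = {"pearson": pearson, "spearman": spearman}[method]
    return {a: {b: corr(samples[a], samples[b]) for b in samples} for a in samples}


# Hypothetical intensities: one list of protein intensities per sample.
samples = {
    "sample_1": [1.0, 2.0, 3.0, 4.0],
    "sample_2": [2.0, 4.0, 6.0, 8.0],
    "sample_3": [1.0, 4.0, 9.0, 16.0],
}
pearson_matrix = correlation_matrix(samples, method="pearson")
spearman_matrix = correlation_matrix(samples, method="spearman")
print(pearson_matrix["sample_1"]["sample_3"])
print(spearman_matrix["sample_1"]["sample_3"])
```

In this toy data sample_3 is a monotonic but non-linear function of sample_1, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower.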
- plot_dendrogram(**kwargs)
- plot_imputed_values()
- plot_intensity(protein_id, group=None, subgroups=None, method='box', add_significance=False, log_scale=False, compare_preprocessing_modes=False)
Plot Intensity of individual Protein/ProteinGroup
- Parameters
protein_id (str) – ProteinGroup ID
group (str, optional) – A metadata column used for grouping. Defaults to None.
subgroups (list, optional) – Select variables from the group column. Defaults to None.
method (str, optional) – Violinplot = “violin”, Boxplot = “box”, Scatterplot = “scatter” or “all”. Defaults to “box”.
add_significance (bool, optional) – add p-value bar, only possible when two groups are compared. Defaults to False.
log_scale (bool, optional) – yaxis in logarithmic scale. Defaults to False.
- Returns
Plotly Plot
- Return type
plotly.graph_objects._figure.Figure
- plot_pca(**kwargs)
- plot_sampledistribution(method='violin', color=None, log_scale=False)
Plot the intensity distribution for each sample, as either a violin plot or a box plot
- Parameters
method (str, optional) – Violinplot = “violin”, Boxplot = “box”. Defaults to “violin”.
color (str, optional) – A metadata column used to color the boxes. Defaults to None.
log_scale (bool, optional) – yaxis in logarithmic scale. Defaults to False.
- Returns
Plotly Sample Distribution Plot
- Return type
plotly.graph_objects._figure.Figure
- plot_samplehistograms()
Plots the density distribution of each sample
- Returns
Plotly Graph Object
- Return type
plotly
- plot_tsne(**kwargs)
- plot_umap(**kwargs)
- plot_volcano(group1, group2, column: Optional[str] = None, method: str = 'ttest', labels: bool = False, min_fc: float = 1.0, alpha: float = 0.05, draw_line: bool = True, perm: int = 100, fdr: float = 0.05, compare_preprocessing_modes: bool = False, color_list: list = [])
Plot Volcano Plot
- Parameters
column (str) – column name in the metadata file with the two groups to compare
group1 (str/list) – name of group to compare needs to be present in column or list of sample names to compare
group2 (str/list) – name of group to compare needs to be present in column or list of sample names to compare
method (str) – “anova”, “wald”, “ttest”, “SAM”. Defaults to “ttest”.
labels (bool) – Add text labels to significant proteins. Defaults to False.
alpha (float, optional) – p-value cut off. Defaults to 0.05.
min_fc (float) – Minimum fold change. Defaults to 1.0.
draw_line (bool) – whether to draw cut off lines. Defaults to True.
perm (int, optional) – number of permutations when using SAM as method. Defaults to 100.
fdr (float, optional) – FDR cut off when using SAM as method. Defaults to 0.05.
color_list (list) – list with ProteinIDs that should be highlighted.
compare_preprocessing_modes (bool) – Will iterate through normalization and imputation modes and return a list of VolcanoPlots in different settings, Default False.
- Returns
Volcano Plot
- Return type
plotly.graph_objects._figure.Figure
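How the alpha and min_fc cut offs might partition the points of a volcano plot can be sketched as follows. This is a guess at the general technique, not the library's logic: it assumes that alpha is compared against the uncorrected p-value and that min_fc is compared against the absolute log2 fold change. The result rows and both helper functions are hypothetical.

```python
from math import log10


def classify(log2fc, pval, alpha=0.05, min_fc=1.0):
    """Label one protein as 'up', 'down' or 'non_sig'."""
    # Assumption: both cut offs must be passed for a protein to be highlighted.
    if pval < alpha and log2fc >= min_fc:
        return "up"
    if pval < alpha and log2fc <= -min_fc:
        return "down"
    return "non_sig"


def volcano_points(results, alpha=0.05, min_fc=1.0):
    """Turn (protein id, log2fc, pval) rows into plot coordinates and labels."""
    return [
        {
            "Protein ID": protein_id,
            "x": log2fc,        # horizontal axis: log2 fold change
            "y": -log10(pval),  # vertical axis: -log10(p-value)
            "label": classify(log2fc, pval, alpha, min_fc),
        }
        for protein_id, log2fc, pval in results
    ]


# Hypothetical differential expression results.
results = [
    ("P1", 2.0, 0.001),
    ("P2", -1.5, 0.01),
    ("P3", 0.2, 0.0001),
    ("P4", 3.0, 0.5),
]
points = volcano_points(results)
labels = [point["label"] for point in points]
print(labels)
```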
- preprocess(log2_transform: bool = True, remove_contaminations: bool = False, subset: bool = False, normalization: str = None, imputation: str = None, remove_samples: list = None)
Preprocess Protein data
Removal of contaminations:
Removes all observations that were identified as contaminations.
Normalization:
“zscore”, “quantile”, “linear”, “vst”
Normalize data using z-score, quantile, linear (using the l2 norm) or variance stabilization transformation (vst) normalization.
Z-score normalization equals standardization using StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Variance stabilization transformation uses PowerTransformer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
For more information, see scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
Imputation:
“mean”, “median”, “knn” or “randomforest”. For more information, see:
SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
k-Nearest Neighbors Imputation: https://scikit-learn.org/stable/modules/impute.html#impute
Random Forest Imputation: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
- Parameters
remove_contaminations (bool, optional) – remove ProteinGroups that are identified as contamination. Defaults to False.
log2_transform (bool, optional) – Log2 transform data. Defaults to True.
normalization (str, optional) – method to normalize data: either “zscore”, “quantile”, “linear” or “vst”. Defaults to None.
remove_samples (list, optional) – list with sample ids to remove. Defaults to None.
imputation (str, optional) – method to impute data: either “mean”, “median”, “knn” or “randomforest”. Defaults to None.
subset (bool, optional) – filter the matrix so that only samples described in the metadata are kept. Defaults to False.
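The effect of three of these options on a single vector of intensities can be sketched with the standard library: log2 transformation, “mean” imputation and “zscore” normalization. The order of the steps, the use of NaN for missing values and the choice to work on one vector at a time are assumptions made for this illustration, and the helper functions are hypothetical. The z-score uses the population standard deviation, which matches scikit-learn's StandardScaler.

```python
from math import isnan, log2, nan
from statistics import mean, pstdev


def log2_transform(values):
    """Log2 transform, leaving missing values (NaN) untouched."""
    return [v if isnan(v) else log2(v) for v in values]


def impute_mean(values):
    """Replace missing values with the mean of the observed values."""
    fill = mean(v for v in values if not isnan(v))
    return [fill if isnan(v) else v for v in values]


def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical raw intensities with one missing value.
raw = [4.0, 16.0, nan, 64.0]

logged = log2_transform(raw)
imputed = impute_mean(logged)
scaled = zscore(imputed)
print(imputed)
print(scaled)
```

Because the missing value is filled with the mean of the observed values, imputation leaves the mean unchanged but lowers the variance, so the result depends on whether normalization runs before or after imputation.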
- preprocess_print_info()
Print summary of preprocessing steps
- reset_preprocessing()
Reset all preprocessing steps
- tukey_test(protein_id: str, group: str, df: DataFrame = None) DataFrame
Calculate pairwise Tukey-HSD post-hoc test. Wrapper around: https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html#pingouin.pairwise_tukey
- Parameters
protein_id (str) – ProteinID to calculate the pairwise Tukey-HSD post-hoc test for (dependent variable)
group (str) – A metadata column used to calculate the pairwise Tukey test
df (pandas.DataFrame, optional) – Defaults to None.
- Returns
'A': Name of first measurement
'B': Name of second measurement
'mean(A)': Mean of first measurement
'mean(B)': Mean of second measurement
'diff': Mean difference (= mean(A) - mean(B))
'se': Standard error
'T': T-values
'p-tukey': Tukey-HSD corrected p-values
'hedges': Hedges effect size (or any effect size defined in effsize)
'comparison': combination of measurements
'Protein ID': ProteinID/ProteinGroup
- Return type
pandas.DataFrame