This is a QC report based on the following data:
Here we visualize the overall PCs based on just the data matrix of probes by samples and the NPX values contained. We have colored each point arbitrarily by the first metadata variable that is available (to limit having hard requirements on the metadata). For a more detailed breakdown of technical metadata that correlate with proteomic variation, please see the last section of this report for a statistical test of significant covariates.
<- plot_ly(merged.meta.df,
fig.pca x = ~PC1, y = ~PC2, text = ~SampleID,
type ="scatter",
mode = "markers",
#name = as.formula(paste("~", meta.variables[1])),
color = as.formula(paste("~", meta.variables[1])),
marker = list(size = 10)
%>% layout(
) title = "PCA Overview of Olink Dataset",
legend = list(title = list(text = paste("<b>", meta.variables[1], "</b>"))),
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2")
)
fig.pca
Missingness across samples are measured by the proportion of assays that fail Olink’s internal QC measures. Full details on how to interpret their QC metrics and warnings can be found here: https://www.olink.com/faq/how-is-quality-control-of-the-data-performed/.
QC Warnings from Olink are provided based on internal
controls and detection controls with respect to a given
96-well plate. The rule for determining if a QC Warning is emitted is if
any of these control values deviate more than
0.3 NPX
from the plate median. As a general rule, if more
than 1/6th of the samples fail the QC then the run is deemed to be
unreliable and this should be noted and sent as immediate feedback to
the Olink team.
<- fread(snakemake@input[["merged_olink_data"]], sep = "\t")
olink.merged.df <- olink.merged.df %>%
qc.warning.samples.df group_by(SampleID) %>%
summarise(QC_Warning = sum(QC_Warning == "WARN"), N = n()) %>%
mutate(Fraction_QC_Warning = QC_Warning / N)
reactable(qc.warning.samples.df)
The Olink Explore platform allows one to evaluate proteomic variation across different panels (e.g. Neurology or Oncology). Since each 96-well plate has a limitation of 384 probes, each panel is effectively a different experiment. As these are the same samples, but different panels, each panel may exhibit different properties across samples, including missingness. Here we quantify the extent to which each sample has probes not passing QC on each of the panels to get a more granular view at how this varies across multiple panels.
<- olink.merged.df %>%
qc.warning.samples.panel.df group_by(SampleID, Panel) %>%
summarise(QC_Warning = sum(QC_Warning == "WARN"), N = n()) %>%
mutate(Fraction_QC_Warning = QC_Warning / N)
reactable(qc.warning.samples.panel.df)
We also display this missingness below across all panels to visualize individuals whose missingness may be driven largely by a single panel or if missingness is well-distributed across panels:
<- qc.warning.samples.panel.df %>%
warning.samples.panel.df pivot_wider(
id_cols = "SampleID",
names_from = Panel,
values_from = QC_Warning,
values_fn = mean
)
<- reshape2::melt(warning.samples.panel.df, id.vars = "SampleID") %>%
fig plot_ly(
x = ~SampleID, y = ~value, type = "bar",
name = ~variable, color = ~variable
%>%
) layout(
yaxis = list(title = "Count"),
xaxis = list(categoryorder = "total descending"),
legend = list(title = list(text = "<b> Panel </b>")),
barmode = "stack"
) fig
Outliers (either in terms of missingness or measurement) can appear
on a per-panel basis and we may want to filter specific individuals
per-panel. In this setting we are recreating the
olink_qc_plot
function from the OlinkAnalyze
package but for a more interactive view per-panel. Here we set the
threshold for outlier detection as being outside:
# Defining outliers based on medians and assay IQR
<- olink.merged.df %>%
olink.outliers.df inner_join(merged.meta.df, by = "SampleID") %>%
group_by(Panel, SampleID, Index) %>%
mutate(
iqr = IQR(NPX, na.rm = TRUE),
sample_median = median(NPX, na.rm = TRUE),
pc1_mean = mean(PC1, na.rm = TRUE),
pc2_mean = mean(PC2, na.rm = TRUE)
%>%
) ungroup() %>%
select(SampleID, Index, Panel, iqr, sample_median, pc1_mean, pc2_mean, PC1, PC2) %>%
distinct() %>%
group_by(Panel) %>%
mutate(
pc1_outlier_low = mean(pc1_mean, na.rm = TRUE) - pc.sd * sd(pc1_mean, na.rm = TRUE),
pc1_outlier_high = mean(pc1_mean, na.rm = TRUE) + pc.sd * sd(pc1_mean, na.rm = TRUE),
pc2_outlier_low = mean(pc2_mean, na.rm = TRUE) - pc.sd * sd(pc2_mean, na.rm = TRUE),
pc2_outlier_high = mean(pc2_mean, na.rm = TRUE) + pc.sd * sd(pc2_mean, na.rm = TRUE),
median_low = mean(sample_median, na.rm = TRUE) - median.sd * sd(sample_median, na.rm = TRUE),
median_high = mean(sample_median, na.rm = TRUE) + median.sd * sd(sample_median, na.rm = TRUE),
iqr_low = mean(iqr, na.rm = TRUE) - iqr.sd * sd(iqr, na.rm = TRUE),
iqr_high = mean(iqr, na.rm = TRUE) + iqr.sd * sd(iqr, na.rm = TRUE),
%>%
) ungroup() %>%
mutate(valid_outliers = remove.outliers) %>%
mutate(outlier_range = ifelse(sample_median > median_low & sample_median < median_high & iqr > iqr_low & iqr < iqr_high & valid_outliers, FALSE, TRUE)) %>%
mutate(pc_outlier = ifelse(pc1_mean > pc1_outlier_low & pc1_mean < pc1_outlier_high & pc2_mean > pc2_outlier_low & pc2_mean < pc2_outlier_high & valid_outliers, FALSE, TRUE)) %>%
mutate(outlier = (outlier_range | pc_outlier) & valid_outliers)
<- unique(olink.outliers.df$Panel)
uniq.panels
<- plot_ly(olink.outliers.df %>% filter(Panel == "Neurology"),
fig1 x = ~sample_median,
y = ~iqr,
type = "scatter",
mode = "markers",
text = ~SampleID,
color = ~outlier,
marker = list(size = 10)
%>%
) layout(
yaxis = list(title = "IQR"),
xaxis = list(title = "Sample Median"),
legend = list(title = list(text = "<b> Outlier </b>")),
title = "Neurology"
)
<- plot_ly(olink.outliers.df %>% filter(Panel == "Cardiometabolic"),
fig2 x = ~sample_median,
y = ~iqr,
type = "scatter",
mode = "markers",
text = ~SampleID,
color = ~outlier,
marker = list(size = 10)
%>%
) layout(
yaxis = list(title = "IQR"),
xaxis = list(title = "Sample Median"),
legend = list(title = list(text = "<b> Outlier </b>")),
title = "Cardiometabolic"
)
<- plot_ly(olink.outliers.df %>% filter(Panel == "Oncology"),
fig3 x = ~sample_median,
y = ~iqr,
type = "scatter",
mode = "markers",
text = ~SampleID,
color = ~outlier,
marker = list(size = 10)
%>%
) layout(
yaxis = list(title = "IQR"),
xaxis = list(title = "Sample Median"),
legend = list(title = list(text = "<b> Outlier </b>")),
title = "Oncology"
)
<- plot_ly(olink.outliers.df %>% filter(Panel == "Inflammation"),
fig4 x = ~sample_median,
y = ~iqr,
type = "scatter",
mode = "markers",
text = ~SampleID,
color = ~outlier,
marker = list(size = 10)
%>%
) layout(
yaxis = list(title = "IQR"),
xaxis = list(title = "Sample Median"),
legend = list(title = list(text = "<b> Outlier </b>")),
title = "Inflammation"
)
fig1
fig2
fig3
fig4
The PCA-based outliers are also visualized in the following plot. The rule to determine outliers is based on a sample being outside 3 outside standard deviations from the means of PC1 and PC2
<- plot_ly(olink.outliers.df,
fig x = ~PC1,
y = ~PC2,
type = "scatter",
mode = "markers",
text = ~SampleID,
color = ~pc_outlier,
marker = list(size = 10)
%>%
) layout(
yaxis = list(title = "PC2"),
xaxis = list(title = "PC1"),
legend = list(title = list(text = "<b> Outlier </b>")),
title = "PCA-based Outliers"
)
fig
Based on the data presented above, we recommend removing the following Sample / Panel combinations:
<- olink.outliers.df %>%
olink.outliers.filt.df filter(outlier == TRUE) %>%
select(SampleID, Panel, outlier_range, pc_outlier)
reactable(olink.outliers.filt.df)
Olink also provides a statement of whether each Assay passes their QC checks across all individuals, based on the Incubation Control and the Amplification Control setups per-plate. If an assay has an “Assay Warning”, then we will want to remove the assay from downstream analysis like differential expression or pQTL analyses. The Olink team recommends that we remove any probes that show an assay warning from downstream analyses.
if ("Assay_Warning" %in% colnames(olink.merged.df)) {
<- olink.merged.df %>%
missingness.olink.id.assays group_by(OlinkID) %>%
summarise(Assay_Warning = sum(Assay_Warning == "WARN"), N = n()) %>%
mutate(Fraction_Assay_Warning = Assay_Warning / N)
reactable(missingness.olink.id.assays)
else {
} cat(paste(getwd(), snakemake@input[["merged_olink_data"]], " does not contain Assay Warnings!"))
}
Missingness across Olink probes can be determined by evaluating if there are a large fraction of samples that are below the Limit of Detection (LOD) specified by Olink for a specific assay. Olink does not provide a recommendation for excluding assays that have a high-proportion of missingness, but this is a useful distribution to know for where we may want to filter out several results. The recommendation from Olink is to not remove data below LOD, but this distribution can still be helpful in understanding the quality of your data.
<- olink.merged.df %>%
missingness.olink.id group_by(OlinkID) %>%
summarise(
MissingFreq = median(MissingFreq),
LOD = mean(LOD),
Assay = first(Assay),
Panel = first(Panel),
Normalization = first(Normalization)
)
reactable(missingness.olink.id)
<- missingness.olink.id %>% arrange(desc(MissingFreq))
missingness.olink.id.sort
<- plot_ly(missingness.olink.id.sort, x = ~OlinkID, y = ~MissingFreq, text = ~Assay, name = ~Panel, color = I("black"), type = "bar")
fig <- fig %>%
fig layout(
autosize = TRUE,
showlegend = FALSE,
title = "Distribution of Missingness across Olink Probes",
xaxis = list(title = "Olink Probe ID", categoryorder = "total ascending"),
yaxis = list(title = "Proportion of Missing Values (NPX < LOD)")
)
fig
To evaluate if protein expression co-varies with technical covariates, we also take a look at the association between provided technical covariates and PCs computed using the Olink data matrix. Principal components are denoted as “PC” within the table.
# Evaluating quality control via an anova of technical factors ...
<- fread(snakemake@input[["anova_table"]], sep = "\t")
anova.df
reactable(anova.df)
Some commonly answered questions regarding processing and quality control of Olink data. We suggest the reader to read these to gain a little more insight into the data generated and recommendations for filtering.
The following summarizes the loaded R configuration for the run that created this report.
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Amazon Linux 2
##
## Matrix products: default
## BLAS/LAPACK: /home/ec2-user/pipelines/proteomics/54gene-olink-qc/.snakemake/conda/dce01d8475ee1861579de54f6c914635_/lib/libopenblasp-r0.3.21.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reactable_0.3.0 plotly_4.10.1 data.table_1.14.6 RColorBrewer_1.1-3
## [5] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.5
## [9] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0
## [13] tidyverse_1.3.2 knitr_1.41
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.4 sass_0.4.4 jsonlite_1.8.3
## [4] viridisLite_0.4.1 R.utils_2.12.2 modelr_0.1.10
## [7] bslib_0.4.1 assertthat_0.2.1 googlesheets4_1.0.1
## [10] cellranger_1.1.0 yaml_2.3.6 pillar_1.8.1
## [13] backports_1.4.1 glue_1.6.2 digest_0.6.30
## [16] rvest_1.0.3 colorspace_2.0-3 htmltools_0.5.3
## [19] R.oo_1.25.0 plyr_1.8.8 reactR_0.4.4
## [22] pkgconfig_2.0.3 broom_1.0.1 haven_2.5.1
## [25] scales_1.2.1 tzdb_0.3.0 googledrive_2.0.0
## [28] generics_0.1.3 farver_2.1.1 ellipsis_0.3.2
## [31] cachem_1.0.6 withr_2.5.0 lazyeval_0.2.2
## [34] cli_3.4.1 magrittr_2.0.3 crayon_1.5.2
## [37] readxl_1.4.1 evaluate_0.18 R.methodsS3_1.8.2
## [40] fs_1.5.2 fansi_1.0.3 xml2_1.3.3
## [43] tools_4.2.2 hms_1.1.2 gargle_1.2.1
## [46] lifecycle_1.0.3 munsell_0.5.0 reprex_2.0.2
## [49] compiler_4.2.2 jquerylib_0.1.4 rlang_1.0.6
## [52] grid_4.2.2 htmlwidgets_1.5.4 crosstalk_1.2.0
## [55] rmarkdown_2.18 gtable_0.3.1 DBI_1.1.3
## [58] reshape2_1.4.4 R6_2.5.1 lubridate_1.8.0
## [61] fastmap_1.1.0 utf8_1.2.2 stringi_1.7.8
## [64] Rcpp_1.0.9 vctrs_0.5.1 dbplyr_2.2.1
## [67] tidyselect_1.2.0 xfun_0.35