
[Stable] Identify and flag outliers based on expected number of raw reads per pool.

Usage

outliers_by_pool_fragments(
  metadata,
  key = "BARCODE_MUX",
  outlier_p_value_threshold = 0.01,
  normality_test = FALSE,
  normality_p_value_threshold = 0.05,
  transform_log2 = TRUE,
  per_pool_test = TRUE,
  pool_col = "PoolID",
  min_samples_per_pool = 5,
  flag_logic = "AND",
  keep_calc_cols = TRUE,
  report_path = default_report_path()
)

Arguments

metadata

The metadata data frame

key

A character vector with the names of the numeric columns to test

outlier_p_value_threshold

The p value threshold for a read to be considered an outlier

normality_test

Perform normality test? Normality is assessed for each column in the key using the Shapiro-Wilk test; if the values do not follow a normal distribution, the other calculations are skipped

normality_p_value_threshold

The p-value threshold for the normality test - relevant only if normality_test = TRUE

transform_log2

Perform a log2 transformation on values prior to the actual calculations?

per_pool_test

Perform the test for each pool?

pool_col

A character vector of the names of the columns that uniquely identify a pool

min_samples_per_pool

The minimum number of samples that a pool needs to contain in order to be processed - relevant only if per_pool_test = TRUE

flag_logic

A character vector of logic operators to obtain a global flag formula - only relevant if the key is longer than one. All operators must be chosen among: AND, OR, XOR, NAND, NOR, XNOR

keep_calc_cols

Keep the calculation columns in the output data frame?

report_path

The path where the report file should be saved. Can be a folder, a file or NULL if no report should be produced. Defaults to {user_home}/ISAnalytics_reports.

Value

A data frame of metadata with the column to_remove

Details

Modular structure

The outlier filtering functions are structured in a modular fashion. There are two kinds of functions:

  • Outlier tests - Functions that perform some kind of calculation based on inputs and flag metadata

  • Outlier filter - A function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers

This function is an outlier test, and calculates for each column in the key

  • The zscore of the values

  • The tstudent of the values

  • The associated p-value (tdist)

Optionally the test can be performed for each pool and a normality test can be run prior to the actual calculations. Samples are flagged if this condition is met:

  • tdist < outlier_p_value_threshold & zscore < 0
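The condition above can be illustrated with a minimal base R sketch. This page does not state the exact formulas the package uses internally, so the sketch makes some assumptions: the z-score is converted to a t statistic with the Grubbs-type relation on n - 2 degrees of freedom, and the p-value is two-sided. The helper name flag_low_outliers is hypothetical and is not part of the package.

```r
# Hypothetical helper, NOT the package implementation.
# Assumptions: Grubbs-type z-to-t conversion, two-sided p-value.
flag_low_outliers <- function(x, p_threshold = 0.01, transform_log2 = TRUE) {
  x <- x[!is.na(x)]                  # drop missing values
  if (transform_log2) {
    x <- log2(x[x > 0])              # log2 is undefined for values <= 0
  }
  n <- length(x)
  zscore <- (x - mean(x)) / sd(x)
  # t statistic derived from the z-score (assumed formula)
  tstudent <- sqrt((n * (n - 2) * zscore^2) / ((n - 1)^2 - n * zscore^2))
  # two-sided p-value on n - 2 degrees of freedom (assumed)
  tdist <- 2 * pt(tstudent, df = n - 2, lower.tail = FALSE)
  data.frame(
    value = x,
    zscore = zscore,
    tstudent = tstudent,
    tdist = tdist,
    # flag only values that are significantly LOWER than expected
    flag = tdist < p_threshold & zscore < 0
  )
}

# Toy pool: six samples with similar read counts and one very low one
reads <- c(1000, 1100, 950, 1050, 980, 1020, 10)
res <- flag_low_outliers(reads)
res[res$flag, ]
```

Because of the zscore < 0 part of the condition, only samples with unusually low values are flagged; unusually high values are never flagged, whatever their p-value.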

If the key contains more than one column, an additional flag logic can be specified for combining the results. Example: suppose the key contains the names of two columns, X and Y, so that key = c("X", "Y"). If we specify the argument flag_logic = "AND", the reads will be flagged based on this global condition: (tdist_X < outlier_p_value_threshold & zscore_X < 0) AND (tdist_Y < outlier_p_value_threshold & zscore_Y < 0)

The user can specify one or more logical operators that will be applied in sequence.
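As a rough illustration of applying operators in sequence, the base R sketch below folds a list of per-column flags from left to right. The function combine_flags is hypothetical and is not part of the package; the left-to-right evaluation order and the requirement of exactly one operator per pair of adjacent columns are assumptions made for this sketch.

```r
# Hypothetical helper, NOT the package implementation.
# Assumption: operators are applied left to right, one per adjacent pair.
combine_flags <- function(flags, logic) {
  ops <- list(
    AND  = function(a, b) a & b,
    OR   = function(a, b) a | b,
    XOR  = function(a, b) xor(a, b),
    NAND = function(a, b) !(a & b),
    NOR  = function(a, b) !(a | b),
    XNOR = function(a, b) !xor(a, b)
  )
  stopifnot(
    length(flags) >= 2,
    length(logic) == length(flags) - 1,
    all(logic %in% names(ops))
  )
  result <- flags[[1]]
  for (i in seq_along(logic)) {
    # combine the running result with the flag of the next column
    result <- ops[[logic[i]]](result, flags[[i + 1]])
  }
  result
}

# Per-column flags for four samples and three hypothetical columns X, Y, Z
flag_x <- c(TRUE, TRUE, FALSE, FALSE)
flag_y <- c(TRUE, FALSE, TRUE, FALSE)
flag_z <- c(FALSE, FALSE, TRUE, TRUE)

two_cols <- combine_flags(list(flag_x, flag_y), "AND")
three_cols <- combine_flags(list(flag_x, flag_y, flag_z), c("AND", "OR"))
```

With three columns and c("AND", "OR"), the sketch evaluates (flag_x AND flag_y) OR flag_z.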

Examples

data("association_file", package = "ISAnalytics")
flagged <- outliers_by_pool_fragments(association_file,
    report_path = NULL
)
#> Removing NAs from data...
#> Log2 transformation, removing values <= 0
head(flagged)
#> # A tibble: 6 × 91
#>   ProjectID FUSIONID  PoolID TagSequence SubjectID VectorType VectorID
#>   <chr>     <chr>     <chr>  <chr>       <chr>     <chr>      <chr>   
#> 1 PJ01      ET#382.46 POOL01 LTR75LC38   PT001     lenti      GLOBE   
#> 2 PJ01      ET#381.40 POOL01 LTR53LC32   PT001     lenti      GLOBE   
#> 3 PJ01      ET#381.9  POOL01 LTR83LC66   PT001     lenti      GLOBE   
#> 4 PJ01      ET#381.71 POOL01 LTR27LC94   PT001     lenti      GLOBE   
#> 5 PJ01      ET#381.2  POOL01 LTR69LC52   PT001     lenti      GLOBE   
#> 6 PJ01      ET#382.28 POOL01 LTR37LC2    PT001     lenti      GLOBE   
#> # ℹ 84 more variables: ExperimentID <chr>, Tissue <chr>, TimePoint <chr>,
#> #   DNAFragmentation <chr>, PCRMethod <chr>, TagIDextended <chr>,
#> #   Keywords <chr>, CellMarker <chr>, TagID <chr>, NGSProvider <chr>,
#> #   NGSTechnology <chr>, ConverrtedFilesDir <chr>, ConverrtedFilesName <chr>,
#> #   SourceFileFolder <chr>, SourceFileNameR1 <chr>, SourceFileNameR2 <chr>,
#> #   DNAnumber <chr>, ReplicateNumber <int>, DNAextractionDate <date>,
#> #   DNAngUsed <dbl>, LinearPCRID <chr>, LinearPCRDate <date>, …