Identify and flag outliers based on pool fragments.
Source: R/outlier-filtering.R
outliers_by_pool_fragments.Rd
Usage
outliers_by_pool_fragments(
  metadata,
  key = "BARCODE_MUX",
  outlier_p_value_threshold = 0.01,
  normality_test = FALSE,
  normality_p_value_threshold = 0.05,
  transform_log2 = TRUE,
  per_pool_test = TRUE,
  pool_col = "PoolID",
  min_samples_per_pool = 5,
  flag_logic = "AND",
  keep_calc_cols = TRUE,
  report_path = default_report_path()
)
Arguments
- metadata
The metadata data frame
- key
A character vector of numeric column names
- outlier_p_value_threshold
The p-value threshold for a read to be considered an outlier
- normality_test
Perform a normality test? Normality is assessed for each column in the key using the Shapiro-Wilk test; if the values do not follow a normal distribution, the other calculations are skipped
- normality_p_value_threshold
The p-value threshold for the normality test
- transform_log2
Perform a log2 transformation on the values prior to the actual calculations?
- per_pool_test
Perform the test for each pool?
- pool_col
A character vector of the names of the columns that uniquely identify a pool
- min_samples_per_pool
The minimum number of samples that a pool needs to contain in order to be processed - relevant only if per_pool_test = TRUE
- flag_logic
A character vector of logic operators used to obtain a global flag formula - only relevant if the key is longer than one. All operators must be chosen among: AND, OR, XOR, NAND, NOR, XNOR
- keep_calc_cols
Keep the calculation columns in the output data frame?
- report_path
The path where the report file should be saved. Can be a folder, a file or NULL if no report should be produced. Defaults to {user_home}/ISAnalytics_reports.
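The two optional pre-checks described by normality_test and min_samples_per_pool can be sketched in base R as follows. Both helper names are hypothetical and not part of ISAnalytics. The sketch also assumes that a column counts as normal when the Shapiro-Wilk p-value is at or above the threshold, which the argument description does not state explicitly.

```r
# Hypothetical helpers for illustration; not part of ISAnalytics.

# TRUE when the Shapiro-Wilk test does not reject normality at the given
# threshold (assumed direction of the comparison), NA when the test
# cannot be run on the input.
passes_normality <- function(x, normality_p_value_threshold = 0.05) {
  x <- x[!is.na(x)]
  # shapiro.test() needs 3 to 5000 non-missing, non-identical values
  if (length(x) < 3 || length(x) > 5000 || diff(range(x)) == 0) {
    return(NA)
  }
  stats::shapiro.test(x)$p.value >= normality_p_value_threshold
}

# Split the metadata by pool and keep only the pools that contain at
# least min_samples_per_pool rows.
pools_to_process <- function(metadata,
                             pool_col = "PoolID",
                             min_samples_per_pool = 5) {
  by_pool <- split(metadata, metadata[pool_col], drop = TRUE)
  Filter(function(df) nrow(df) >= min_samples_per_pool, by_pool)
}

# Toy input: pool "A" has 6 samples, pool "B" only 2
toy_meta <- data.frame(
  PoolID = c(rep("A", 6), rep("B", 2)),
  BARCODE_MUX = c(1000, 1100, 950, 1050, 1020, 980, 500, 700)
)
kept_pools <- pools_to_process(toy_meta)
```

In this toy input only pool "A" reaches the default minimum of 5 samples, so pool "B" would be left out of a per-pool test.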
Details
Modular structure
The outlier filtering functions are structured in a modular fashion. There are two kinds of functions:
Outlier tests - Functions that perform some kind of calculation based on inputs and flag metadata
Outlier filter - A function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers
This function is an outlier test and calculates, for each column in the key:
The zscore of the values
The tstudent of the values
The associated p-value (tdist)
Optionally, the test can be performed for each pool and a normality test can be run prior to the actual calculations. Samples are flagged if this condition holds:
tdist < outlier_p_value_threshold & zscore < 0
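The condition above can be sketched in base R for a single key column. The helper name flag_low_outliers is hypothetical, and the Grubbs-style conversion from z-score to t statistic (with n - 2 degrees of freedom) is an assumption made for illustration: this page does not spell out the formulas the package uses, so the actual values of tstudent and tdist may differ.

```r
# Hypothetical helper for illustration; not part of ISAnalytics.
flag_low_outliers <- function(x,
                              outlier_p_value_threshold = 0.01,
                              transform_log2 = TRUE) {
  x <- x[!is.na(x)]
  if (transform_log2) {
    # log2 is undefined for values <= 0, so those are dropped first
    x <- log2(x[x > 0])
  }
  n <- length(x)
  zscore <- (x - mean(x)) / stats::sd(x)
  # ASSUMPTION: Grubbs-style t statistic derived from the z-score.
  # pmax() guards against tiny negative denominators from rounding.
  denom <- pmax((n - 1)^2 - n * zscore^2, 0)
  tstudent <- sqrt(n * (n - 2) * zscore^2 / denom)
  # Two-sided p-value from the t distribution
  tdist <- 2 * stats::pt(-abs(tstudent), df = n - 2)
  data.frame(
    value = x,
    zscore = zscore,
    tstudent = tstudent,
    tdist = tdist,
    # Only unusually LOW values are flagged (zscore < 0)
    flag = tdist < outlier_p_value_threshold & zscore < 0
  )
}

# Toy data: one sample with far fewer fragments than the others
res <- flag_low_outliers(c(1000, 1100, 950, 1050, 1020, 980, 1, 1010))
```

Because of the zscore < 0 term, a sample with an unusually high value is never flagged, whatever its p-value.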
If the key contains more than one column, an additional flag logic can be specified for combining the results.
Example:
Suppose the key contains the names of two columns, X and Y:
key = c("X", "Y")
If we specify the argument flag_logic = "AND", then the reads will be flagged based on this global condition:
(tdist_X < outlier_p_value_threshold & zscore_X < 0) AND
(tdist_Y < outlier_p_value_threshold & zscore_Y < 0)
The user can specify one or more logical operators that will be applied in sequence.
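One way to read "applied in sequence" is as a left-to-right fold over the per-column flags, as in the sketch below. The function combine_flags is hypothetical and this interpretation is an assumption. In particular, the sketch requires exactly one operator between each pair of consecutive columns, which may not match how the package handles a flag_logic vector of a different length.

```r
# Hypothetical helper for illustration; not part of ISAnalytics.
combine_flags <- function(flags, flag_logic) {
  operators <- list(
    AND  = function(a, b) a & b,
    OR   = function(a, b) a | b,
    XOR  = function(a, b) xor(a, b),
    NAND = function(a, b) !(a & b),
    NOR  = function(a, b) !(a | b),
    XNOR = function(a, b) !xor(a, b)
  )
  stopifnot(
    all(flag_logic %in% names(operators)),
    # ASSUMPTION: one operator between each pair of consecutive columns
    length(flag_logic) == length(flags) - 1
  )
  result <- flags[[1]]
  for (i in seq_along(flag_logic)) {
    # Fold left to right: ((X op1 Y) op2 Z) ...
    result <- operators[[flag_logic[i]]](result, flags[[i + 1]])
  }
  result
}

# Per-column flags for three samples, keyed by column name
flag_X <- c(TRUE, TRUE, FALSE)
flag_Y <- c(TRUE, FALSE, FALSE)
flag_Z <- c(FALSE, TRUE, FALSE)

both <- combine_flags(list(flag_X, flag_Y), "AND")
chained <- combine_flags(list(flag_X, flag_Y, flag_Z), c("AND", "OR"))
```

With a left fold, c("AND", "OR") over X, Y, Z evaluates (X AND Y) OR Z, so operator order matters when the operators are mixed.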
See also
Other Data cleaning and pre-processing:
aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
Examples
data("association_file", package = "ISAnalytics")
flagged <- outliers_by_pool_fragments(association_file,
    report_path = NULL
)
#> Removing NAs from data...
#> Log2 transformation, removing values <= 0
head(flagged)
#> # A tibble: 6 × 91
#>   ProjectID FUSIONID  PoolID TagSequence SubjectID VectorType VectorID
#>   <chr>     <chr>     <chr>  <chr>       <chr>     <chr>      <chr>
#> 1 PJ01      ET#382.46 POOL01 LTR75LC38   PT001     lenti      GLOBE
#> 2 PJ01      ET#381.40 POOL01 LTR53LC32   PT001     lenti      GLOBE
#> 3 PJ01      ET#381.9  POOL01 LTR83LC66   PT001     lenti      GLOBE
#> 4 PJ01      ET#381.71 POOL01 LTR27LC94   PT001     lenti      GLOBE
#> 5 PJ01      ET#381.2  POOL01 LTR69LC52   PT001     lenti      GLOBE
#> 6 PJ01      ET#382.28 POOL01 LTR37LC2    PT001     lenti      GLOBE
#> # ℹ 84 more variables: ExperimentID <chr>, Tissue <chr>, TimePoint <chr>,
#> # DNAFragmentation <chr>, PCRMethod <chr>, TagIDextended <chr>,
#> # Keywords <chr>, CellMarker <chr>, TagID <chr>, NGSProvider <chr>,
#> # NGSTechnology <chr>, ConverrtedFilesDir <chr>, ConverrtedFilesName <chr>,
#> # SourceFileFolder <chr>, SourceFileNameR1 <chr>, SourceFileNameR2 <chr>,
#> # DNAnumber <chr>, ReplicateNumber <int>, DNAextractionDate <date>,
#> # DNAngUsed <dbl>, LinearPCRID <chr>, LinearPCRDate <date>, …