Filter out outliers in metadata, identified by the chosen outlier test.
Source: R/outlier-filtering.R, outlier_filter.Rd
Usage
outlier_filter(
  metadata,
  pcr_id_col = pcr_id_column(),
  outlier_test = c(outliers_by_pool_fragments),
  outlier_test_outputs = NULL,
  combination_logic = c("AND"),
  negate = FALSE,
  report_path = default_report_path(),
  ...
)
Arguments
- metadata
The metadata data frame.
- pcr_id_col
The name of the PCR identifier column.
- outlier_test
One or more outlier tests. Must be functions, either from available_outlier_tests() or custom functions that produce an appropriate output format (see Details).
- outlier_test_outputs
NULL, a data frame or a list of data frames. See Details.
- combination_logic
One or more logical operators ("AND", "OR", "XOR", "NAND", "NOR", "XNOR"). See Details.
- negate
If TRUE, returns only the metadata that was flagged to be removed. If FALSE, returns only the metadata that wasn't flagged to be removed.
- report_path
The path where the report file should be saved. Can be a folder or NULL if no report should be produced. Defaults to {user_home}/ISAnalytics_reports.
- ...
Additional named arguments passed to the functions supplied in outlier_test.
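The effect of negate can be pictured with a small base R sketch. It assumes a test output that carries the PCR identifier column and a logical "to_remove" column, as described in Details. The helper apply_flags(), the toy data and the use of "CompleteAmplificationID" as identifier column are hypothetical and only illustrate the described behaviour; this is not the package's internal code.

```r
# Hypothetical helper: keep or return flagged rows depending on `negate`.
apply_flags <- function(metadata, test_output, pcr_id_col, negate = FALSE) {
  # Identifiers of the rows flagged for removal by the test
  flagged_ids <- test_output[[pcr_id_col]][test_output$to_remove]
  is_flagged <- metadata[[pcr_id_col]] %in% flagged_ids
  # negate = FALSE -> rows that were NOT flagged; negate = TRUE -> flagged rows
  keep <- if (negate) is_flagged else !is_flagged
  metadata[keep, , drop = FALSE]
}

# Toy metadata and toy test output (assumed values, for illustration only)
toy_meta <- data.frame(
  CompleteAmplificationID = c("S1", "S2", "S3"),
  BARCODE_MUX = c(645026, 18300, 451519),
  stringsAsFactors = FALSE
)
toy_output <- data.frame(
  CompleteAmplificationID = c("S1", "S2", "S3"),
  to_remove = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

kept <- apply_flags(toy_meta, toy_output, "CompleteAmplificationID")
removed <- apply_flags(toy_meta, toy_output, "CompleteAmplificationID",
  negate = TRUE
)
```

Here kept contains the rows for S1 and S3, while removed contains only the row for S2.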
Details
Modular structure
The outlier filtering functions are structured in a modular fashion. There are two kinds of functions:
- Outlier tests: functions that perform some kind of calculation based on inputs and flag metadata
- Outlier filter: a function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers
This function acts as the filter. It can either take one or more outlier tests as functions and call them through the argument outlier_test, or it can directly take the outputs produced by individual tests in the argument outlier_test_outputs. If both are provided, the second one has priority. The second method offers a bit more freedom, since single tests can be run independently and intermediate results saved and examined in more detail. If more than one test is to be performed, the argument combination_logic tells the function how to combine the flags: you can specify one logical operator or more than one, provided the number of operators is compatible with the number of tests.
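To make the operators concrete, the following base R sketch combines the "to_remove" flags of two tests using the standard truth tables of the six operators. combine_flags() and the flag vectors are hypothetical; how the package applies several operators across more than two tests is not specified here, so the sketch covers only the two-test case.

```r
# Hypothetical helper: combine two logical flag vectors with a named operator.
combine_flags <- function(a, b, op = "AND") {
  switch(op,
    AND = a & b,
    OR = a | b,
    XOR = xor(a, b),
    NAND = !(a & b),
    NOR = !(a | b),
    XNOR = !xor(a, b),
    stop("Unknown operator: ", op) # default branch for unsupported names
  )
}

# Flags produced by two imaginary tests on the same four samples
flags_test1 <- c(TRUE, TRUE, FALSE, FALSE)
flags_test2 <- c(TRUE, FALSE, TRUE, FALSE)

and_flags <- combine_flags(flags_test1, flags_test2, "AND") # TRUE FALSE FALSE FALSE
xor_flags <- combine_flags(flags_test1, flags_test2, "XOR") # FALSE TRUE TRUE FALSE
nor_flags <- combine_flags(flags_test1, flags_test2, "NOR") # FALSE FALSE FALSE TRUE
```

With "AND", a sample is removed only if both tests flag it; with "OR", a single flag is enough.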
Writing custom outlier tests
You are free to provide your own functions as outlier tests. For this purpose, the functions provided must respect these guidelines:
- Must take as input the whole metadata df
- Must return a df containing AT LEAST the pcr_id_col and a logical column "to_remove" that contains the flag
- The pcr_id_col must contain all the values originally present in the metadata df
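A minimal sketch of a custom test that follows these guidelines is shown below. The function low_reads_test(), its min_reads threshold and the toy data are hypothetical; the column names "CompleteAmplificationID" and "BARCODE_MUX" are taken from the example data on this page. Whether outlier_filter() forwards pcr_id_col to custom tests is not stated here, so the sketch gives that argument a default.

```r
# Hypothetical custom outlier test: flag samples with too few reads.
low_reads_test <- function(metadata,
                           pcr_id_col = "CompleteAmplificationID",
                           min_reads = 1000) {
  # Flag rows whose BARCODE_MUX is known and below the (assumed) threshold
  flag <- !is.na(metadata$BARCODE_MUX) & metadata$BARCODE_MUX < min_reads
  # Return every original identifier together with the logical flag
  out <- data.frame(metadata[[pcr_id_col]], flag, stringsAsFactors = FALSE)
  names(out) <- c(pcr_id_col, "to_remove")
  out
}

# Toy metadata (assumed values, for illustration only)
toy_meta <- data.frame(
  CompleteAmplificationID = c("S1", "S2", "S3"),
  BARCODE_MUX = c(645026, 18300, 500),
  stringsAsFactors = FALSE
)
res <- low_reads_test(toy_meta)
```

The result has one row per identifier in the input, and only S3 is flagged. A function of this shape could then be supplied through outlier_test, with extra arguments such as min_reads passed via the dots.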
See also
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
Examples
data("association_file", package = "ISAnalytics")
filtered_af <- outlier_filter(association_file,
  key = "BARCODE_MUX",
  report_path = NULL
)
#> Removing NAs from data...
#> Log2 transformation, removing values <= 0
head(filtered_af)
#> ProjectID FUSIONID PoolID TagSequence SubjectID VectorType VectorID
#> 1: PJ01 ET#382.46 POOL01 LTR75LC38 PT001 lenti GLOBE
#> 2: PJ01 ET#381.40 POOL01 LTR53LC32 PT001 lenti GLOBE
#> 3: PJ01 ET#381.9 POOL01 LTR83LC66 PT001 lenti GLOBE
#> 4: PJ01 ET#381.71 POOL01 LTR27LC94 PT001 lenti GLOBE
#> 5: PJ01 ET#381.2 POOL01 LTR69LC52 PT001 lenti GLOBE
#> 6: PJ01 ET#382.28 POOL01 LTR37LC2 PT001 lenti GLOBE
#> ExperimentID Tissue TimePoint DNAFragmentation PCRMethod TagIDextended
#> 1: <NA> PB 0060 SONIC SLiM LTR75LC38
#> 2: <NA> BM 0180 SONIC SLiM LTR53LC32
#> 3: <NA> BM 0180 SONIC SLiM LTR83LC66
#> 4: <NA> BM 0180 SONIC SLiM LTR27LC94
#> 5: <NA> PB 0180 SONIC SLiM LTR69LC52
#> 6: <NA> BM 0060 SONIC SLiM LTR37LC2
#> Keywords CellMarker TagID NGSProvider NGSTechnology ConverrtedFilesDir
#> 1: <NA> MNC LTR75.LC38 <NA> HiSeq <NA>
#> 2: <NA> MNC LTR53.LC32 <NA> HiSeq <NA>
#> 3: <NA> MNC LTR83.LC66 <NA> HiSeq <NA>
#> 4: <NA> MNC LTR27.LC94 <NA> HiSeq <NA>
#> 5: <NA> MNC LTR69.LC52 <NA> HiSeq <NA>
#> 6: <NA> MNC LTR37.LC2 <NA> HiSeq <NA>
#> ConverrtedFilesName SourceFileFolder SourceFileNameR1 SourceFileNameR2
#> 1: <NA> <NA> <NA> <NA>
#> 2: <NA> <NA> <NA> <NA>
#> 3: <NA> <NA> <NA> <NA>
#> 4: <NA> <NA> <NA> <NA>
#> 5: <NA> <NA> <NA> <NA>
#> 6: <NA> <NA> <NA> <NA>
#> DNAnumber ReplicateNumber DNAextractionDate DNAngUsed LinearPCRID
#> 1: PT001-103 3 2016-03-16 23.184 <NA>
#> 2: PT001-81 2 2016-07-15 181.440 <NA>
#> 3: PT001-81 1 2016-07-15 181.440 <NA>
#> 4: PT001-81 3 2016-07-15 181.440 <NA>
#> 5: PT001-74 1 2016-07-15 23.058 <NA>
#> 6: PT001-107 2 2016-03-16 171.360 <NA>
#> LinearPCRDate SonicationDate LigationDate 1stExpoPCRID 1stExpoPCRDate
#> 1: <NA> 2016-11-02 2016-11-02 ET#380.46 2016-11-02
#> 2: <NA> 2016-11-02 2016-11-02 ET#379.40 2016-11-02
#> 3: <NA> 2016-11-02 2016-11-02 ET#379.9 2016-11-02
#> 4: <NA> 2016-11-02 2016-11-02 ET#379.71 2016-11-02
#> 5: <NA> 2016-11-02 2016-11-02 ET#379.2 2016-11-02
#> 6: <NA> 2016-11-02 2016-11-02 ET#380.28 2016-11-02
#> 2ndExpoID 2ndExpoDate FusionPrimerPCRID FusionPrimerPCRDate PoolDate
#> 1: <NA> <NA> ET#382.46 2016-11-03 2016-11-07
#> 2: <NA> <NA> ET#381.40 2016-11-03 2016-11-07
#> 3: <NA> <NA> ET#381.9 2016-11-03 2016-11-07
#> 4: <NA> <NA> ET#381.71 2016-11-03 2016-11-07
#> 5: <NA> <NA> ET#381.2 2016-11-03 2016-11-07
#> 6: <NA> <NA> ET#382.28 2016-11-03 2016-11-07
#> SequencingDate VCN Genome SequencingRound Genotype TestGroup MOI
#> 1: 2016-11-15 0.30 hg19 1 <NA> <NA> <NA>
#> 2: 2016-11-15 0.27 hg19 1 <NA> <NA> <NA>
#> 3: 2016-11-15 0.27 hg19 1 <NA> <NA> <NA>
#> 4: 2016-11-15 0.27 hg19 1 <NA> <NA> <NA>
#> 5: 2016-11-15 0.24 hg19 1 <NA> <NA> <NA>
#> 6: 2016-11-15 0.42 hg19 1 <NA> <NA> <NA>
#> Engraftment Transduction Notes AddedField1 AddedField2 AddedField3
#> 1: NA NA <NA> <NA> <NA> <NA>
#> 2: NA NA <NA> <NA> <NA> <NA>
#> 3: NA NA <NA> <NA> <NA> <NA>
#> 4: NA NA <NA> <NA> <NA> <NA>
#> 5: NA NA <NA> <NA> <NA> <NA>
#> 6: NA NA <NA> <NA> <NA> <NA>
#> AddedField4 concatenatePoolIDSeqRun AddedField6_RelativeBloodPercentage
#> 1: <NA> POOL01-1 <NA>
#> 2: <NA> POOL01-1 <NA>
#> 3: <NA> POOL01-1 <NA>
#> 4: <NA> POOL01-1 <NA>
#> 5: <NA> POOL01-1 <NA>
#> 6: <NA> POOL01-1 <NA>
#> AddedField7_PurityTestFeasibility AddedField8_FacsSeparationPurity Kapa
#> 1: NA NA NA
#> 2: NA NA NA
#> 3: NA NA NA
#> 4: NA NA NA
#> 5: NA NA NA
#> 6: NA NA NA
#> ulForPool
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> CompleteAmplificationID
#> 1: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC
#> 2: PJ01_POOL01_LTR53LC32_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC
#> 3: PJ01_POOL01_LTR83LC66_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC
#> 4: PJ01_POOL01_LTR27LC94_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC
#> 5: PJ01_POOL01_LTR69LC52_PT001_PT001-74_lenti_GLOBE_PB_1_SLiM_0180_MNC
#> 6: PJ01_POOL01_LTR37LC2_PT001_PT001-107_lenti_GLOBE_BM_1_SLiM_0060_MNC
#> UniqueID StudyTestID StudyTestGroup MouseID Tigroup Tisource
#> 1: ID00000000000000007433 <NA> NA NA <NA> <NA>
#> 2: ID00000000000000007340 <NA> NA NA <NA> <NA>
#> 3: ID00000000000000007310 <NA> NA NA <NA> <NA>
#> 4: ID00000000000000007370 <NA> NA NA <NA> <NA>
#> 5: ID00000000000000007303 <NA> NA NA <NA> <NA>
#> 6: ID00000000000000007417 <NA> NA NA <NA> <NA>
#> PathToFolderProjectID SamplesNameCheck TimepointDays TimepointMonths
#> 1: /PJ01 <NA> 0060 02
#> 2: /PJ01 <NA> 0180 06
#> 3: /PJ01 <NA> 0180 06
#> 4: /PJ01 <NA> 0180 06
#> 5: /PJ01 <NA> 0180 06
#> 6: /PJ01 <NA> 0060 02
#> TimepointYears ng DNA corrected RUN_NAME PHIX_MAPPING
#> 1: 01 23.18 PJ01|POOL01-1 43586699
#> 2: 01 181.44 PJ01|POOL01-1 43586699
#> 3: 01 181.44 PJ01|POOL01-1 43586699
#> 4: 01 181.44 PJ01|POOL01-1 43586699
#> 5: 01 23.06 PJ01|POOL01-1 43586699
#> 6: 01 171.36 PJ01|POOL01-1 43586699
#> PLASMID_MAPPED_BYPOOL BARCODE_MUX LTR_IDENTIFIED TRIMMING_FINAL_LTRLC
#> 1: 2256176 645026 645026 630965
#> 2: 2256176 652208 652177 649044
#> 3: 2256176 451519 451512 449669
#> 4: 2256176 426500 426499 425666
#> 5: 2256176 18300 18300 18290
#> 6: 2256176 729327 729327 727219
#> LV_MAPPED BWA_MAPPED_OVERALL ISS_MAPPED_OVERALL RAW_READS QUALITY_PASSED
#> 1: 211757 402477 219452 NA NA
#> 2: 303300 322086 222646 NA NA
#> 3: 204810 227275 149385 NA NA
#> 4: 185752 223915 143283 NA NA
#> 5: 6962 10487 5907 NA NA
#> 6: 318653 369117 235640 NA NA
#> ISS_MAPPED_PP
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA