A statistical approach for validating the significance of common insertion sites (CIS), based on comparing the integration frequency at the CIS gene with that of the other genes in the surrounding genomic regions. For more details, please refer to this paper: https://ashpublications.org/blood/article/117/20/5332/21206/Lentiviral-vector-common-integration-sites-in
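As a rough illustration of the kind of outlier test the function name refers to, the sketch below implements a textbook one-sided Grubbs test in base R. It is not the package's implementation: the function name grubbs_max_test and the frequency values are hypothetical, and CIS_grubbs() may differ in how frequencies are normalized and how neighbouring genes are selected.

```r
# Generic one-sided Grubbs test for a single high outlier (illustrative only,
# NOT the code used internally by CIS_grubbs()).
grubbs_max_test <- function(values, alpha = 0.05) {
  n <- length(values)
  stopifnot(n >= 3, sd(values) > 0)
  # Test statistic: distance of the maximum from the mean, in SD units
  g <- (max(values) - mean(values)) / sd(values)
  # Critical value derived from the t distribution (one-sided, level alpha / n)
  t_crit <- qt(1 - alpha / n, df = n - 2)
  g_crit <- ((n - 1) / sqrt(n)) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(statistic = g, critical = g_crit, is_outlier = g > g_crit)
}

# Hypothetical integration frequencies: one candidate gene and six neighbours
freqs <- c(candidate = 9.5, g1 = 1.1, g2 = 0.9, g3 = 1.3,
           g4 = 1.0, g5 = 1.2, g6 = 0.8)
res <- grubbs_max_test(freqs, alpha = 0.05)
res$is_outlier
```

Here the alpha argument plays the same role as threshold_alpha: a smaller value makes it harder for a gene to be flagged.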
Usage
CIS_grubbs(
x,
genomic_annotation_file = "hg19",
grubbs_flanking_gene_bp = 1e+05,
threshold_alpha = 0.05,
by = NULL,
return_missing_as_df = TRUE,
results_as_list = TRUE
)
Arguments
- x
An integration matrix; it must include the mandatory_IS_vars() columns and the annotation_IS_vars() columns.
- genomic_annotation_file
Database file for gene annotation, see details.
- grubbs_flanking_gene_bp
Number of base pairs flanking a gene
- threshold_alpha
Significance threshold
- by
Either NULL or a character vector of column names. If not NULL, the function performs the calculations for each group and returns a list of data frames with the results. For example, with by = "SubjectID", CIS is computed for each distinct SubjectID found in the table (the "SubjectID" column must be included in the input data frame).
- return_missing_as_df
Whether to return, as a data frame, the genes that are present in the input data frame but not in the refgenes table.
- results_as_list
If TRUE, return the group computations as a named list; otherwise return a single data frame with an additional column containing the group ID.
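The two output shapes controlled by by and results_as_list can be pictured with a small base R sketch. It only mimics the grouping semantics described above: the toy table, its values, and the per_group helper (a simple count of integration sites per gene) are hypothetical stand-ins, not the computation CIS_grubbs() performs.

```r
# Toy integration table; the values are made up for illustration
toy <- data.frame(
  chr = c("1", "1", "2", "2", "3"),
  integration_locus = c(100, 250, 300, 450, 500),
  strand = c("+", "-", "+", "+", "-"),
  GeneName = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C"),
  GeneStrand = c("+", "+", "-", "-", "+"),
  SubjectID = c("PT001", "PT001", "PT001", "PT002", "PT002"),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder for the per-group computation:
# counts integration sites per gene
per_group <- function(df) {
  counts <- table(df$GeneName)
  data.frame(GeneName = names(counts), n_IS = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Analogue of results_as_list = TRUE: one named element per group
as_list <- lapply(split(toy, toy$SubjectID), per_group)

# Analogue of results_as_list = FALSE: a single data frame with a group id column
as_single_df <- do.call(rbind, lapply(names(as_list), function(id) {
  data.frame(SubjectID = id, as_list[[id]], stringsAsFactors = FALSE)
}))
```

The list form is convenient when each group is inspected separately, while the single data frame is easier to filter or plot across groups.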
Details
Genomic annotation file
A data frame containing gene annotations for the specific genome. From version 1.5.4, the argument genomic_annotation_file accepts only data frames or package-provided defaults. The user is responsible for importing the appropriate tabular files if customization is needed. The annotations for the human genome (hg19) and murine genome (mm9) are already included in this package: to use one of them, set the argument genomic_annotation_file to either "hg19" or "mm9".
If the analysis is performed on another genome, a custom annotation must be supplied that follows the UCSC Genome Browser format, meaning its headers should include:
name2, chrom, strand, min_txStart, max_txEnd, minmax_TxLen, average_TxLen, name, min_cdsStart, max_cdsEnd, minmax_CdsLen, average_CdsLen
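Before passing a custom annotation, it can help to confirm that all the listed headers are present. The sketch below does this in base R; the custom_refgenes data frame and every value in it are hypothetical placeholders, and the column types the package expects are not stated here, so treat them as assumptions.

```r
# Headers listed above for a custom annotation data frame
required_cols <- c(
  "name2", "chrom", "strand", "min_txStart", "max_txEnd", "minmax_TxLen",
  "average_TxLen", "name", "min_cdsStart", "max_cdsEnd", "minmax_CdsLen",
  "average_CdsLen"
)

# Hypothetical two-gene annotation table (all values are placeholders)
custom_refgenes <- data.frame(
  name2 = c("GENE_A", "GENE_B"),
  chrom = c("chr1", "chr2"),
  strand = c("+", "-"),
  min_txStart = c(1000, 5000),
  max_txEnd = c(2000, 9000),
  minmax_TxLen = c(1000, 4000),
  average_TxLen = c(950, 3800),
  name = c("TX_A", "TX_B"),
  min_cdsStart = c(1100, 5200),
  max_cdsEnd = c(1900, 8800),
  minmax_CdsLen = c(800, 3600),
  average_CdsLen = c(780, 3500),
  stringsAsFactors = FALSE
)

# Stop early with a readable message if any header is absent
missing_cols <- setdiff(required_cols, colnames(custom_refgenes))
if (length(missing_cols) > 0) {
  stop("Annotation is missing columns: ", paste(missing_cols, collapse = ", "))
}

# With ISAnalytics installed, the checked data frame could then be supplied as:
# cis <- CIS_grubbs(x, genomic_annotation_file = custom_refgenes)
```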
Required tags
The function will explicitly check for the presence of these tags:
chromosome
locus
is_strand
gene_symbol
gene_strand
See also
Other Analysis functions: HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
Examples
data("integration_matrices", package = "ISAnalytics")
cis <- CIS_grubbs(integration_matrices)
#> Warning: Warning: missing genes in refgenes table
#> ℹ A total of 5 genes were found in the input data but not in the refgene table. This may be caused by a mismatch in the annotation phase of the matrix. Here is a summary:
#> GeneName GeneStrand chr
#> 1: CRELD2 - 15
#> 2: HTR4 + 6
#> 3: UBE2D2 + 16
#> 4: LINC01133 + 19
#> 5: PLEKHG4B - 14
#> ℹ NOTE: missing genes will be removed from the final output! Review results carefully
#> ℹ A total of 25 IS will be removed because of missing genes ( 1.48 % of total IS in input)
cis
#> $missing_genes
#> GeneName GeneStrand chr
#> 1: CRELD2 - 15
#> 2: HTR4 + 6
#> 3: UBE2D2 + 16
#> 4: LINC01133 + 19
#> 5: PLEKHG4B - 14
#>
#> $missing_is
#> $missing_is$absolute
#> [1] 25
#>
#> $missing_is$perc
#> [1] 1.48
#>
#>
#> $cis
#> # A tibble: 501 × 37
#> GeneName GeneStrand chr n mean sd median trimmed mad min
#> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ABHD16A - 6 3 31654876 0 3.17e7 3.17e7 0 3.17e7
#> 2 ABHD2 + 15 3 89642998 0 8.96e7 8.96e7 0 8.96e7
#> 3 ACAP2 - 3 2 195107772 0 1.95e8 1.95e8 0 1.95e8
#> 4 ACOX1 - 17 10 73960093. 8844. 7.40e7 7.40e7 4236. 7.40e7
#> 5 ACSM3 + 16 4 20777188 0 2.08e7 2.08e7 0 2.08e7
#> 6 ADD1 + 4 8 2865706. 6694. 2.87e6 2.87e6 8804. 2.86e6
#> 7 ADGRA3 - 4 2 22505019 0 2.25e7 2.25e7 0 2.25e7
#> 8 ADGRB3 + 6 4 69397949 0 6.94e7 6.94e7 0 6.94e7
#> 9 ADGRG3 + 16 3 57716345 0 5.77e7 5.77e7 0 5.77e7
#> 10 AHR + 7 2 17227687 0 1.72e7 1.72e7 0 1.72e7
#> # ℹ 491 more rows
#> # ℹ 27 more variables: max <dbl>, range <dbl>, skew <dbl>, kurtosis <dbl>,
#> # n_IS_perGene <int>, min_bp_integration_locus <dbl>,
#> # max_bp_integration_locus <dbl>, IS_span_bp <dbl>,
#> # avg_bp_integration_locus <dbl>, median_bp_integration_locus <dbl>,
#> # distinct_orientations <int>, average_TxLen <dbl>,
#> # raw_gene_integration_frequency <dbl>, …
#>