Standardize a SingleCellExperiment
Helen Lindsay
2022-07-11
Source:vignettes/renameSingleCellExp.Rmd
This short vignette shows how to standardize the names of the antibody-derived tags in a CITE-seq experiment in SingleCellExperiment format.
Example 1: Kotliarov PBMC
To demonstrate, we will use the “Kotliarov” dataset available in the package “scRNAseq”.
Load the Kotliarov CITE-seq data
library("scRNAseq")
library("AbNames")
kotliarov <- KotliarovPBMCData(mode = "adt")
head(rownames(kotliarov))
## [1] "AnnexinV_PROT" "BTLA_PROT" "CD117_PROT" "CD123_PROT"
## [5] "CD13_PROT" "CD133_PROT"
Create a data frame to allow matching to the CITEseq data set
To directly match to the CITEseq data set, some formatting of the antigen names may be required. Here we remove the "_PROT" suffix. An alternative is to first map antigen names to the gene_aliases data set and then match antigens and identifiers to the CITEseq data set. However, it is still a good idea to remove prefixes and suffixes to avoid false matches. For example, “PROT” is an alias of the gene SLC6A7 and would match if not removed.
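As a minimal sketch of this clean-up step, the snippet below strips the suffix from a small made-up vector using base R only. The vector `tags` is hypothetical, and anchoring the pattern with `$` is an assumption made here so that only a trailing "_PROT" is removed.

```r
# Hypothetical tag names, for illustration only
tags <- c("CD3_PROT", "Mouse IgG2bkIsotype_PROT", "PROTEIN_C_PROT")

# Remove a trailing "_PROT", optionally preceded by a space.
# "$" anchors the match to the end of the string, so "PROT" elsewhere
# in a name is left untouched.
antigen <- sub(" ?_PROT$", "", tags)
print(antigen)
```

Anchoring is a small safety measure: a name that happens to contain "PROT" elsewhere is not altered.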
You can include information such as catalogue number and antibody clone to assist matching if it is available. When matching to the CITEseq data set, the names of the columns to use for matching must equal those in the CITEseq data set. For example, below we match using only the antigen name, using the column “Antigen”. By default, columns “Clone”, “Antigen”, and “Cat_Number” (catalogue number) are used for matching if available.
# Create a data.frame for matching to the CITEseq data set
# Column "Antigen" will be used for matching
df <- data.frame(Original = rownames(kotliarov),
# Remove _PROT (optionally with a space before)
Antigen = gsub(" ?_PROT", "", rownames(kotliarov)))
head(df)
## Original Antigen
## 1 AnnexinV_PROT AnnexinV
## 2 BTLA_PROT BTLA
## 3 CD117_PROT CD117
## 4 CD123_PROT CD123
## 5 CD13_PROT CD13
## 6 CD133_PROT CD133
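If clone and catalogue number are available, the columns need the names used in the CITEseq data set before matching. The following is a sketch in base R, assuming a hypothetical panel table whose columns are called `marker`, `clone` and `catalog`; these names, and storing the catalogue number as character, are assumptions for illustration.

```r
# Hypothetical panel table with non-standard column names
panel <- data.frame(
  marker  = c("TCRb", "CD3"),
  clone   = c("IP26", NA),
  catalog = c("306737", NA),
  stringsAsFactors = FALSE
)

# Map the hypothetical names onto the column names used for matching
lookup <- c(marker = "Antigen", clone = "Clone", catalog = "Cat_Number")
colnames(panel) <- unname(lookup[colnames(panel)])

# Check which of the default matching columns are now present
present <- intersect(c("Antigen", "Clone", "Cat_Number"), colnames(panel))
print(present)
```

The renamed data frame could then be passed to matchToCiteseq in the same way as the single-column data frame above.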
Match to the CITEseq data set and review suggested consensus names
It is a good idea to review the matches, because an error in, for example, a catalogue number will lead to incorrect matches.
The function matchToCiteseq will report which columns are used for matching. Note that unless specified otherwise, the entries of the CITEseq data set are matched using the default columns even if these do not exist in the new data set.
df <- matchToCiteseq(df)
## Matching new data to citeseq data using columns:
## Antigen
## Grouping data using columns:
## Antigen, Cat_Number, Clone, ALT_ID
## # A tibble: 6 × 4
## Antigen_std Original Antigen n_matched
## <chr> <chr> <chr> <int>
## 1 AnnexinV AnnexinV_PROT AnnexinV 1
## 2 CD272 BTLA_PROT BTLA 12
## 3 CD117 CD117_PROT CD117 18
## 4 CD123 CD123_PROT CD123 24
## 5 CD13 CD13_PROT CD13 9
## 6 CD133 CD133_PROT CD133 15
In general, the names of the original experiment matched the most commonly used name in the CITEseq data set, but we can see that some have different “standard” names.
df[! df$Antigen == df$Antigen_std, ]
## # A tibble: 2 × 4
## Antigen_std Original Antigen n_matched
## <chr> <chr> <chr> <int>
## 1 CD272 BTLA_PROT BTLA 12
## 2 HLA-A,B,C HLA-ABC_PROT HLA-ABC 15
By looking at the n_matched column, we can see how many entries in the CITEseq data set (including the new query data) could be matched to each antigen. Here we see that only AnnexinV and the isotype controls could not be matched by name alone: for these, n_matched is 1, meaning the only match came from the Kotliarov data set itself.
## # A tibble: 10 × 4
## Antigen_std Original Antigen n_mat…¹
## <chr> <chr> <chr> <int>
## 1 AnnexinV AnnexinV_PROT AnnexinV 1
## 2 MouseIgG1kappaisotype MouseIgG1kappaisotype_PROT MouseIgG1kappaiso… 1
## 3 MouseIgG2akappaisotype MouseIgG2akappaisotype_PROT MouseIgG2akappais… 1
## 4 Mouse IgG2bkIsotype Mouse IgG2bkIsotype_PROT Mouse IgG2bkIsoty… 1
## 5 RatIgG2bkIsotype RatIgG2bkIsotype_PROT RatIgG2bkIsotype 1
## 6 CD13 CD13_PROT CD13 9
## 7 CX3CR1 CX3CR1_PROT CX3CR1 9
## 8 IgA IgA_PROT IgA 10
## 9 CD18 CD18_PROT CD18 11
## 10 CD70 CD70_PROT CD70 11
## # … with abbreviated variable name ¹n_matched
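The table above appears to be ordered by n_matched, but the command that produced it is not shown. One way to get such a view is sketched below in base R, using a toy data frame `matched` whose rows are copied from the output above. The object names are hypothetical.

```r
# A few rows copied from the output above (toy data, not the full result)
matched <- data.frame(
  Antigen_std = c("CD123", "AnnexinV", "CD13", "CD272"),
  n_matched   = c(24L, 1L, 9L, 12L),
  stringsAsFactors = FALSE
)

# Order by ascending number of matched entries, least supported first
by_support <- matched[order(matched$n_matched), ]
print(by_support)

# Entries whose only match is the query data set itself
only_self <- by_support$Antigen_std[by_support$n_matched == 1]
print(only_self)
```

Names with n_matched equal to 1 are the ones most worth checking by hand, since no other study supports them.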
Rename single cell experiment
# To rename the SingleCellExperiment, we need a named vector of the new names,
# with names being the original names of the SingleCellExperiment.
# "structure" is a function that allows us to create a named vector in one step.
# The values are the standardized names in the column "Antigen_std".
new_nms <- structure(df$Antigen_std, names = df$Original)
print(head(new_nms))
## AnnexinV_PROT BTLA_PROT CD117_PROT CD123_PROT CD13_PROT
## "AnnexinV" "CD272" "CD117" "CD123" "CD13"
## CD133_PROT
## "CD133"
# Look up each current row name so the result does not depend on row order
rownames(kotliarov) <- unname(new_nms[rownames(kotliarov)])
head(rownames(kotliarov))
## [1] "AnnexinV" "CD272" "CD117" "CD123"
## [5] "CD13" "CD133"
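The renaming idea can be tried on a toy matrix without any Bioconductor objects. This is only a sketch: the matrix `counts` is made up, and the lookup values are taken from the matching results shown earlier. Indexing the lookup by the current row names, rather than assigning it positionally, is a choice made here so that the result does not depend on the order of the lookup.

```r
# Toy count matrix standing in for the ADT assay
counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("BTLA_PROT", "CD117_PROT", "HLA-ABC_PROT"),
                  c("cell1", "cell2"))
)

# Named lookup: names are the original tags, values the standardized names.
# It is deliberately listed in a different order than the matrix rows.
new_nms <- c("HLA-ABC_PROT" = "HLA-A,B,C",
             "BTLA_PROT"    = "CD272",
             "CD117_PROT"   = "CD117")

# Guard against missing or duplicated names before renaming
stopifnot(all(rownames(counts) %in% names(new_nms)))
stopifnot(!anyDuplicated(new_nms))

# Look up each row by its current name
rownames(counts) <- unname(new_nms[rownames(counts)])
print(rownames(counts))
```

The two checks fail early if a tag has no standardized name or if two tags would collapse onto the same name, which would otherwise produce NA or duplicated row names.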
Example 2: CiteFuse example data
In this example, the CITE-seq data are stored as an altExp. The procedure for standardising the antibody names is the same.
Matching directly to the CITEseq data set
# Load CiteFuse example data
library(CiteFuse)
data("CITEseq_example", package = "CiteFuse")
sce_citeseq <- preprocessing(CITEseq_example)
rownames(altExp(sce_citeseq, "ADT"))
## [1] "B220 (CD45R)" "B7-H1 (PD-L1)" "C-kit (CD117)"
## [4] "CCR7" "CD11b" "CD11c"
## [7] "CD138" "CD14" "CD16"
## [10] "CD19" "CD1a" "CD2"
## [13] "CD223 (lag3)" "CD24" "CD26 (Adenosine)"
## [16] "CD27" "CD28" "CD3"
## [19] "CD34" "CD366 (tim3)" "CD4"
## [22] "CD44" "CD45" "CD45RA"
## [25] "CD45RO" "CD5" "CD56"
## [28] "CD62L" "CD66b" "CD69"
## [31] "CD7" "CD77" "CD8"
## [34] "CTLA4" "EpCAM (CD326)" "HLA-A,B,C"
## [37] "IL7Ralpha (CD127)" "IgG1" "IgG2a"
## [40] "LAMP1" "MHCII (HLA-DR)" "Ox40 (CD134)"
## [43] "PD-1 (CD279)" "PD-L1 (CD274)" "PD1 (CD279)"
## [46] "PECAM (CD31)" "Siglec-8" "TCRb"
## [49] "TCRg"
df <- data.frame(Antigen = rownames(altExp(sce_citeseq, "ADT")))
df <- matchToCiteseq(df)
# Print entries that differ from the original
df[! df$Antigen == df$Antigen_std, ]
## # A tibble: 21 × 3
## Antigen_std Antigen n_matched
## <chr> <chr> <int>
## 1 B220 B220 (CD45R) 6
## 2 B7-H1 B7-H1 (PD-L1) 1
## 3 CD117 C-kit (CD117) 18
## 4 CD197 CCR7 26
## 5 CD223 CD223 (lag3) 18
## 6 CD26 CD26 (Adenosine) 17
## 7 CD366 CD366 (tim3) 18
## 8 CD152 CTLA4 18
## 9 CD326 EpCAM (CD326) 11
## 10 CD127 IL7Ralpha (CD127) 29
## # … with 11 more rows
## # A tibble: 10 × 3
## Antigen_std Antigen n_matched
## <chr> <chr> <int>
## 1 B7-H1 B7-H1 (PD-L1) 1
## 2 CD77 CD77 3
## 3 B220 B220 (CD45R) 6
## 4 Siglec-8 Siglec-8 6
## 5 CD326 EpCAM (CD326) 11
## 6 CD1a CD1a 12
## 7 CD107a LAMP1 12
## 8 CD66b CD66b 14
## 9 CD134 Ox40 (CD134) 14
## 10 TCR a/b TCRb 14
Here we see that the consensus name for “IgG1” is “Mouse IgG1, kappa isotype Ctrl”. This is because when other studies refer to “IgG1”, we can tell from the antibody clone ID that they are referring to an isotype control with unknown specificity. Without further information, we cannot tell whether the antibody in the CiteFuse example data is an isotype control or an anti-human secondary antibody.
Similarly for TCRb and TCRg, we can tell by matching antigens by clone ID that these are more commonly referred to as TCR a/b (alpha/beta) and TCR g/d (gamma/delta) respectively.
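Once the consensus names have been reviewed, the rows of the alternative experiment can be renamed in place. The sketch below uses a small simulated SingleCellExperiment rather than the CiteFuse data. The object names are hypothetical, and the lookup contains only three of the name pairs from the output above.

```r
library(SingleCellExperiment)

set.seed(1)
cells <- paste0("cell", 1:4)

# Toy object: two genes in the main experiment, three tags in altExp "ADT"
rna <- matrix(rpois(8, 5), nrow = 2,
              dimnames = list(c("geneA", "geneB"), cells))
adt <- matrix(rpois(12, 5), nrow = 3,
              dimnames = list(c("C-kit (CD117)", "CCR7", "CTLA4"), cells))
sce <- SingleCellExperiment(assays = list(counts = rna))
altExp(sce, "ADT") <- SummarizedExperiment(assays = list(counts = adt))

# Lookup from original to standardized names (pairs from the output above)
std <- c("C-kit (CD117)" = "CD117", "CCR7" = "CD197", "CTLA4" = "CD152")

# Rename the altExp rows by looking up their current names
old <- rownames(altExp(sce, "ADT"))
stopifnot(all(old %in% names(std)))
rownames(altExp(sce, "ADT")) <- unname(std[old])
print(rownames(altExp(sce, "ADT")))
```

The main experiment is left untouched; only the row names of the "ADT" alternative experiment change.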
Check matching entries in CITEseq data set
Suppose, as in the above example, we are not sure if a consensus name is correct. We may wish to examine the matching entries in the citeseq data set. One way to do this is to load the citeseq data set and check the matching entries using the function filter_by_union.
# We load dplyr to simplify data.frame manipulations.
library(dplyr)
# Load citeseq data set
data(citeseq)
tcrb <- citeseq %>%
# We will first select entries that equal "TCRb"
dplyr::filter(Antigen == "TCRb") %>%
# Then subset to just the columns we want to use for matching
dplyr::select(Antigen, Clone, Cat_Number, ALT_ID)
# Then we will find entries in citeseq that match any of these columns.
# filter_by_union is a function that returns rows where the value of any
# column occurs in a reference data.frame
citeseq %>%
filter_by_union(tcrb) %>%
# Select just columns of interest for viewing
dplyr::select(Study, Antigen, Clone, Cat_Number, ALT_ID)
## Study Antigen Clone Cat_Number
## 1 Hao_2021 TCR alpha/beta IP26 306737
## 2 Liu_2021 TCR alpha/beta IP26 306743
## 3 Nathan_2021 TCR alpha/beta (TRAC/TRBC) IP26 306737
## 4 Mimitou_2021 TCR a/b IP26 306737
## 5 Qian_2020 TCR a/b IP26 306737
## 6 Stephenson_2021 TCR a/b IP26 306743
## 7 Su_2020 TCR a/b IP26 306743
## 8 LeCoz_2021 TCR a/b IP26 306743
## 9 Wu_2021 TCRab IP26 306737
## 10 PomboAntunes_2021 TCRab IP26 306737
## 11 Buus_2021 TCRab IP26 306743
## 12 Hao_2021 TCRb IP26 <NA>
## 13 Mimitou_2019 TCRb IP26 <NA>
## ALT_ID
## 1 HGNC:12027, HGNC:12155
## 2 HGNC:12027, HGNC:12155
## 3 HGNC:12027, HGNC:12155
## 4 HGNC:12027, HGNC:12155
## 5 HGNC:12027, HGNC:12155
## 6 HGNC:12027, HGNC:12155
## 7 HGNC:12027, HGNC:12155
## 8 HGNC:12027, HGNC:12155
## 9 HGNC:12027, HGNC:12155
## 10 HGNC:12027, HGNC:12155
## 11 HGNC:12027, HGNC:12155
## 12 HGNC:12155
## 13 HGNC:12155
From the above result we can confirm that two studies have an antigen named “TCRb”. By matching on antibody clone, we see that it is more commonly referred to as TCRab or TCR alpha/beta.
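To make the "match on any column" idea concrete, here is a sketch in base R of a union-style filter on toy data. It is an illustration of the concept only, not the implementation of filter_by_union in AbNames. The helper `keep_any_match` and both toy tables are hypothetical, and ignoring missing values is an assumption made here.

```r
# Toy reference: the values we want to find in any column
ref <- data.frame(Antigen = "TCRb", Clone = "IP26",
                  Cat_Number = NA_character_, stringsAsFactors = FALSE)

# Toy table to search (study labels and clone "X1" are made up)
tab <- data.frame(
  Study      = c("A", "B", "C", "D"),
  Antigen    = c("TCR a/b", "TCRb", "CD3", "CD4"),
  Clone      = c("IP26", NA, "X1", NA),
  Cat_Number = c("306737", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep rows where at least one shared column has a value found in `ref`.
# Missing values never count as a match.
keep_any_match <- function(df, ref, cols = intersect(names(df), names(ref))) {
  hits <- lapply(cols, function(cn) !is.na(df[[cn]]) & df[[cn]] %in% ref[[cn]])
  df[Reduce(`|`, hits), , drop = FALSE]
}

res <- keep_any_match(tab, ref)
print(res)
```

Row A is kept because its clone matches even though its antigen name differs, which is the situation that reveals alternative names for the same antibody.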
A note about the ALT_IDs
The column ALT_ID in the data set gene_aliases is an ID that allows the antigens in the citeseq data set to be distinguished. If the HGNC ID is sufficient, then the ALT_ID will be the HGNC ID. If not, it may be an ID from e.g. the protein ontology or the National Cancer Institute thesaurus, or if no stable ID was found, the antibody and clone name.
Session Info
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.0.10 CiteFuse_1.8.0
## [3] AbNames_0.2.0 scRNAseq_2.10.0
## [5] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [7] Biobase_2.56.0 GenomicRanges_1.48.0
## [9] GenomeInfoDb_1.32.4 IRanges_2.30.1
## [11] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [13] MatrixGenerics_1.8.1 matrixStats_0.62.0
## [15] BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 tidyselect_1.1.2
## [3] RSQLite_2.2.17 AnnotationDbi_1.58.0
## [5] grid_4.2.1 BiocParallel_1.30.3
## [7] Rtsne_0.16 munsell_0.5.0
## [9] ScaledMatrix_1.4.1 codetools_0.2-18
## [11] ragg_1.2.2 statmod_1.4.37
## [13] scran_1.24.1 withr_2.5.0
## [15] colorspace_2.0-3 filelock_1.0.2
## [17] knitr_1.40 GenomeInfoDbData_1.2.8
## [19] polyclip_1.10-0 bit64_4.0.5
## [21] farver_2.1.1 pheatmap_1.0.12
## [23] rhdf5_2.40.0 rprojroot_2.0.3
## [25] vctrs_0.4.1 generics_0.1.3
## [27] xfun_0.33 BiocFileCache_2.4.0
## [29] randomForest_4.7-1.1 R6_2.5.1
## [31] graphlayouts_0.8.1 rsvd_1.0.5
## [33] locfit_1.5-9.6 AnnotationFilter_1.20.0
## [35] bitops_1.0-7 rhdf5filters_1.8.0
## [37] cachem_1.0.6 DelayedArray_0.22.0
## [39] assertthat_0.2.1 promises_1.2.0.1
## [41] BiocIO_1.6.0 scales_1.2.1
## [43] ggraph_2.0.6 gtable_0.3.1
## [45] beachmat_2.12.0 tidygraph_1.2.2
## [47] ensembldb_2.20.2 rlang_1.0.5
## [49] systemfonts_1.0.4 splines_4.2.1
## [51] rtracklayer_1.56.1 lazyeval_0.2.2
## [53] BiocManager_1.30.18 yaml_2.3.5
## [55] reshape2_1.4.4 GenomicFeatures_1.48.3
## [57] httpuv_1.6.6 tools_4.2.1
## [59] bookdown_0.29 ggplot2_3.3.6
## [61] ellipsis_0.3.2 jquerylib_0.1.4
## [63] RColorBrewer_1.1-3 ggridges_0.5.3
## [65] Rcpp_1.0.9 plyr_1.8.7
## [67] sparseMatrixStats_1.8.0 progress_1.2.2
## [69] zlibbioc_1.42.0 purrr_0.3.4
## [71] RCurl_1.98-1.8 prettyunits_1.1.1
## [73] dbscan_1.1-10 viridis_0.6.2
## [75] cowplot_1.1.1 ggrepel_0.9.1
## [77] cluster_2.1.4 fs_1.5.2
## [79] magrittr_2.0.3 ProtGenerics_1.28.0
## [81] hms_1.1.2 mime_0.12
## [83] evaluate_0.16 xtable_1.8-4
## [85] XML_3.99-0.10 gridExtra_2.3
## [87] compiler_4.2.1 biomaRt_2.52.0
## [89] tibble_3.1.8 crayon_1.5.1
## [91] htmltools_0.5.3 segmented_1.6-0
## [93] later_1.3.0 propr_4.2.6
## [95] tidyr_1.2.1 DBI_1.1.3
## [97] tweenr_2.0.2 ExperimentHub_2.4.0
## [99] dbplyr_2.2.1 MASS_7.3-58.1
## [101] rappdirs_0.3.3 Matrix_1.5-1
## [103] cli_3.4.0 metapod_1.4.0
## [105] parallel_4.2.1 igraph_1.3.4
## [107] pkgconfig_2.0.3 pkgdown_2.0.6
## [109] GenomicAlignments_1.32.1 scuttle_1.6.3
## [111] xml2_1.3.3 bslib_0.4.0
## [113] dqrng_0.3.0 XVector_0.36.0
## [115] stringr_1.4.1 digest_0.6.29
## [117] Biostrings_2.64.1 rmarkdown_2.16
## [119] uwot_0.1.14 edgeR_3.38.4
## [121] DelayedMatrixStats_1.18.0 restfulr_0.0.15
## [123] curl_4.3.2 kernlab_0.9-31
## [125] shiny_1.7.2 Rsamtools_2.12.0
## [127] rjson_0.2.21 lifecycle_1.0.2
## [129] nlme_3.1-159 jsonlite_1.8.0
## [131] Rhdf5lib_1.18.2 BiocNeighbors_1.14.0
## [133] desc_1.4.2 viridisLite_0.4.1
## [135] limma_3.52.3 fansi_1.0.3
## [137] pillar_1.8.1 lattice_0.20-45
## [139] KEGGREST_1.36.3 fastmap_1.1.0
## [141] httr_1.4.4 survival_3.4-0
## [143] interactiveDisplayBase_1.34.0 glue_1.6.2
## [145] png_0.1-7 bluster_1.6.0
## [147] BiocVersion_3.15.2 bit_4.0.4
## [149] ggforce_0.3.4 stringi_1.7.8
## [151] sass_0.4.2 mixtools_1.2.0
## [153] blob_1.2.3 textshaping_0.3.6
## [155] BiocSingular_1.12.0 AnnotationHub_3.4.0
## [157] memoise_2.0.1 irlba_2.3.5