Standardize a SingleCellExperiment

This short vignette shows how to standardize the names of the antibody-derived tags in a CITE-seq experiment in SingleCellExperiment format.

Example 1: Kotliarov PBMC

To demonstrate, we will use the “Kotliarov” dataset available in the package “scRNAseq”.

Load the Kotliarov CITE-seq data

library("scRNAseq")
library("AbNames")

kotliarov <- KotliarovPBMCData(mode = "adt")
head(rownames(kotliarov))

## [1] "AnnexinV_PROT" "BTLA_PROT"     "CD117_PROT"    "CD123_PROT"   
## [5] "CD13_PROT"     "CD133_PROT"

Create a data frame to allow matching to the CITEseq data set

To directly match to the CITEseq data set, some formatting of the antigen names may be required. Here we remove the "_PROT" suffix. An alternative is to first map antigen names to the gene_aliases data set and then match antigens and identifiers to the CITEseq data set. However, it is still a good idea to remove prefixes and suffixes to avoid false matches. For example, “PROT” is an alias of the gene SLC6A7 and would match if not removed.

You can include information such as catalogue number and antibody clone to assist matching if it is available. When matching to the CITEseq data set, the names of the columns to use for matching must equal those in the CITEseq data set. For example, below we match using only the antigen name, using the column “Antigen”. By default, columns “Clone”, “Antigen”, and “Cat_Number” (catalogue number) are used for matching if available.
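When clone or catalogue information is available, the query data frame simply carries extra columns with the default names. The sketch below only builds such a data frame and checks which of the default matching columns it contains; it does not call any AbNames function. The names default_cols, query and usable_cols are introduced here for illustration, and the clone and catalogue number are copied from the TCR a/b entries shown later in this vignette.

```r
# Column names used by default for matching, as listed in the text above
default_cols <- c("Clone", "Antigen", "Cat_Number")

# Hypothetical query: one antibody with full details, one with the name only
query <- data.frame(
    Antigen    = c("TCR a/b", "CD13"),
    Clone      = c("IP26", NA),
    Cat_Number = c("306737", NA)
)

# Columns of the query that could take part in matching
usable_cols <- intersect(default_cols, colnames(query))
print(usable_cols)
```

Whether missing values in Clone or Cat_Number are tolerated during matching is not stated in this vignette, so treat the NA entries as placeholders.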

# Create a data.frame for matching to the CITEseq data set
# Column "Antigen" will be used for matching
df <- data.frame(Original = rownames(kotliarov),
                 # Remove _PROT (optionally with a space before)
                 Antigen = gsub(" ?_PROT", "", rownames(kotliarov)))
head(df)

##        Original  Antigen
## 1 AnnexinV_PROT AnnexinV
## 2     BTLA_PROT     BTLA
## 3    CD117_PROT    CD117
## 4    CD123_PROT    CD123
## 5     CD13_PROT     CD13
## 6    CD133_PROT    CD133
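As a small sanity check on the clean-up step, the sketch below uses base R only, on a few names copied from the Kotliarov output in this vignette. It anchors the pattern to the end of the name (a slightly stricter variant of the pattern used above) and confirms that no "PROT" token survives and that no two antigens collapse to the same name. The variable names are illustrative.

```r
# Names copied from the Kotliarov output shown in this vignette
nms <- c("AnnexinV_PROT", "BTLA_PROT", "CD117_PROT", "Mouse IgG2bkIsotype_PROT")

# "$" anchors the pattern so that only a trailing "_PROT" is removed
stripped <- sub(" ?_PROT$", "", nms)

# No "PROT" token should survive, and no two antigens should collapse
# to the same name after stripping
stopifnot(!any(grepl("PROT", stripped, fixed = TRUE)))
stopifnot(anyDuplicated(stripped) == 0)
print(stripped)
```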

Match to the CITEseq data set and review suggested consensus names

It is a good idea to review the matches, because an error in, e.g., a catalogue number will lead to incorrect matches.

The function matchToCiteseq will report which columns are used for matching. Note that unless specified otherwise, the entries of the CITEseq data set are matched using the default columns even if these do not exist in the new data set.

df <- matchToCiteseq(df)

## Matching new data to citeseq data using columns:
## Antigen

## Grouping data using columns:
## Antigen, Cat_Number, Clone, ALT_ID

print(head(df))

## # A tibble: 6 × 4
##   Antigen_std Original      Antigen  n_matched
##   <chr>       <chr>         <chr>        <int>
## 1 AnnexinV    AnnexinV_PROT AnnexinV         1
## 2 CD272       BTLA_PROT     BTLA            12
## 3 CD117       CD117_PROT    CD117           18
## 4 CD123       CD123_PROT    CD123           24
## 5 CD13        CD13_PROT     CD13             9
## 6 CD133       CD133_PROT    CD133           15

In general, the names of the original experiment matched the most commonly used name in the CITEseq data set, but we can see that some have different “standard” names.

df[df$Antigen != df$Antigen_std, ]

## # A tibble: 2 × 4
##   Antigen_std Original     Antigen n_matched
##   <chr>       <chr>        <chr>       <int>
## 1 CD272       BTLA_PROT    BTLA           12
## 2 HLA-A,B,C   HLA-ABC_PROT HLA-ABC        15

By looking at the n_matched column, we can see how many entries in the CITEseq data set (including the new query data) could be matched to each antigen. Sorting by n_matched shows that only AnnexinV and the isotype controls could not be matched by name alone: each has n_matched equal to 1, meaning that the only match came from the Kotliarov data set itself.

head(df[order(df$n_matched),], 10)

## # A tibble: 10 × 4
##    Antigen_std            Original                    Antigen            n_mat…¹
##    <chr>                  <chr>                       <chr>                <int>
##  1 AnnexinV               AnnexinV_PROT               AnnexinV                 1
##  2 MouseIgG1kappaisotype  MouseIgG1kappaisotype_PROT  MouseIgG1kappaiso…       1
##  3 MouseIgG2akappaisotype MouseIgG2akappaisotype_PROT MouseIgG2akappais…       1
##  4 Mouse IgG2bkIsotype    Mouse IgG2bkIsotype_PROT    Mouse IgG2bkIsoty…       1
##  5 RatIgG2bkIsotype       RatIgG2bkIsotype_PROT       RatIgG2bkIsotype         1
##  6 CD13                   CD13_PROT                   CD13                     9
##  7 CX3CR1                 CX3CR1_PROT                 CX3CR1                   9
##  8 IgA                    IgA_PROT                    IgA                     10
##  9 CD18                   CD18_PROT                   CD18                    11
## 10 CD70                   CD70_PROT                   CD70                    11
## # … with abbreviated variable name ¹n_matched
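The two checks used in this section (names that changed, and antigens matched only to themselves) can be collected programmatically. The sketch below works on a toy data frame whose rows are copied from the tables above, so it runs without AbNames; the object names matched, self_only and renamed are illustrative.

```r
# Toy copy of a few rows from the tables above
matched <- data.frame(
    Antigen_std = c("AnnexinV", "CD272", "CD117", "CD13"),
    Antigen     = c("AnnexinV", "BTLA", "CD117", "CD13"),
    n_matched   = c(1L, 12L, 18L, 9L)
)

# n_matched counts the query entry itself, so 1 means "no other study matched"
self_only <- matched$Antigen[matched$n_matched == 1L]

# Antigens whose suggested name differs from the name supplied
renamed <- matched$Antigen[matched$Antigen != matched$Antigen_std]

print(self_only)
print(renamed)
```

Entries in self_only are the ones most worth checking by hand, since no other study supports the suggested name.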

Rename single cell experiment

# To rename the SingleCellExperiment, we need a named vector of the new names,
# with names being the original names of the SingleCellExperiment.

# "structure" is a function that allows us to create a named vector in one step.
# The values are the standardized names and the names are the original names.
new_nms <- structure(df$Antigen_std, names = df$Original)
print(head(new_nms))

## AnnexinV_PROT     BTLA_PROT    CD117_PROT    CD123_PROT     CD13_PROT 
##    "AnnexinV"       "CD272"       "CD117"       "CD123"        "CD13" 
##    CD133_PROT 
##       "CD133"

kotliarov <- renameADT(kotliarov, new_nms)
head(rownames(kotliarov))

## [1] "AnnexinV" "CD272"    "CD117"    "CD123"    "CD13"     "CD133"
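Before renaming, it can be worth checking that the named vector maps every original name to a distinct new name. The following is a minimal base-R sketch on a toy count matrix, used here as a stand-in for the ADT data; it does not call renameADT, and the object names are illustrative.

```r
# Toy ADT count matrix; a stand-in for the SingleCellExperiment
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("AnnexinV_PROT", "BTLA_PROT", "CD117_PROT"),
                                 c("cell1", "cell2")))

# Named vector: names are the original names, values are the new names
new_nms <- setNames(c("AnnexinV", "CD272", "CD117"), rownames(counts))

# Every row needs a new name, and the new names must stay unique
stopifnot(all(rownames(counts) %in% names(new_nms)))
stopifnot(anyDuplicated(new_nms) == 0)

# Look up by name rather than by position, so row order does not matter
rownames(counts) <- unname(new_nms[rownames(counts)])
print(rownames(counts))
```

Looking names up by the original name, rather than relying on row order, guards against the matched table being returned in a different order from the experiment.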

Example 2: CiteFuse example data

In this example, the CITE-seq data are stored as an altExp. The procedure for standardizing the antibody names is the same.

Matching directly to the CITEseq data set

# Load CiteFuse example data

library(CiteFuse)
data("CITEseq_example", package = "CiteFuse")
sce_citeseq <- preprocessing(CITEseq_example)
rownames(altExp(sce_citeseq, "ADT"))

##  [1] "B220 (CD45R)"      "B7-H1 (PD-L1)"     "C-kit (CD117)"    
##  [4] "CCR7"              "CD11b"             "CD11c"            
##  [7] "CD138"             "CD14"              "CD16"             
## [10] "CD19"              "CD1a"              "CD2"              
## [13] "CD223 (lag3)"      "CD24"              "CD26 (Adenosine)" 
## [16] "CD27"              "CD28"              "CD3"              
## [19] "CD34"              "CD366 (tim3)"      "CD4"              
## [22] "CD44"              "CD45"              "CD45RA"           
## [25] "CD45RO"            "CD5"               "CD56"             
## [28] "CD62L"             "CD66b"             "CD69"             
## [31] "CD7"               "CD77"              "CD8"              
## [34] "CTLA4"             "EpCAM (CD326)"     "HLA-A,B,C"        
## [37] "IL7Ralpha (CD127)" "IgG1"              "IgG2a"            
## [40] "LAMP1"             "MHCII (HLA-DR)"    "Ox40 (CD134)"     
## [43] "PD-1 (CD279)"      "PD-L1 (CD274)"     "PD1 (CD279)"      
## [46] "PECAM (CD31)"      "Siglec-8"          "TCRb"             
## [49] "TCRg"

df <- data.frame(Antigen = rownames(altExp(sce_citeseq, "ADT")))
df <- matchToCiteseq(df)

# Print entries that differ from the original
df[df$Antigen != df$Antigen_std, ]

## # A tibble: 21 × 3
##    Antigen_std Antigen           n_matched
##    <chr>       <chr>                 <int>
##  1 B220        B220 (CD45R)              6
##  2 B7-H1       B7-H1 (PD-L1)             1
##  3 CD117       C-kit (CD117)            18
##  4 CD197       CCR7                     26
##  5 CD223       CD223 (lag3)             18
##  6 CD26        CD26 (Adenosine)         17
##  7 CD366       CD366 (tim3)             18
##  8 CD152       CTLA4                    18
##  9 CD326       EpCAM (CD326)            11
## 10 CD127       IL7Ralpha (CD127)        29
## # … with 11 more rows

# Print the entries with the fewest matches
head(df[order(df$n_matched),], 10)

## # A tibble: 10 × 3
##    Antigen_std Antigen       n_matched
##    <chr>       <chr>             <int>
##  1 B7-H1       B7-H1 (PD-L1)         1
##  2 CD77        CD77                  3
##  3 B220        B220 (CD45R)          6
##  4 Siglec-8    Siglec-8              6
##  5 CD326       EpCAM (CD326)        11
##  6 CD1a        CD1a                 12
##  7 CD107a      LAMP1                12
##  8 CD66b       CD66b                14
##  9 CD134       Ox40 (CD134)         14
## 10 TCR a/b     TCRb                 14

Among the entries that differ from the original (not all of which are printed above), the consensus name for “IgG1” is “Mouse IgG1, kappa isotype Ctrl”. This is because when other studies refer to “IgG1”, we can tell from the antibody clone ID that they are referring to an isotype control with unknown specificity. Without further information, we cannot tell whether the antibody in the CiteFuse example data is an isotype control or an anti-human secondary antibody.

Similarly, for TCRb and TCRg, matching antigens by clone ID shows that these are more commonly referred to as TCR a/b (alpha/beta) and TCR g/d (gamma/delta) respectively.
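If the suggested names are accepted, they still need to be written back to the alternative experiment. One way to do this, sketched below on a toy object rather than the CiteFuse data, is to assign to the row names of the altExp directly using the SingleCellExperiment accessors. The toy counts are arbitrary, the two name pairs are copied from the table above, and the object names are illustrative.

```r
library(SingleCellExperiment)

cells <- c("cell1", "cell2")

# Toy main experiment and toy ADT experiment (count values are arbitrary)
sce <- SingleCellExperiment(assays = list(
    counts = matrix(1:4, nrow = 2, dimnames = list(c("gene1", "gene2"), cells))))
adt <- SingleCellExperiment(assays = list(
    counts = matrix(5:8, nrow = 2, dimnames = list(c("CCR7", "CTLA4"), cells))))
altExp(sce, "ADT") <- adt

# Original -> consensus names, copied from the table above
new_nms <- c(CCR7 = "CD197", CTLA4 = "CD152")

# Replace the row names of the alternative experiment in place
old <- rownames(altExp(sce, "ADT"))
rownames(altExp(sce, "ADT")) <- unname(new_nms[old])
print(rownames(altExp(sce, "ADT")))
```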

Check matching entries in CITEseq data set

Suppose, as in the above example, we are not sure whether a consensus name is correct. We may wish to examine the matching entries in the citeseq data set. One way to do this is to load the citeseq data set and check the matching entries using the function filter_by_union.

# We load dplyr to simplify data.frame manipulations.
library(dplyr)

# Load citeseq data set
data(citeseq)

tcrb <- citeseq %>%
    # We will first select entries that equal "TCRb"
    dplyr::filter(Antigen == "TCRb") %>%
    # Then subset to just the columns we want to use for matching
    dplyr::select(Antigen, Clone, Cat_Number, ALT_ID)

# Then we will find entries in citeseq that match any of these columns.

# filter_by_union is a function that returns rows where the value of any
# column occurs in a reference data.frame
citeseq %>%
    filter_by_union(tcrb) %>%
    # Select just columns of interest for viewing
    dplyr::select(Study, Antigen, Clone, Cat_Number, ALT_ID)

##                Study                    Antigen Clone Cat_Number
## 1           Hao_2021             TCR alpha/beta  IP26     306737
## 2           Liu_2021             TCR alpha/beta  IP26     306743
## 3        Nathan_2021 TCR alpha/beta (TRAC/TRBC)  IP26     306737
## 4       Mimitou_2021                    TCR a/b  IP26     306737
## 5          Qian_2020                    TCR a/b  IP26     306737
## 6    Stephenson_2021                    TCR a/b  IP26     306743
## 7            Su_2020                    TCR a/b  IP26     306743
## 8         LeCoz_2021                    TCR a/b  IP26     306743
## 9            Wu_2021                      TCRab  IP26     306737
## 10 PomboAntunes_2021                      TCRab  IP26     306737
## 11         Buus_2021                      TCRab  IP26     306743
## 12          Hao_2021                       TCRb  IP26       <NA>
## 13      Mimitou_2019                       TCRb  IP26       <NA>
##                    ALT_ID
## 1  HGNC:12027, HGNC:12155
## 2  HGNC:12027, HGNC:12155
## 3  HGNC:12027, HGNC:12155
## 4  HGNC:12027, HGNC:12155
## 5  HGNC:12027, HGNC:12155
## 6  HGNC:12027, HGNC:12155
## 7  HGNC:12027, HGNC:12155
## 8  HGNC:12027, HGNC:12155
## 9  HGNC:12027, HGNC:12155
## 10 HGNC:12027, HGNC:12155
## 11 HGNC:12027, HGNC:12155
## 12             HGNC:12155
## 13             HGNC:12155

From the above result we can confirm that two studies have an antigen named “TCRb”. All of the matching entries use antibody clone IP26, and most studies refer to this antibody as TCR a/b, TCRab or TCR alpha/beta.
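To make the idea of "matching on any column" concrete, here is a base-R sketch of a union filter followed by a tally of the names in use. It is an illustration of the concept only, not the implementation of filter_by_union. The helper union_filter, the toy table and its last row are hypothetical, and treating NA as a non-match is an assumption of this sketch.

```r
# Toy reference table; the first four rows are copied from the output above,
# the last row is a hypothetical unrelated entry
toy <- data.frame(
    Study   = c("Hao_2021", "Qian_2020", "Wu_2021", "Mimitou_2019", "Other"),
    Antigen = c("TCR alpha/beta", "TCR a/b", "TCRab", "TCRb", "CD13"),
    Clone   = c("IP26", "IP26", "IP26", "IP26", NA)
)

# Query rows: entries named exactly "TCRb"
ref <- toy[toy$Antigen == "TCRb", c("Antigen", "Clone")]

# Keep rows where the value in ANY shared column occurs in ref.
# NA is never treated as a match here (an assumption of this sketch).
union_filter <- function(df, ref) {
    cols <- intersect(colnames(df), colnames(ref))
    hits <- lapply(cols, function(cn) !is.na(df[[cn]]) & df[[cn]] %in% ref[[cn]])
    df[Reduce(`|`, hits), , drop = FALSE]
}

res <- union_filter(toy, ref)

# Tally the names used for the matched antibody
print(sort(table(res$Antigen), decreasing = TRUE))
```

Here the shared clone pulls in the differently named entries, while the unrelated row is dropped.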

A note about the ALT_IDs

The column ALT_ID in the data set gene_aliases is an ID that allows the antigens in the citeseq data set to be distinguished. If the HGNC ID is sufficient, then the ALT_ID will be the HGNC_ID. If not, it may be an ID from e.g. the protein ontology or the National Cancer Institute thesaurus, or if no stable ID was found, the antibody and clone name.
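In the TCR output above, some ALT_ID values hold two HGNC IDs separated by a comma. Reading such values as lists of identifiers is an interpretation of that output rather than documented behaviour. Under that assumption, the sketch below splits two ALT_ID values copied from the output and checks which identifiers they share; split_ids is a hypothetical helper.

```r
# ALT_ID values copied from the TCR output above
alt_tcr_ab <- "HGNC:12027, HGNC:12155"
alt_tcr_b  <- "HGNC:12155"

# Split a comma-separated ALT_ID into its individual identifiers
split_ids <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

ids_ab <- split_ids(alt_tcr_ab)
ids_b  <- split_ids(alt_tcr_b)

# Identifiers that the two entries have in common
shared <- intersect(ids_b, ids_ab)
print(shared)
```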

Session Info

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.0.10                CiteFuse_1.8.0             
##  [3] AbNames_0.2.0               scRNAseq_2.10.0            
##  [5] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
##  [7] Biobase_2.56.0              GenomicRanges_1.48.0       
##  [9] GenomeInfoDb_1.32.4         IRanges_2.30.1             
## [11] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## [13] MatrixGenerics_1.8.1        matrixStats_0.62.0         
## [15] BiocStyle_2.24.0           
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                    tidyselect_1.1.2             
##   [3] RSQLite_2.2.17                AnnotationDbi_1.58.0         
##   [5] grid_4.2.1                    BiocParallel_1.30.3          
##   [7] Rtsne_0.16                    munsell_0.5.0                
##   [9] ScaledMatrix_1.4.1            codetools_0.2-18             
##  [11] ragg_1.2.2                    statmod_1.4.37               
##  [13] scran_1.24.1                  withr_2.5.0                  
##  [15] colorspace_2.0-3              filelock_1.0.2               
##  [17] knitr_1.40                    GenomeInfoDbData_1.2.8       
##  [19] polyclip_1.10-0               bit64_4.0.5                  
##  [21] farver_2.1.1                  pheatmap_1.0.12              
##  [23] rhdf5_2.40.0                  rprojroot_2.0.3              
##  [25] vctrs_0.4.1                   generics_0.1.3               
##  [27] xfun_0.33                     BiocFileCache_2.4.0          
##  [29] randomForest_4.7-1.1          R6_2.5.1                     
##  [31] graphlayouts_0.8.1            rsvd_1.0.5                   
##  [33] locfit_1.5-9.6                AnnotationFilter_1.20.0      
##  [35] bitops_1.0-7                  rhdf5filters_1.8.0           
##  [37] cachem_1.0.6                  DelayedArray_0.22.0          
##  [39] assertthat_0.2.1              promises_1.2.0.1             
##  [41] BiocIO_1.6.0                  scales_1.2.1                 
##  [43] ggraph_2.0.6                  gtable_0.3.1                 
##  [45] beachmat_2.12.0               tidygraph_1.2.2              
##  [47] ensembldb_2.20.2              rlang_1.0.5                  
##  [49] systemfonts_1.0.4             splines_4.2.1                
##  [51] rtracklayer_1.56.1            lazyeval_0.2.2               
##  [53] BiocManager_1.30.18           yaml_2.3.5                   
##  [55] reshape2_1.4.4                GenomicFeatures_1.48.3       
##  [57] httpuv_1.6.6                  tools_4.2.1                  
##  [59] bookdown_0.29                 ggplot2_3.3.6                
##  [61] ellipsis_0.3.2                jquerylib_0.1.4              
##  [63] RColorBrewer_1.1-3            ggridges_0.5.3               
##  [65] Rcpp_1.0.9                    plyr_1.8.7                   
##  [67] sparseMatrixStats_1.8.0       progress_1.2.2               
##  [69] zlibbioc_1.42.0               purrr_0.3.4                  
##  [71] RCurl_1.98-1.8                prettyunits_1.1.1            
##  [73] dbscan_1.1-10                 viridis_0.6.2                
##  [75] cowplot_1.1.1                 ggrepel_0.9.1                
##  [77] cluster_2.1.4                 fs_1.5.2                     
##  [79] magrittr_2.0.3                ProtGenerics_1.28.0          
##  [81] hms_1.1.2                     mime_0.12                    
##  [83] evaluate_0.16                 xtable_1.8-4                 
##  [85] XML_3.99-0.10                 gridExtra_2.3                
##  [87] compiler_4.2.1                biomaRt_2.52.0               
##  [89] tibble_3.1.8                  crayon_1.5.1                 
##  [91] htmltools_0.5.3               segmented_1.6-0              
##  [93] later_1.3.0                   propr_4.2.6                  
##  [95] tidyr_1.2.1                   DBI_1.1.3                    
##  [97] tweenr_2.0.2                  ExperimentHub_2.4.0          
##  [99] dbplyr_2.2.1                  MASS_7.3-58.1                
## [101] rappdirs_0.3.3                Matrix_1.5-1                 
## [103] cli_3.4.0                     metapod_1.4.0                
## [105] parallel_4.2.1                igraph_1.3.4                 
## [107] pkgconfig_2.0.3               pkgdown_2.0.6                
## [109] GenomicAlignments_1.32.1      scuttle_1.6.3                
## [111] xml2_1.3.3                    bslib_0.4.0                  
## [113] dqrng_0.3.0                   XVector_0.36.0               
## [115] stringr_1.4.1                 digest_0.6.29                
## [117] Biostrings_2.64.1             rmarkdown_2.16               
## [119] uwot_0.1.14                   edgeR_3.38.4                 
## [121] DelayedMatrixStats_1.18.0     restfulr_0.0.15              
## [123] curl_4.3.2                    kernlab_0.9-31               
## [125] shiny_1.7.2                   Rsamtools_2.12.0             
## [127] rjson_0.2.21                  lifecycle_1.0.2              
## [129] nlme_3.1-159                  jsonlite_1.8.0               
## [131] Rhdf5lib_1.18.2               BiocNeighbors_1.14.0         
## [133] desc_1.4.2                    viridisLite_0.4.1            
## [135] limma_3.52.3                  fansi_1.0.3                  
## [137] pillar_1.8.1                  lattice_0.20-45              
## [139] KEGGREST_1.36.3               fastmap_1.1.0                
## [141] httr_1.4.4                    survival_3.4-0               
## [143] interactiveDisplayBase_1.34.0 glue_1.6.2                   
## [145] png_0.1-7                     bluster_1.6.0                
## [147] BiocVersion_3.15.2            bit_4.0.4                    
## [149] ggforce_0.3.4                 stringi_1.7.8                
## [151] sass_0.4.2                    mixtools_1.2.0               
## [153] blob_1.2.3                    textshaping_0.3.6            
## [155] BiocSingular_1.12.0           AnnotationHub_3.4.0          
## [157] memoise_2.0.1                 irlba_2.3.5

Helen Lindsay

2022-07-11