Introduction to AbNames

Overview

AbNames performs three tasks:

Formatting antibody names for querying gene aliases,
Matching human antibody names to gene names, IDs and protein complex IDs, and
Standardising antibody names to match a reference data set

Antibody names are not reported in a consistent format in published data. Antibodies are often named according to the antigen they target, which may not be the same as the name of the protein (complex) the antigen is part of. Antibodies may target multi-subunit protein complexes, and this can be reflected in the name, e.g. an antibody against the T-cell receptor alpha and beta subunits might be named TCRab. Antibody names may also include the name of the clone the antibody is derived from or the names of fluorophores or DNA-oligos the antibody is conjugated with. For these reasons, it can be difficult to exactly match antibody names with gene or protein names. As cell surface antigens often have very similar names, searching for partial matches to names in free-text gene descriptions is challenging and error-prone.

Data

AbNames contains several curated gene name data sets for matching to antibody names.

Gene aliases

The gene_aliases data set is primarily based on the protein-coding and gene groups tables from the HUGO Gene Nomenclature Committee (HGNC). These include previous (obsolete) gene names and aliases, which in our experience are useful for matching to antibody names. We have added non-ambiguous gene aliases from Ensembl (fetched via the Bioconductor package biomaRt) and NCBI (fetched via the NCBI FTP site and the Bioconductor package org.Hs.eg.db), and non-ambiguous protein names from the Cell Surface Protein Atlas (CSPA) and the CellMarker database (http://xteam.xbio.top/CellMarker/). Mappings between HGNC, Ensembl and NCBI (Entrez) IDs are mostly based on HGNC, with some obsolete Ensembl IDs corrected using the Ensembl data.

The gene_aliases data set is in long format (one alias per row).

Load using:

library(AbNames)
library(dplyr)

data("gene_aliases", package = "AbNames") 

# Show the first entries of gene_aliases,
# where each row is the start of one column
dplyr::glimpse(gene_aliases) 
#> Rows: 131,444
#> Columns: 10
#> $ HGNC_ID     <chr> "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100"…
#> $ ENSEMBL_ID  <chr> "ENSG00000110881", "ENSG00000110881", "ENSG00000110881", "…
#> $ UNIPROT_ID  <chr> "P78348", "P78348", "P78348", "P78348", "P78348", "P78348"…
#> $ HGNC_SYMBOL <chr> "ASIC1", "ASIC1", "ASIC1", "ASIC1", "ASIC1", "ASIC1", "ASI…
#> $ ENTREZ_ID   <chr> "41", "41", "41", "41", "41", "41", "41", "41", "41", "599…
#> $ BIOTYPE     <chr> "protein_coding", "protein_coding", "protein_coding", "pro…
#> $ symbol_type <chr> "ALIAS", "ALIAS", "HGNC_NAME", "HGNC_SYMBOL", "PREVIOUS_NA…
#> $ value       <chr> "BNaC2", "hBNaC2", "acid sensing ion channel subunit 1", "…
#> $ ALT_ID      <chr> "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100"…
#> $ SOURCE      <chr> "HGNC", "HGNC", "HGNC", "HGNC", "HGNC", "HGNC", "HGNC", "H…

# (Note: it isn't necessary to use dplyr:: to call "glimpse" as dplyr is loaded
# with the library call above. This syntax is used to make it clear which
# packages functions belong to)
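
Because gene_aliases is in long format (one alias per row), an exact lookup of an antibody name reduces to matching against a single column. The following is a minimal base R sketch of that idea, assuming a toy table (toy_aliases) that only mimics the layout of gene_aliases; it is not the real data set, and the two aliases shown are taken from antigen names that appear later in this document.

```r
# Toy stand-in for a long-format alias table: one alias per row.
# (Illustrative values only, not an extract of gene_aliases.)
toy_aliases <- data.frame(
    HGNC_SYMBOL = c("CD274", "CD274", "CD274", "PDCD1"),
    symbol_type = c("HGNC_SYMBOL", "ALIAS", "ALIAS", "HGNC_SYMBOL"),
    value       = c("CD274", "PD-L1", "B7-H1", "PDCD1"),
    stringsAsFactors = FALSE
)

# Exact lookup of antibody names in the "value" column
ab_names <- c("PD-L1", "B7-H1", "PDL1")
hits <- toy_aliases$HGNC_SYMBOL[match(ab_names, toy_aliases$value)]

# "PDL1" is not found, because only exact matches are considered
data.frame(ab_names, hits)
```

The unmatched "PDL1" illustrates why antibody names usually need to be reformatted before querying.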

BioLegend antibodies

BioLegend is a major supplier of antibodies, and provides several antibody panels for CITE-seq analyses. The data set “totalseq” is a re-formatted version of the TotalSeq barcodes data sheets available from the BioLegend website, including BioLegend antibody names and Ensembl gene IDs. The isotypes of the antibodies are not included in the TotalSeq data sheets, and only human antibodies and isotype controls are included. Missing and incomplete Ensembl IDs have been manually corrected.

Load using:

data("totalseq", package = "AbNames")
dplyr::glimpse(totalseq)
#> Rows: 977
#> Columns: 13
#> $ Cat_Number       <chr> "305239", "305443", "329743", "329619", "309413", "33…
#> $ Oligo_ID         <chr> "0005", "0006", "0007", "0008", "0009", "0010", "0014…
#> $ Antigen          <chr> "CD80", "CD86", "CD274", "CD273", "CD275", "CD276", "…
#> $ Clone            <chr> "2D10", "IT2.2", "29E.2A3", "24F.10C12", "2D3", "DCN.…
#> $ TotalSeq_Cat     <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
#> $ Reactivity       <chr> "Human", "Human, African Green, Baboon, Capuchin Monk…
#> $ Cross_Reactivity <chr> "Rhesus", NA, NA, NA, NA, NA, "Chimpanzee, Baboon, Cy…
#> $ Barcode_Sequence <chr> "ACGAATCAATCTGTG", "GTCTTTGTCAGTGCA", "GTTGTCCGACAATA…
#> $ Date_Released    <chr> "07/13/2018", "06/08/2018", "07/24/2018", "06/08/2018…
#> $ ENSEMBL_ID       <chr> "ENSG00000121594", "ENSG00000114013", "ENSG0000012021…
#> $ HGNC_ID          <chr> "HGNC:1700", "HGNC:1705", "HGNC:17635", "HGNC:18731",…
#> $ HGNC_SYMBOL      <chr> "CD80", "CD86", "CD274", "PDCD1LG2", "ICOSLG", "CD276…
#> $ ALT_ID           <chr> "HGNC:1700", "HGNC:1705", "HGNC:17635", "HGNC:18731",…

Note that the “Antigen” column here refers to antibody names with prefixes such as “anti-human” removed.

CITE-seq antibodies

This is a table matching antibody names to gene and protein IDs from more than 20 publicly available CITE-seq data sets. These are data sets that we have worked with; we would be happy to add others if provided. The AbNames package collects the functions that were used to create this table. This table (not included yet) includes manually curated matches between antibody names and gene IDs for cases where we were unable to find an exact match to the antibody name provided. Isotype controls were manually identified based on either the name, e.g. “IgG1 Isotype Ctrl”, or the reactivity, e.g. “Mouse IgG2b”.

Load using:

# As the citeseq data set contains raw data, it is loaded differently
# than the other data sets 

citeseq_fname <- system.file("extdata", "citeseq.csv", package = "AbNames")
citeseq <- read.csv(citeseq_fname) %>% unique()
dplyr::glimpse(citeseq)
#> Rows: 2,760
#> Columns: 11
#> $ Antigen         <chr> "2B4", "4.1BB", "4.1BBL", "Annexin V", "anti-c-Met", "…
#> $ Study           <chr> "Wu_2021", "Wu_2021", "Wu_2021", "Kotliarov_2020", "Li…
#> $ Clone           <chr> "C1.7", "4B4-1", "5F4", NA, "12.1", "12.1", "PE001", "…
#> $ Cat_Number      <chr> "329527", "309835", "311509", "custom made (similar to…
#> $ Oligo_ID        <chr> "0189", "0355", "0022", "0025", "1055", NA, "0911", "0…
#> $ TotalSeq_Cat    <chr> "A", "A", "A", "A", "C", NA, "A", "A", "C", "A", "A", …
#> $ Vendor          <chr> "BioLegend", "BioLegend", "BioLegend", "BioLegend", "B…
#> $ Control         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ RRID            <chr> "AB_2750007", "AB_2783173", "AB_2734284", NA, NA, NA, …
#> $ Lot             <chr> NA, NA, NA, "B270560", NA, NA, NA, NA, NA, NA, NA, "B2…
#> $ Custom_Antibody <lgl> NA, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# TO DO: AS THIS IS ONLY A SUBSET OF THE CITESEQ INFO, THERE ARE APPARENT
# DUPLICATES, DECIDE WHAT TO DO WITH THESE

Here the column “Antigen” refers to the name given by the authors of the study in the table of resources used. Some minimal formatting has been done, for example fixing Greek characters that were accidentally transformed upon data import.

Creating a query table

To illustrate the problem with matching antibody names to gene names, let’s look at an example from the citeseq data set.

# The regular expression inside "grepl" searches for PD.L1, PD-L1, PDL1 or CD274

cd274 <- citeseq %>%
    dplyr::filter(grepl("PD[\\.-]?L1|CD274", Antigen)) %>%
    dplyr::pull(Antigen) %>%
    unique()

cd274
#> [1] "CD274"                "CD274 (B7-H1, PD-L1)" "CD274 (PD-L1)"       
#> [4] "PD-L1"                "PD-L1 (CD274)"        "PDL1"                
#> [7] "PDL1 (CD274)"

All of the antigen names above refer to the same cell surface protein.

Default query table

We will demonstrate how to create a default query table using the raw CITE-seq data set. This collects information from the reagents tables provided as supplementary material in the studies used.

# Add an ID column to the citeseq data set to allow the results table to be merged
citeseq <- AbNames::addID(citeseq)
#> ID columns do not uniquely identify rows, row numbers added.

# Remove control antibodies (rows where Control is TRUE), as we are only searching human proteins
controls <- dplyr::filter(citeseq, Control)
citeseq <- dplyr::filter(citeseq, ! Control)

# Select just the columns that are needed for querying:
citeseq_q <- citeseq %>% dplyr::select(ID, Antigen)

# Apply the default transformation sequence and make a query table in long format
query_df <- AbNames::makeQueryTable(citeseq_q, ab = "Antigen")

# Print one example antigen from the query table:
query_df %>%
    dplyr::filter(ID == "TCR alpha/beta__Hao_2021")
#> # A tibble: 6 × 3
#>   ID                       name          value                      
#>   <chr>                    <chr>         <chr>                      
#> 1 TCR alpha/beta__Hao_2021 TCR_long      T cell receptor alpha locus
#> 2 TCR alpha/beta__Hao_2021 TCR_long      T cell receptor beta locus 
#> 3 TCR alpha/beta__Hao_2021 lower_no_dash tcr a/b                    
#> 4 TCR alpha/beta__Hao_2021 greek_letter  TCR a/b                    
#> 5 TCR alpha/beta__Hao_2021 upper_no_dash TCR A/B                    
#> 6 TCR alpha/beta__Hao_2021 Antigen       TCR alpha/beta

The example above shows that each antibody name has been reformatted in several ways. When querying a data set, we search for an exact match to any of these strings.
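
The idea of generating several spellings of a name and accepting an exact match to any of them can be sketched in base R as follows. This is only an illustration under stated assumptions: make_variants is a hypothetical helper written for this example (it is not an AbNames function), and toy_reference is a made-up set of reference strings.

```r
# Hypothetical helper: generate a few spelling variants of one antibody name.
# This is NOT the AbNames implementation, just an illustration of the idea.
make_variants <- function(ab) {
    no_dash <- gsub("[-. ]", "", ab)   # drop dashes, dots and spaces
    unique(c(ab, toupper(no_dash), tolower(no_dash)))
}

# Made-up reference strings to search against
toy_reference <- c("CD274", "PDL1", "SIGLEC8")

variants <- make_variants("Siglec-8")
variants

# Keep any variant that matches a reference string exactly
matched <- variants[variants %in% toy_reference]
matched
```

Only the upper-case, dash-free variant matches here, which is why several reformatted versions of each name are queried.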

Custom query table

AbNames provides a few template functions that can be used to construct a pipeline for creating a query table. The default query table uses the function defaultQuery to create a list of partial functions, which are then applied in sequence to the data.frame using magrittr::freduce. To use this strategy, all functions must accept the data.frame as the only required argument. Other arguments should be filled in when creating the partial functions.

The default function list can be used as a starting point for adding extra formatting functions, or removing or modifying certain steps. The individual formatting functions are introduced below (so far only for the T-cell receptors).

# Get the default sequence of formatting functions 
default_funs <- AbNames::defaultQuery()

# Print the first two formatting functions as an example
default_funs[1:2]
#> [[1]]
#> <partialised>
#> function (...) 
#> gsubAb(ab = "Antigen", ...)
#> 
#> [[2]]
#> <partialised>
#> function (...) 
#> gsubAb(ab = "Antigen", pattern = "\\s[Rr]ecombinant", ...)
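
To make the "list of partial functions" strategy concrete, here is a small base R sketch of the same pattern. It uses closures in place of partialised functions and Reduce in place of magrittr::freduce, so it runs without extra packages. The helper gsub_col and the toy data are assumptions made for this example and are not part of AbNames.

```r
# Hypothetical formatting step: apply gsub to one column of a data.frame.
gsub_col <- function(df, col, pattern, replacement = "") {
    df[[col]] <- gsub(pattern, replacement, df[[col]])
    df
}

# Each step takes the data.frame as its only argument;
# all other arguments are fixed when the step is created.
steps <- list(
    function(df) gsub_col(df, "Antigen", "^anti-human\\s+"),
    function(df) gsub_col(df, "Antigen", "\\s[Rr]ecombinant"),
    function(df) { df$upper <- toupper(df$Antigen); df }
)

toy <- data.frame(Antigen = c("anti-human CD3", "CD8 Recombinant"),
                  stringsAsFactors = FALSE)

# Apply the steps in order, passing the output of each step to the next
result <- Reduce(function(df, f) f(df), steps, toy)
result
```

If magrittr is installed, magrittr::freduce(toy, steps) should give the same result. Because the steps are just list elements, a custom pipeline can be built by adding, dropping or reordering entries.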

T-cell receptors

We found that querying the HGNC gene description was the easiest way to match the names of antibodies against the T-cell receptor to their gene IDs. (An alternative method is to search for the gene symbol.)

We will start with a data.frame of antibodies against the T-cell receptor from the CITE-seq data set. We have already done some formatting of these names, e.g. for “TCR Va24-Ja18 (iNKT cell)” we have removed the section in brackets. We wish to create a query data.frame where each subunit of the TCR complex appears on a separate line.

Note: this function may cause false matches if the numbering of the gene name does not match that of the antibody.

tcr <- data.frame(Antigen =
                      c("TCR alpha/beta", "TCRab", "TCR gamma/delta", "TCRgd",
                        "TCR g/d", "TCR Vgamma9", "TCR Vg9", "TCR Vd2",
                        "TCR Vdelta2", "TCR Vα24-Jα18", "TCRVa24.Ja18", 
                        "TCR Valpha24-Jalpha18", "TCR Vα7.2", "TCR Va7.2",
                        "TCRa7.2", "TCRVa7.2", "TCR Vbeta13.1", "TCR γ/δ", 
                        "TCR Vβ13.1", "TCR Vγ9", "TCR Vδ2", "TCR α/β",
                        "TCRb", "TCRg"))


# First, we convert the Greek symbols to letters.  
# Note that as we are using replaceGreekSyms in a dplyr pipeline, we don't 
# put quotes around the column name, i.e. Antigen not "Antigen".

tcr <- tcr %>%
    dplyr::mutate(query = 
                    AbNames::replaceGreekSyms(Antigen, replace = "sym2letter"))
 
# Print out a few rows to see the result
tcr %>% 
    dplyr::filter(Antigen %in% c("TCR Vβ13.1", "TCR Vδ2", "TCR α/β"))
#>      Antigen      query
#> 1 TCR Vβ13.1 TCR Vb13.1
#> 2    TCR Vδ2    TCR Vd2
#> 3    TCR α/β    TCR a/b

Now we use the function formatTCR to format the antibody names for querying the gene description field of the HGNC data set.

tcr_f <- AbNames::formatTCR(tcr, tcr = "query")

# Print out the first few rows
tcr_f %>%
    head() 
#>           Antigen           query                    TCR_long
#> 1  TCR alpha/beta  TCR alpha/beta T cell receptor alpha locus
#> 2  TCR alpha/beta  TCR alpha/beta  T cell receptor beta locus
#> 3           TCRab           TCRab T cell receptor alpha locus
#> 4           TCRab           TCRab  T cell receptor beta locus
#> 5 TCR gamma/delta TCR gamma/delta T cell receptor gamma locus
#> 6 TCR gamma/delta TCR gamma/delta T cell receptor delta locus
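
As a rough illustration of what "one subunit per line" means, the sketch below expands single-letter subunit codes into locus descriptions in base R. It is a simplified stand-in, not the formatTCR implementation: expand_tcr and subunit_names are hypothetical names, and only names of the form "TCRab" or "TCR g/d" are handled.

```r
# Hypothetical lookup from one-letter subunit codes to subunit names
subunit_names <- c(a = "alpha", b = "beta", g = "gamma", d = "delta")

# Hypothetical helper: one output row per subunit of a TCR antibody name.
# Only handles single-letter codes, e.g. "TCRab" or "TCR g/d".
expand_tcr <- function(ab) {
    codes <- sub("^TCR\\s*", "", ab)                  # "ab" or "g/d"
    codes <- strsplit(gsub("/", "", codes), "")[[1]]  # c("a", "b")
    data.frame(Antigen = ab,
               TCR_long = paste("T cell receptor", subunit_names[codes], "locus"),
               stringsAsFactors = FALSE)
}

tcr_long <- do.call(rbind, lapply(c("TCRab", "TCR g/d"), expand_tcr))
tcr_long
```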

Querying a dataset

Our pipeline is to query the HGNC and then use the other datasets and the antibody vendor information to find matches for the unmatched antibodies.

By default, we require that matches must be found for all subunits of a multi-subunit protein to avoid incorrect matches.
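
The "all subunits must match" rule can be sketched in base R as below. This is a minimal illustration, assuming a toy results table with placeholder IDs (ab_1, gene_A and so on); it does not reproduce the internals of the AbNames search functions.

```r
# Toy results: one row per subunit, NA where no gene was matched.
# All IDs are placeholders.
toy_results <- data.frame(
    ID      = c("ab_1", "ab_1", "ab_2", "ab_2"),
    subunit = c("alpha", "beta", "alpha", "beta"),
    gene_id = c("gene_A", "gene_B", "gene_A", NA),
    stringsAsFactors = FALSE
)

# TRUE for every row of an ID only if all of its subunits were matched
all_matched <- as.logical(
    ave(!is.na(toy_results$gene_id), toy_results$ID, FUN = all)
)

# ab_2 is dropped entirely, because its beta subunit has no match
complete <- toy_results[all_matched, ]
complete
```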

Here we will query the HGNC data set for the CITE-seq antibodies, using the query table created above.

alias_results <- searchAliases(query_df)
#> Joining, by = "value"

# Print 10 random results:
alias_results %>%
    dplyr::select(ID, name, value, symbol_type) %>%
    dplyr::ungroup() %>%
    dplyr::sample_n(10)
#> # A tibble: 10 × 4
#>    ID                            name                        value       symbo…¹
#>    <chr>                         <chr>                       <chr>       <chr>  
#>  1 CD29 (mouse)__Mimitou_2019    Antigen_split               CD29        ALIAS  
#>  2 CD85j (ILT2)__Stephenson_2021 Antigen_split|upper_no_dash CD85j|CD85… ALIAS  
#>  3 CD27__Stuart_2019             Antigen                     CD27        HGNC_S…
#>  4 CD45RO__Mimitou_2019          Antigen                     CD45RO      NA     
#>  5 Siglec-8__Hao_2021__1         upper_no_dash               SIGLEC8     HGNC_S…
#>  6 CD137L__PomboAntunes_2021     Antigen                     CD137L      ALIAS  
#>  7 CX3CR1__Qian_2020             Antigen                     CX3CR1      HGNC_S…
#>  8 IgM__Mimitou_2021             Ig                          IGHM        HGNC_S…
#>  9 CD335 (NKp46)__Mimitou_2021   Antigen_split               NKp46       CELLMA…
#> 10 CD268 (BAFF-R)__LeCoz_2021    Antigen_split|upper_no_dash BAFF-R|BAF… ALIAS  
#> # … with abbreviated variable name ¹symbol_type

The results table above contains (just) the matches between the query table and the HGNC table. In the example above, we see in the “name” column the formatting function that generated the string in the “value” column that was matched, and in the “symbol_type” column which column of the HGNC data was matched. If there are multiple matches for a given ID, the official symbol (“HGNC_SYMBOL”) is preferred over aliases or previous symbols.
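
One way to express "prefer the official symbol" is to rank the symbol types and keep the best-ranked match per ID. The base R sketch below illustrates this on placeholder data. The priority vector is an assumption for the example: the text only states that HGNC_SYMBOL is preferred, so the relative order of the other types is arbitrary here.

```r
# Toy matches with placeholder gene symbols
toy_matches <- data.frame(
    ID          = c("ab_1", "ab_1", "ab_2"),
    symbol_type = c("ALIAS", "HGNC_SYMBOL", "PREVIOUS_SYMBOL"),
    HGNC_SYMBOL = c("GENE1", "GENE2", "GENE3"),
    stringsAsFactors = FALSE
)

# Assumed priority: official symbol first, then the other types
priority <- c("HGNC_SYMBOL", "ALIAS", "PREVIOUS_SYMBOL")
prio_rank <- match(toy_matches$symbol_type, priority)

# Sort by ID and priority, then keep the first (best) row per ID
sorted <- toy_matches[order(toy_matches$ID, prio_rank), ]
best <- sorted[!duplicated(sorted$ID), ]
best
```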

We may wish to review the matches before merging the results into the original table. For example, we can check for matches to different genes where the name was not guessed to be a multi-subunit protein.

alias_results <- alias_results %>%
    dplyr::group_by(ID) %>%
    dplyr::mutate(n_ids = dplyr::n_distinct(HGNC_ID)) # Count distinct IDs
    
# Select antigens where there are matches to multiple genes but not
# because the antibody is against a multi-gene protein 
multi_gene <- alias_results %>%
    dplyr::filter(n_ids > 1, ! all(name %in% c("TCR_long", "subunit"))) %>%
    dplyr::select(ID, name, value, HGNC_ID)

# Look at the first group in multi_gene.
# Set interactive = TRUE for interactive exploration
showGroups(multi_gene, 1, interactive = FALSE)
#> Group 1 of 10: 3 rows
#>                                      ID          name value   HGNC_ID
#> 1 CD11a_CD18 (LFA-1)__PomboAntunes_2021       Antigen CD11a HGNC:6148
#> 2 CD11a_CD18 (LFA-1)__PomboAntunes_2021 upper_no_dash CD11A HGNC:6148
#> 3 CD11a_CD18 (LFA-1)__PomboAntunes_2021 Antigen_split  CD18 HGNC:6155

# This is an example where the antibody is against a heterodimeric protein.
# We can confirm this by looking up the vendor catalogue number,
# using the ID shown in the group above:
citeseq %>%
    dplyr::filter(ID == "CD11a_CD18 (LFA-1)__PomboAntunes_2021") %>%
    dplyr::select(Antigen, Cat_Number, Vendor)

NOTE: I haven’t written a convenience function for merging back into the original table yet.

TODO:

* Query for CD11/CD18 in protein ontology
* Remove redundant results because of Antigen/greek_letter, e.g. “KLRG1 (MAFA)-Qian_2020”

Examples to discuss?

* “KIR2DL5” matches two genes, HGNC:16345 and HGNC:16346
* "DR3 (TRAMP)__Liu_2021": the TRAMP alias is ambiguous

For now I will remove all genes with multiple matches before merging the results into the citeseq data.

nrow(alias_results)
#> [1] 2900

id_cols <- c("HGNC_ID", "HGNC_SYMBOL", "ENSEMBL_ID", "UNIPROT_ID")

# Collapse matches to one row per ID, keeping just the columns of interest

alias_results %>%
    dplyr::select(matches("ID|HGNC"), name) %>% # Select ID and HGNC columns
    unique() %>% # Collapse results with same ID from different queries
    
    # Collapse multi-subunit entries, convert "NA" to NA
    dplyr::summarise(dplyr::across(all_of(id_cols), ~toString(unique(.x)))) %>%
    dplyr::mutate(dplyr::across(all_of(id_cols), ~na_if(.x, "NA")))   
#> # A tibble: 2,608 × 5
#>    ID                         HGNC_ID    HGNC_SYMBOL ENSEMBL_ID      UNIPROT_ID
#>    <chr>                      <chr>      <chr>       <chr>           <chr>     
#>  1 2B4__Wu_2021               HGNC:18171 CD244       ENSG00000122223 Q9BZW8    
#>  2 4.1BB__Wu_2021             HGNC:11924 TNFRSF9     ENSG00000049249 Q07011    
#>  3 4.1BBL__Wu_2021            HGNC:11939 TNFSF9      ENSG00000125657 P41273    
#>  4 Annexin V__Kotliarov_2020  HGNC:543   ANXA5       ENSG00000164111 P08758    
#>  5 B220 (CD45R)__Hao_2021     HGNC:9666  PTPRC       NA              NA        
#>  6 B220 (CD45R)__Mimitou_2019 HGNC:9666  PTPRC       NA              NA        
#>  7 B7-H4__Hao_2021            HGNC:28873 VTCN1       ENSG00000134258 Q7Z7D3    
#>  8 B7-H4__Liu_2021            HGNC:28873 VTCN1       ENSG00000134258 Q7Z7D3    
#>  9 B7-H4__Mimitou_2021        HGNC:28873 VTCN1       ENSG00000134258 Q7Z7D3    
#> 10 B7-H4__Qian_2020           HGNC:28873 VTCN1       ENSG00000134258 Q7Z7D3    
#> # … with 2,598 more rows

nrow(alias_results)
#> [1] 2900

citeseq <- citeseq %>%
    dplyr::left_join(alias_results, by = "ID") %>%
    dplyr::relocate(ID, Antigen, Cat_Number, HGNC_ID) %>%
    unique()

head(citeseq)
#> # A tibble: 6 × 25
#> # Groups:   ID [6]
#>   ID    Antigen Cat_N…¹ HGNC_ID Study Clone Oligo…² Total…³ Vendor Control RRID 
#>   <chr> <chr>   <chr>   <chr>   <chr> <chr> <chr>   <chr>   <chr>  <lgl>   <chr>
#> 1 2B4_… 2B4     329527  HGNC:1… Wu_2… C1.7  0189    A       BioLe… FALSE   AB_2…
#> 2 4.1B… 4.1BB   309835  HGNC:1… Wu_2… 4B4-1 0355    A       BioLe… FALSE   AB_2…
#> 3 4.1B… 4.1BBL  311509  HGNC:1… Wu_2… 5F4   0022    A       BioLe… FALSE   AB_2…
#> 4 Anne… Annexi… custom… HGNC:5… Kotl… NA    0025    A       BioLe… FALSE   NA   
#> 5 anti… anti-c… NA      NA      Liu_… 12.1  1055    C       BioLe… FALSE   NA   
#> 6 anti… anti-c… NA      NA      Step… 12.1  NA      NA      BioLe… FALSE   NA   
#> # … with 14 more variables: Lot <chr>, Custom_Antibody <lgl>, name <chr>,
#> #   value <chr>, ENSEMBL_ID <chr>, UNIPROT_ID <chr>, HGNC_SYMBOL <chr>,
#> #   ENTREZ_ID <chr>, BIOTYPE <chr>, symbol_type <chr>, ALT_ID <chr>,
#> #   SOURCE <chr>, n_matches <int>, n_ids <int>, and abbreviated variable names
#> #   ¹Cat_Number, ²Oligo_ID, ³TotalSeq_Cat

Filling missing information

Filling using the TotalSeq dataset

Another approach to annotating antibodies is to use the totalseq data. We are not confident about the assignment of antibodies to Ensembl identifiers. However, this may not be a problem if the aim is simply to standardise names between datasets or to a common reference.

The function searchTotalseq by default matches sequentially on the columns Cat_Number (catalogue number), Antigen, then Clone. The matching columns may be configured, but column names in the query data set should match those in the totalseq data set.
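
Sequential matching means that each column is only tried for rows that are still unmatched after the previous column. The base R sketch below illustrates this on toy data. It is not the searchTotalseq implementation; toy_ref, toy_query and gene_id are placeholder names invented for the example.

```r
# Toy reference and query tables with placeholder values
toy_ref <- data.frame(
    Cat_Number = c("100", "200", "300"),
    Antigen    = c("CD3", "CD4", "CD8"),
    Clone      = c("c1", "c2", "c3"),
    gene_id    = c("gene_A", "gene_B", "gene_C"),
    stringsAsFactors = FALSE
)
toy_query <- data.frame(
    Cat_Number = c("100", NA, NA, NA),
    Antigen    = c("CD3e", "CD4", "T8", "CD19"),
    Clone      = c("c1", "c2", "c3", "c9"),
    gene_id    = NA_character_,
    stringsAsFactors = FALSE
)

# Try each column in turn, but only for rows that are still unmatched
for (col in c("Cat_Number", "Antigen", "Clone")) {
    todo <- is.na(toy_query$gene_id)
    idx <- match(toy_query[[col]][todo], toy_ref[[col]], incomparables = NA)
    toy_query$gene_id[todo] <- toy_ref$gene_id[idx]
}

toy_query
```

Row 1 is matched by catalogue number, row 2 by antigen, row 3 by clone, and row 4 stays unmatched.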


table(is.na(citeseq$HGNC_ID))
#> 
#> FALSE  TRUE 
#>  2852   101
cs <- citeseq %>%
    searchTotalseq()
#> Joining, by = c("ID", "Antigen", "Cat_Number", "HGNC_ID", "Study", "Clone",
#> "Oligo_ID", "TotalSeq_Cat", "Vendor", "Control", "RRID", "Lot",
#> "Custom_Antibody", "name", "value", "ENSEMBL_ID", "UNIPROT_ID", "HGNC_SYMBOL",
#> "ENTREZ_ID", "BIOTYPE", "symbol_type", "ALT_ID", "SOURCE", "n_matches",
#> "n_ids")
table(is.na(cs$HGNC_ID))
#> 
#> FALSE  TRUE 
#>  2895    58

We can also use the totalseq table to annotate antibodies that could not be annotated using the HGNC data. This could also be done as a first matching step, as above. Here, filling from totalseq reduces the number of antibodies without an HGNC ID from 101 to 80.

count_missing <- function(df){
    dplyr::filter(df, is.na(HGNC_ID)) %>% nrow()
}

data(totalseq)
ts <- totalseq %>%
    dplyr::select(any_of(colnames(citeseq)))

missing <- count_missing(citeseq)

# Fill in IDs where Antigen, Oligo, Clone and TotalSeq category match
citeseq <- citeseq %>%
    dplyr::rows_patch(ts,
                      by = c("Antigen", "Oligo_ID", "Clone", "TotalSeq_Cat"),
                      unmatched = "ignore")

missing_after_ts <- count_missing(citeseq)

# Missing before filling:
missing
#> [1] 101

# Still missing after filling with TotalSeq:
missing_after_ts
#> [1] 80

Inspecting groups before filling

When filling in information from a reference data set, it can be useful to inspect entries where, for example, annotations are inconsistent. showGroups allows users to interactively print group(s) from a grouped data.frame. We give a (non-interactive) example below.

Filling NAs using a reference data set

The function fillByGroup is used to fill in NAs in a grouped data.frame. It differs from tidyr::fill in its treatment of inconsistent values. Whereas tidyr::fill will fill using the first value, fillByGroup offers the option to fill in the most frequent value. This can be useful when filling antibody IDs given the antibody name and clone or catalogue number.
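
The difference between filling with the first value and filling with the most frequent value can be seen in a small base R sketch. This is an illustration only, not the fillByGroup implementation: most_frequent is a hypothetical helper and the data are placeholders.

```r
# Hypothetical helper: most frequent non-NA value of a character vector
most_frequent <- function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) return(NA_character_)
    names(which.max(table(x)))
}

# Catalogue number "100" has inconsistent gene IDs and one NA
toy <- data.frame(
    Cat_Number = c("100", "100", "100", "100", "200"),
    gene_id    = c("gene_B", "gene_A", "gene_A", NA, NA),
    stringsAsFactors = FALSE
)

# Most frequent value per catalogue number
fill_values <- tapply(toy$gene_id, toy$Cat_Number, most_frequent)

# Fill only the NAs, using the value for the row's own group
is_na <- is.na(toy$gene_id)
toy$gene_id[is_na] <- as.vector(fill_values[toy$Cat_Number[is_na]])
toy
```

Filling with the first value would assign gene_B to the fourth row, whereas the mode assigns gene_A. The group with no known value ("200") stays NA.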

To fill using a reference data set, the strategy used is either to add a temporary ID column, join the two data.frames, fill and separate again using the ID column, or use dplyr::rows_patch. We will demonstrate the latter approach below.

# Before filling, check how many antibodies are still unmatched.
original_nmatched <- count_missing(citeseq)

# Select some data to demonstrate filling:
# Get entries sharing the same catalogue number,
# where not every entry has a match
fill_demo <- citeseq %>%
    dplyr::group_by(Cat_Number) %>%
    dplyr::arrange(Cat_Number) %>%
    dplyr::filter(!is.na(Cat_Number), 
                  any(is.na(HGNC_ID)),
                  ! all(is.na(HGNC_ID)))

AbNames::showGroups(fill_demo, interactive = FALSE)
#> Group 1 of 21: 3 rows
#>                              ID    Antigen Cat_Number   HGNC_ID
#> 1 CD45R_B220__PomboAntunes_2021 CD45R_B220     103263 HGNC:9666
#> 2 CD45R-B220__PomboAntunes_2021 CD45R-B220     103263      <NA>
#> 3         CD45R/B220__Qian_2020 CD45R/B220     103263 HGNC:9666
#>               Study   Clone Oligo_ID TotalSeq_Cat    Vendor Control       RRID
#> 1 PomboAntunes_2021 RA3-6B2     0103            A BioLegend   FALSE AB_2734158
#> 2 PomboAntunes_2021 RA3-6B2     0103            A BioLegend   FALSE AB_2734158
#> 3         Qian_2020 RA3-6B2     0103            A BioLegend   FALSE       <NA>
#>    Lot Custom_Antibody    name value      ENSEMBL_ID UNIPROT_ID HGNC_SYMBOL
#> 1 <NA>              NA Antigen CD45R            <NA>       <NA>       PTPRC
#> 2 <NA>              NA    <NA>  <NA>            <NA>       <NA>        <NA>
#> 3 <NA>              NA Antigen CD45R ENSG00000081237       <NA>       PTPRC
#>   ENTREZ_ID BIOTYPE symbol_type          ALT_ID        SOURCE n_matches n_ids
#> 1      <NA>    <NA>        <NA> PTPRC/iso:CD45R MANUAL_LOOKUP         1     1
#> 2      <NA>    <NA>        <NA>            <NA>          <NA>        NA    NA
#> 3      <NA>    <NA>        <NA> PTPRC/iso:CD45R MANUAL_LOOKUP         1     1

In the above example, we can see that for antibodies with catalogue number “103263”, a match was found if the antibody was named “CD45R_B220” or “CD45R/B220” but not if it was named “CD45R-B220”. We can fill in the missing information using fillByGroup:

fill_demo <- AbNames::fillByGroup(fill_demo, "Cat_Number",
                                  fill = c("HGNC_ID", "TotalSeq_Cat", "Vendor",
                                           "ENSEMBL_ID", "UNIPROT_ID"),
                                  multiple = "mode") %>%
    dplyr::group_by(Cat_Number) # Re-group as fillByGroup ungroups

# Print out the first group again
AbNames::showGroups(fill_demo, interactive = FALSE)
#> Group 1 of 21: 3 rows
#>                              ID    Antigen Cat_Number   HGNC_ID
#> 1 CD45R_B220__PomboAntunes_2021 CD45R_B220     103263 HGNC:9666
#> 2 CD45R-B220__PomboAntunes_2021 CD45R-B220     103263 HGNC:9666
#> 3         CD45R/B220__Qian_2020 CD45R/B220     103263 HGNC:9666
#>               Study   Clone Oligo_ID TotalSeq_Cat    Vendor Control       RRID
#> 1 PomboAntunes_2021 RA3-6B2     0103            A BioLegend   FALSE AB_2734158
#> 2 PomboAntunes_2021 RA3-6B2     0103            A BioLegend   FALSE AB_2734158
#> 3         Qian_2020 RA3-6B2     0103            A BioLegend   FALSE       <NA>
#>    Lot Custom_Antibody    name value      ENSEMBL_ID UNIPROT_ID HGNC_SYMBOL
#> 1 <NA>              NA Antigen CD45R ENSG00000081237       <NA>       PTPRC
#> 2 <NA>              NA    <NA>  <NA> ENSG00000081237       <NA>        <NA>
#> 3 <NA>              NA Antigen CD45R ENSG00000081237       <NA>       PTPRC
#>   ENTREZ_ID BIOTYPE symbol_type          ALT_ID        SOURCE n_matches n_ids
#> 1      <NA>    <NA>        <NA> PTPRC/iso:CD45R MANUAL_LOOKUP         1     1
#> 2      <NA>    <NA>        <NA>            <NA>          <NA>        NA    NA
#> 3      <NA>    <NA>        <NA> PTPRC/iso:CD45R MANUAL_LOOKUP         1     1

In the above call to fillByGroup, we set multiple = “mode”, so where values within a group are inconsistent, NAs are filled with the most frequent value.

Now we will fill the gene information in the citeseq data similarly, and check how many antibodies have not been matched.

# We fill by grouping by catalogue number:
citeseq <- citeseq %>%
     AbNames::fillByGroup("Cat_Number", multiple = "mode",
                          fill = c("HGNC_ID", "TotalSeq_Cat", "Vendor",
                                   "ENSEMBL_ID", "UNIPROT_ID"))

nmatched_after_fill <- count_missing(citeseq)

print("Before filling:")
#> [1] "Before filling:"
original_nmatched
#> [1] 80
print("After filling:")
#> [1] "After filling:"
nmatched_after_fill
#> [1] 37

Notes about the gene aliases data set

The gene aliases data set is based on the annotation from the HUGO Gene Nomenclature Committee (HGNC). Annotation databases such as Ensembl and Entrez map between gene models independently, using different criteria, so they do not always agree on mappings between identifiers. In the gene aliases data set the HGNC mappings are given, and these were used when joining data from other sources. AbNames focuses on matching names rather than identifiers, so we chose the official naming organisation as our reference. As a consequence, using an Ensembl, Entrez or UniProt ID to search for gene aliases may not work.

We found that the mappings between HGNC and Ensembl gene IDs almost always agreed between sources. The HGNC uses Swiss-Prot UniProt IDs. When joining data from other sources, we tried to use two points of agreement, e.g. agreement between one type of ID and the official symbol, or agreement between two types of ID despite disagreement about the official symbol.
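
The "two points of agreement" rule can be sketched as counting how many fields two sources agree on. The base R example below is a simplified illustration on row-aligned placeholder data; the actual code in data-raw may differ, and src1, src2 and all values are invented for this example.

```r
# Two toy sources describing the same three genes (placeholder values)
src1 <- data.frame(symbol  = c("G1", "G2", "G3"),
                   hgnc    = c("H1", "H2", "H3"),
                   ensembl = c("E1", "E2", "E3"),
                   stringsAsFactors = FALSE)
src2 <- data.frame(symbol  = c("G1", "G2x", "G3x"),
                   hgnc    = c("H1", "H2", "H3"),
                   ensembl = c("E1", "E2", "E9"),
                   stringsAsFactors = FALSE)

# Number of fields on which the two sources agree, per row
n_agree <- (src1$symbol == src2$symbol) +
    (src1$hgnc == src2$hgnc) +
    (src1$ensembl == src2$ensembl)

# Accept a row when at least two of the three fields agree
accept <- n_agree >= 2
data.frame(src1, n_agree, accept)
```

The second row is accepted because both IDs agree even though the symbols differ; the third row has only one point of agreement and is rejected.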

The code used for creating all of the data sets is available in data-raw. At the moment, we do not have a main file for regenerating the data sets - this is on our to do list!

TO DO

* An example of (interactively) deciding what to do if groups are inconsistent.
* Function to check new annotations against previous
* Isotype controls
    + Add control column to isotype controls to prevent false match to mouse IgG2a
* CD25 (good example, multiple genes), CD270, CD279 matched in HGNC?
* CD279 PD-1 - example where only some are matched
* Fix Hao Clone CD25 - Is it custom? No Oligo or Cat_Number for this one
* Generate camel case regexp?
* Unintuitive to use quoted for splitUnnest?
* Redocument data sets
* HAO CD45 should have catalogue numbers!
* union join back gene ID / get info by antigen
* Put RRID into CITEseq
* Export and demonstrate group_by_any? e.g.
    citeseq <- citeseq %>% group_by_any(c(“Antigen”, “Cat_Number”)
    showGroups(citeseq, 2, interactive = FALSE)
    ungroup(citeseq)
* Add tag for non-human tagging antibodies
* Problems:
    + HGNC_SYMBOL WRONG FOR CD158b (KIR2DL2/L3, NKAT2)? (NKAT2 = only one gene)
    + Triana RRID same for CD235a and CD235a - match via combination of RRID and Antigen
    + Triana group 5… - matching because of NA?
* Update BIOMART entrez ids to match entrez / HGNC??
* Check that UNIPROT IDs are valid (not all from CSPA are up to date)
* Change name of data gene_aliases to just aliases
* Rename “value” to “alias” in aliases table
* Final check for ambiguous aliases after adding proteins
* HGNC: TRAV10 = TCRAV24S1, NCBI:TRAV24 = TCRAV24S1
* Be careful of case when removing ambiguous aliases - Met SLTM / MET
* Hao CD25 - same HGNC ID reported twice HGNC:6008|HGNC:6008 and CD45RA
* KLG / MAFA
* Keep source in ID result table?
* Is TCRg clone B1 = TCRg/d?
* across query columns - na_if equal to antigen and not is antigen
* Add to vignette, how to check results in matchToCITEseq
* translit characters in query table
* find replacement dataset for diamonds in test
* make a pipeline for regenerating gene_aliases
* From gene_aliases, remove entries where everything is identical except symbol_type - prefer alias to alias name, previous_symbol to previous_name etc - cellmarker %>% unique()
* cellmarker Alkaline phosphatase maps to 2 IDs
* symbol_type is NA
* biomart gives different entrez ids?
* CD1b has two different HGNC_IDs
* SIGLEC5
* To do: what is the point of “name” in query table?
* why two rows for CD124|IL4RA in alias_results - aggregate source
* export left_join_any
* make a wrapper for group_by_any which ignores clone duplicates
* when study includes same antigen multiple times, give an underscore # suffix?
* Convenience function - find alternative names for an antibody
* To do: indicate cocktail membership
* Should the ALT_ID be the symbol not the ID to allow easier matching?
* Add checks in makeQueryTable, e.g. for brackets?

Session Info

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.10     AbNames_0.2.0    BiocStyle_2.24.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] bslib_0.4.0         compiler_4.2.1      pillar_1.8.1       
#>  [4] BiocManager_1.30.18 jquerylib_0.1.4     tools_4.2.1        
#>  [7] digest_0.6.29       jsonlite_1.8.0      evaluate_0.16      
#> [10] memoise_2.0.1       lifecycle_1.0.2     tibble_3.1.8       
#> [13] pkgconfig_2.0.3     rlang_1.0.5         DBI_1.1.3          
#> [16] cli_3.4.0           yaml_2.3.5          pkgdown_2.0.6      
#> [19] xfun_0.33           fastmap_1.1.0       stringr_1.4.1      
#> [22] knitr_1.40          desc_1.4.2          generics_0.1.3     
#> [25] fs_1.5.2            sass_0.4.2          vctrs_0.4.1        
#> [28] systemfonts_1.0.4   tidyselect_1.1.2    rprojroot_2.0.3    
#> [31] glue_1.6.2          R6_2.5.1            textshaping_0.3.6  
#> [34] fansi_1.0.3         rmarkdown_2.16      bookdown_0.29      
#> [37] tidyr_1.2.1         purrr_0.3.4         magrittr_2.0.3     
#> [40] ellipsis_0.3.2      htmltools_0.5.3     assertthat_0.2.1   
#> [43] ragg_1.2.2          utf8_1.2.2          stringi_1.7.8      
#> [46] cachem_1.0.6

Helen Lindsay