Overview
AbNames performs three tasks:
Formatting antibody names for querying gene aliases,
Matching human antibody names to gene names, IDs and protein complex IDs, and
Standardising antibody names to match a reference data set
Antibody names are not reported in a consistent format in published data. Antibodies are often named according to the antigen they target, which may not be the same as the name of the protein (complex) the antigen is part of. Antibodies may target multi-subunit protein complexes, and this can be reflected in the name, e.g. an antibody against the T-cell receptor alpha and beta subunits might be named TCRab. Antibody names may also include the name of the clone the antibody is derived from or the names of fluorophores or DNA-oligos the antibody is conjugated with. For these reasons, it can be difficult to exactly match antibody names with gene or protein names. As cell surface antigens often have very similar names, searching for partial matches to names in free-text gene descriptions is challenging and error-prone.
Data
AbNames contains several curated gene name data sets for matching to antibody names.
Gene aliases
The gene_aliases data set is primarily based on the protein-coding and gene groups tables from the HUGO Gene Nomenclature Committee (HGNC). These include previous (obsolete) gene names and aliases, which in our experience have been useful for matching to antibody names. Non-ambiguous gene aliases from Ensembl (fetched via the Bioconductor package biomaRt) and NCBI (fetched via the NCBI FTP site and the Bioconductor package org.Hs.eg.db), and non-ambiguous proteins from the Cell-Surface Protein Atlas (CSPA) and the CellMarker protein database (http://xteam.xbio.top/CellMarker/) have been added. Mappings between HGNC, Ensembl and NCBI (Entrez) IDs are mostly based on HGNC, with some corrections of obsolete Ensembl IDs using the Ensembl data.
The gene_aliases data set is in long format (one alias per row).
Load using:
library(AbNames)
library(dplyr)
data("gene_aliases", package = "AbNames")
# Show the first entries of gene_aliases,
# where each row is the start of one column
dplyr::glimpse(gene_aliases)
#> Rows: 131,444
#> Columns: 10
#> $ HGNC_ID <chr> "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100"…
#> $ ENSEMBL_ID <chr> "ENSG00000110881", "ENSG00000110881", "ENSG00000110881", "…
#> $ UNIPROT_ID <chr> "P78348", "P78348", "P78348", "P78348", "P78348", "P78348"…
#> $ HGNC_SYMBOL <chr> "ASIC1", "ASIC1", "ASIC1", "ASIC1", "ASIC1", "ASIC1", "ASI…
#> $ ENTREZ_ID <chr> "41", "41", "41", "41", "41", "41", "41", "41", "41", "599…
#> $ BIOTYPE <chr> "protein_coding", "protein_coding", "protein_coding", "pro…
#> $ symbol_type <chr> "ALIAS", "ALIAS", "HGNC_NAME", "HGNC_SYMBOL", "PREVIOUS_NA…
#> $ value <chr> "BNaC2", "hBNaC2", "acid sensing ion channel subunit 1", "…
#> $ ALT_ID <chr> "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100", "HGNC:100"…
#> $ SOURCE <chr> "HGNC", "HGNC", "HGNC", "HGNC", "HGNC", "HGNC", "HGNC", "H…
# (Note: it isn't necessary to use dplyr:: to call "glimpse" as dplyr is loaded
# with the library call above. This syntax is used to make it clear which
# packages functions belong to)
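To make the long format concrete, here is a minimal base R sketch of an exact lookup in a toy alias table. The three rows reuse values shown in the glimpse() output above; the helper lookup_alias is hypothetical and is not part of AbNames.

```r
# Toy alias table in long format (one alias per row).
# The values are copied from the glimpse() output; the table is illustrative only.
toy_aliases <- data.frame(
  HGNC_ID     = c("HGNC:100", "HGNC:100", "HGNC:100"),
  HGNC_SYMBOL = c("ASIC1", "ASIC1", "ASIC1"),
  symbol_type = c("ALIAS", "ALIAS", "HGNC_NAME"),
  value       = c("BNaC2", "hBNaC2", "acid sensing ion channel subunit 1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact, case-sensitive lookup in the "value" column
lookup_alias <- function(name, aliases) {
  aliases[aliases$value == name, c("HGNC_ID", "HGNC_SYMBOL", "symbol_type")]
}

hit <- lookup_alias("hBNaC2", toy_aliases)
hit
```

Because every alias has its own row, a single comparison against the value column is enough to recover the official symbol and ID.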
BioLegend antibodies
BioLegend is a major supplier of antibodies, and provides several antibody panels for CITE-seq analyses. The data set “totalseq” is a re-formatted version of the TotalSeq barcodes data sheets available from the BioLegend website, including BioLegend antibody names and Ensembl gene IDs. The isotypes of the antibodies are not included in the TotalSeq data sheets, and only human antibodies and isotype controls are included. Missing and incomplete Ensembl IDs have been manually corrected.
Load using:
data("totalseq", package = "AbNames")
dplyr::glimpse(totalseq)
#> Rows: 977
#> Columns: 13
#> $ Cat_Number <chr> "305239", "305443", "329743", "329619", "309413", "33…
#> $ Oligo_ID <chr> "0005", "0006", "0007", "0008", "0009", "0010", "0014…
#> $ Antigen <chr> "CD80", "CD86", "CD274", "CD273", "CD275", "CD276", "…
#> $ Clone <chr> "2D10", "IT2.2", "29E.2A3", "24F.10C12", "2D3", "DCN.…
#> $ TotalSeq_Cat <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
#> $ Reactivity <chr> "Human", "Human, African Green, Baboon, Capuchin Monk…
#> $ Cross_Reactivity <chr> "Rhesus", NA, NA, NA, NA, NA, "Chimpanzee, Baboon, Cy…
#> $ Barcode_Sequence <chr> "ACGAATCAATCTGTG", "GTCTTTGTCAGTGCA", "GTTGTCCGACAATA…
#> $ Date_Released <chr> "07/13/2018", "06/08/2018", "07/24/2018", "06/08/2018…
#> $ ENSEMBL_ID <chr> "ENSG00000121594", "ENSG00000114013", "ENSG0000012021…
#> $ HGNC_ID <chr> "HGNC:1700", "HGNC:1705", "HGNC:17635", "HGNC:18731",…
#> $ HGNC_SYMBOL <chr> "CD80", "CD86", "CD274", "PDCD1LG2", "ICOSLG", "CD276…
#> $ ALT_ID <chr> "HGNC:1700", "HGNC:1705", "HGNC:17635", "HGNC:18731",…
Note that the “Antigen” column here refers to antibody names with prefixes such as “anti-human” removed.
CITE-seq antibodies
This is a table matching antibody names to gene and protein IDs from more than 20 data sets with publicly available CITE-seq data. These are data sets that we have worked with, and we would be happy to add others if provided. The AbNames package collects the functions that were used to create this table. This table (not included yet) includes manually curated matches between antibody names and gene IDs for cases where we were unable to find an exact match to the antibody name provided. Isotype controls were manually identified based on either the name, e.g. “IgG1 Isotype Ctrl”, or the reactivity, e.g. “Mouse IgG2b”.
Load using:
# As the citeseq data set contains raw data, it is loaded differently
# than the other data sets
citeseq_fname <- system.file("extdata", "citeseq.csv", package = "AbNames")
citeseq <- read.csv(citeseq_fname) %>% unique()
dplyr::glimpse(citeseq)
#> Rows: 2,760
#> Columns: 11
#> $ Antigen <chr> "2B4", "4.1BB", "4.1BBL", "Annexin V", "anti-c-Met", "…
#> $ Study <chr> "Wu_2021", "Wu_2021", "Wu_2021", "Kotliarov_2020", "Li…
#> $ Clone <chr> "C1.7", "4B4-1", "5F4", NA, "12.1", "12.1", "PE001", "…
#> $ Cat_Number <chr> "329527", "309835", "311509", "custom made (similar to…
#> $ Oligo_ID <chr> "0189", "0355", "0022", "0025", "1055", NA, "0911", "0…
#> $ TotalSeq_Cat <chr> "A", "A", "A", "A", "C", NA, "A", "A", "C", "A", "A", …
#> $ Vendor <chr> "BioLegend", "BioLegend", "BioLegend", "BioLegend", "B…
#> $ Control <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ RRID <chr> "AB_2750007", "AB_2783173", "AB_2734284", NA, NA, NA, …
#> $ Lot <chr> NA, NA, NA, "B270560", NA, NA, NA, NA, NA, NA, NA, "B2…
#> $ Custom_Antibody <lgl> NA, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# TO DO: AS THIS IS ONLY A SUBSET OF THE CITESEQ INFO, THERE ARE APPARENT
# DUPLICATES, DECIDE WHAT TO DO WITH THESE
Here the column “Antigen” refers to the name given by the authors of the study in the table of resources used. Some minimal formatting has been done, for example fixing Greek characters that were accidentally transformed upon data import.
Creating a query table
To illustrate the problem with matching antibody names to gene names, let’s look at an example from the citeseq data set.
# The regular expression inside "grepl" searches for PD.L1, PD-L1, PDL1 or CD274
cd274 <- citeseq %>%
dplyr::filter(grepl("PD[\\.-]?L1|CD274", Antigen)) %>%
dplyr::pull(Antigen) %>%
unique()
cd274
#> [1] "CD274" "CD274 (B7-H1, PD-L1)" "CD274 (PD-L1)"
#> [4] "PD-L1" "PD-L1 (CD274)" "PDL1"
#> [7] "PDL1 (CD274)"
All of the antigen names above refer to the same cell surface protein.
Default query table
We will demonstrate how to create a default query table using the raw CITE-seq data set. This collects information from the reagents tables provided as supplementary material in the studies used.
# Add an ID column to the citeseq data set to allow the results table to be merged
citeseq <- AbNames::addID(citeseq)
#> ID columns do not uniquely identify rows, row numbers added.
# Remove control rows (isotype controls), as we are only searching human proteins
controls <- dplyr::filter(citeseq, Control)
citeseq <- dplyr::filter(citeseq, ! Control)
# Select just the columns that are needed for querying:
citeseq_q <- citeseq %>% dplyr::select(ID, Antigen)
# Apply the default transformation sequence and make a query table in long format
query_df <- AbNames::makeQueryTable(citeseq_q, ab = "Antigen")
# Print one example antigen from the query table:
query_df %>%
dplyr::filter(ID == "TCR alpha/beta__Hao_2021")
#> # A tibble: 6 × 3
#> ID name value
#> <chr> <chr> <chr>
#> 1 TCR alpha/beta__Hao_2021 TCR_long T cell receptor alpha locus
#> 2 TCR alpha/beta__Hao_2021 TCR_long T cell receptor beta locus
#> 3 TCR alpha/beta__Hao_2021 lower_no_dash tcr a/b
#> 4 TCR alpha/beta__Hao_2021 greek_letter TCR a/b
#> 5 TCR alpha/beta__Hao_2021 upper_no_dash TCR A/B
#> 6 TCR alpha/beta__Hao_2021 Antigen TCR alpha/beta
The example above shows that each antibody name has been reformatted in several ways. When querying a data set, we search for an exact match to any of these strings.
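The idea of generating several spellings and then matching exactly can be sketched in base R. This is a simplified illustration, not the AbNames implementation: make_variants is a hypothetical helper, the variant names are borrowed from the output above, and the reference vector is a toy stand-in for the alias table.

```r
# Hypothetical helper: build a long-format table with a few spellings per antigen
make_variants <- function(antigen) {
  no_dash <- gsub("-", "", antigen, fixed = TRUE)
  data.frame(
    Antigen = rep(antigen, times = 3),
    name    = rep(c("Antigen", "upper_no_dash", "lower_no_dash"),
                  each = length(antigen)),
    value   = c(antigen, toupper(no_dash), tolower(no_dash)),
    stringsAsFactors = FALSE
  )
}

variants <- make_variants(c("Siglec-8", "PD-L1"))

# Toy reference containing official symbols only (assumption for this sketch)
reference <- c("SIGLEC8", "CD274")

# Keep the variants that match a reference entry exactly
matched <- variants[variants$value %in% reference, ]
matched
```

In this toy example "Siglec-8" is matched through its upper_no_dash variant, whereas "PD-L1" is not matched by reformatting alone, which is why aliases are included in the reference data.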
Custom query table
AbNames provides a few template functions that can be used to construct a pipeline for creating a query table. The default query table uses the function defaultQuery to create a list of partial functions, which are then applied in sequence to the data.frame using magrittr::freduce. To use this strategy, all functions must accept the data.frame as the only required argument. Other arguments should be filled in when creating the partial functions.
The default function list can be used as a starting point for adding extra formatting functions, or for removing or modifying certain steps. The individual formatting functions are introduced below (so far only those for the T-cell receptors).
# Get the default sequence of formatting functions
default_funs <- AbNames::defaultQuery()
# Print the first two formatting functions as an example
default_funs[1:2]
#> [[1]]
#> <partialised>
#> function (...)
#> gsubAb(ab = "Antigen", ...)
#>
#> [[2]]
#> <partialised>
#> function (...)
#> gsubAb(ab = "Antigen", pattern = "\\s[Rr]ecombinant", ...)
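The strategy of building partial functions and applying them one after another can be sketched without any packages. The sketch below uses base R closures and Reduce in place of the partial functions and magrittr::freduce used by AbNames. The helpers gsub_col, partial_step and apply_steps are hypothetical, and the "^anti-" pattern is an assumption made for illustration (the second pattern is taken from the output above).

```r
# Hypothetical formatting step: regex substitution on one column of a data.frame
gsub_col <- function(df, col, pattern, replacement = "") {
  df[[col]] <- gsub(pattern, replacement, df[[col]])
  df
}

# Fix every argument except the data.frame, returning a one-argument function
partial_step <- function(f, ...) {
  args <- list(...)
  function(df) do.call(f, c(list(df), args))
}

steps <- list(
  partial_step(gsub_col, col = "Antigen", pattern = "^anti-"),
  partial_step(gsub_col, col = "Antigen", pattern = "\\s[Rr]ecombinant")
)

# Apply the steps in order, feeding each result into the next step
apply_steps <- function(df, steps) {
  Reduce(function(acc, f) f(acc), steps, df)
}

ab <- data.frame(Antigen = c("anti-c-Met", "CD80 Recombinant"),
                 stringsAsFactors = FALSE)
formatted <- apply_steps(ab, steps)
formatted$Antigen
```

Since each step takes only the data.frame, steps can be added to, removed from or reordered in the list without changing the code that applies them.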
T-cell receptors
We found that querying the HGNC gene description was the easiest way to match the names of antibodies against the T-cell receptor to their gene IDs. (An alternative method is to search for the gene symbol.)
We will start with a data.frame of antibodies against the T-cell receptor from the CITE-seq data set. We have already done some formatting of these names, e.g. for “TCR Va24-Ja18 (iNKT cell)” we have removed the section in brackets. We wish to create a query data.frame where each subunit of the TCR complex appears on a separate line.
Note: this function may cause false matches if the numbering of the gene name does not match that of the antibody.
tcr <- data.frame(Antigen =
c("TCR alpha/beta", "TCRab", "TCR gamma/delta", "TCRgd",
"TCR g/d", "TCR Vgamma9", "TCR Vg9", "TCR Vd2",
"TCR Vdelta2", "TCR Vα24-Jα18", "TCRVa24.Ja18",
"TCR Valpha24-Jalpha18", "TCR Vα7.2", "TCR Va7.2",
"TCRa7.2", "TCRVa7.2", "TCR Vbeta13.1", "TCR γ/δ",
"TCR Vβ13.1", "TCR Vγ9", "TCR Vδ2", "TCR α/β",
"TCRb", "TCRg"))
# First, we convert the Greek symbols to letters.
# Note that as we are using replaceGreekSyms in a dplyr pipeline, we don't
# put quotes around the column name, i.e. Antigen not "Antigen".
tcr <- tcr %>%
dplyr::mutate(query =
AbNames::replaceGreekSyms(Antigen, replace = "sym2letter"))
# Print out a few rows to see the result
tcr %>%
dplyr::filter(Antigen %in% c("TCR Vβ13.1", "TCR Vδ2", "TCR α/β"))
#> Antigen query
#> 1 TCR Vβ13.1 TCR Vb13.1
#> 2 TCR Vδ2 TCR Vd2
#> 3 TCR α/β TCR a/b
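For readers curious how such a replacement could work, here is a minimal base R sketch covering only four Greek letters. It is not the implementation of replaceGreekSyms; greek_to_latin is a hypothetical helper, and the Greek characters are written as Unicode escapes so that the code does not depend on the file encoding.

```r
# Greek symbols (alpha, beta, gamma, delta) and their single-letter replacements
greek <- c("\u03b1", "\u03b2", "\u03b3", "\u03b4")
latin <- c("a", "b", "g", "d")

# Hypothetical helper: replace each Greek symbol with the corresponding letter
greek_to_latin <- function(x) {
  for (i in seq_along(greek)) {
    x <- gsub(greek[i], latin[i], x, fixed = TRUE)
  }
  x
}

converted <- greek_to_latin(c("TCR \u03b1/\u03b2", "TCR V\u03b42", "TCRab"))
converted
```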
Now we use the function formatTCR to format the antibody names for querying the gene description field of the HGNC data set.
tcr_f <- AbNames::formatTCR(tcr, tcr = "query")
# Print out the first few rows
tcr_f %>%
head()
#> Antigen query TCR_long
#> 1 TCR alpha/beta TCR alpha/beta T cell receptor alpha locus
#> 2 TCR alpha/beta TCR alpha/beta T cell receptor beta locus
#> 3 TCRab TCRab T cell receptor alpha locus
#> 4 TCRab TCRab T cell receptor beta locus
#> 5 TCR gamma/delta TCR gamma/delta T cell receptor gamma locus
#> 6 TCR gamma/delta TCR gamma/delta T cell receptor delta locus
Querying a dataset
Our pipeline is to query the HGNC data first, and then to use the other data sets and the antibody vendor information to find matches for the antibodies that remain unmatched.
By default, we require that matches must be found for all subunits of a multi-subunit protein to avoid incorrect matches.
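The all-subunits rule can be sketched in base R as follows. This is an illustration under simplifying assumptions rather than the code used by searchAliases: the query and reference below are toy data, and the delta locus is deliberately left out of the reference to show a rejected partial match.

```r
# Toy query table: one row per subunit of each antibody target
query <- data.frame(
  ID    = c("TCRab", "TCRab", "TCRgd", "TCRgd"),
  value = c("T cell receptor alpha locus", "T cell receptor beta locus",
            "T cell receptor gamma locus", "T cell receptor delta locus"),
  stringsAsFactors = FALSE
)

# Toy reference: the delta locus is missing on purpose
reference <- c("T cell receptor alpha locus",
               "T cell receptor beta locus",
               "T cell receptor gamma locus")

# Flag exact matches, then keep only IDs where every subunit matched
query$matched <- query$value %in% reference
all_matched <- tapply(query$matched, query$ID, all)
complete <- query[query$ID %in% names(all_matched)[all_matched], ]
complete
```

Here TCRgd is dropped even though its gamma subunit matched, because accepting it would annotate a two-subunit target with only one gene.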
Here we will query the HGNC data set for the CITE-seq antibodies, using the query table created above.
alias_results <- searchAliases(query_df)
#> Joining, by = "value"
# Print 10 random results:
alias_results %>%
dplyr::select(ID, name, value, symbol_type) %>%
dplyr::ungroup() %>%
dplyr::sample_n(10)
#> # A tibble: 10 × 4
#> ID name value symbo…¹
#> <chr> <chr> <chr> <chr>
#> 1 CD29 (mouse)__Mimitou_2019 Antigen_split CD29 ALIAS
#> 2 CD85j (ILT2)__Stephenson_2021 Antigen_split|upper_no_dash CD85j|CD85… ALIAS
#> 3 CD27__Stuart_2019 Antigen CD27 HGNC_S…
#> 4 CD45RO__Mimitou_2019 Antigen CD45RO NA
#> 5 Siglec-8__Hao_2021__1 upper_no_dash SIGLEC8 HGNC_S…
#> 6 CD137L__PomboAntunes_2021 Antigen CD137L ALIAS
#> 7 CX3CR1__Qian_2020 Antigen CX3CR1 HGNC_S…
#> 8 IgM__Mimitou_2021 Ig IGHM HGNC_S…
#> 9 CD335 (NKp46)__Mimitou_2021 Antigen_split NKp46 CELLMA…
#> 10 CD268 (BAFF-R)__LeCoz_2021 Antigen_split|upper_no_dash BAFF-R|BAF… ALIAS
#> # … with abbreviated variable name ¹symbol_type
The results table above contains (just) the matches between the query table and the HGNC table. In the example above, we see in the “name” column the formatting function that generated the string in the “value” column that was matched, and in the “symbol_type” column which column of the HGNC data was matched. If there are multiple matches for a given ID, the official symbol (“HGNC_SYMBOL”) is preferred over aliases or previous symbols.
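The preference for official symbols can be illustrated with a small base R sketch. The hits table, its IDs and the two-level ranking are assumptions made for this example; the actual rule applied by AbNames may distinguish more cases.

```r
# Toy matches: antibody "ab1" matches one gene by alias and another by official symbol
# (all IDs are made up for illustration)
hits <- data.frame(
  ID          = c("ab1", "ab1", "ab2"),
  HGNC_ID     = c("HGNC:A", "HGNC:B", "HGNC:C"),
  symbol_type = c("ALIAS", "HGNC_SYMBOL", "PREVIOUS_SYMBOL"),
  stringsAsFactors = FALSE
)

# Rank official symbols first and everything else second
rank <- ifelse(hits$symbol_type == "HGNC_SYMBOL", 1L, 2L)

# Within each ID, keep only the rows with the best available rank
best_rank <- ave(rank, hits$ID, FUN = min)
preferred <- hits[rank == best_rank, ]
preferred
```

An antibody with no official-symbol match, such as "ab2" here, keeps its alias or previous-symbol match.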
We may wish to review the matches before merging the results into the original table. For example, we can check for matches to different genes where the name was not guessed to be a multi-subunit protein.
alias_results <- alias_results %>%
dplyr::group_by(ID) %>%
dplyr::mutate(n_ids = dplyr::n_distinct(HGNC_ID)) # Count distinct IDs
# Select antigens where there are matches to multiple genes but not
# because the antibody is against a multi-gene protein
multi_gene <- alias_results %>%
dplyr::filter(n_ids > 1, ! all(name %in% c("TCR_long", "subunit"))) %>%
dplyr::select(ID, name, value, HGNC_ID)
# Look at the first group in multi_gene.
# Set interactive = TRUE for interactive exploration
showGroups(multi_gene, 1, interactive = FALSE)
#> Group 1 of 10: 3 rows
#> ID name value HGNC_ID
#> 1 CD11a_CD18 (LFA-1)__PomboAntunes_2021 Antigen CD11a HGNC:6148
#> 2 CD11a_CD18 (LFA-1)__PomboAntunes_2021 upper_no_dash CD11A HGNC:6148
#> 3 CD11a_CD18 (LFA-1)__PomboAntunes_2021 Antigen_split CD18 HGNC:6155
# This is an example where the antibody is against a heterodimeric protein.
# We can confirm this by looking up the vendor catalogue number:
citeseq %>%
dplyr::filter(ID == "CD11a_CD18 (LFA-1)__PomboAntunes_2021") %>%
dplyr::select(Antigen, Cat_Number, Vendor)
NOTE: I haven’t written a convenience function for merging back into the original table yet.
TODO:
* Query for CD11/CD18 in protein ontology
* Remove redundant results because of Antigen/greek_letter e.g. “KLRG1 (MAFA)-Qian_2020”
Examples to discuss?
* “KIR2DL5” matches two genes: HGNC:16345, HGNC:16346
* “DR3 (TRAMP)__Liu_2021” - TRAMP alias is ambiguous
For now I will remove all genes with multiple matches before merging the results into the citeseq data.
nrow(alias_results)
#> [1] 2900
id_cols <- c("HGNC_ID", "HGNC_SYMBOL", "ENSEMBL_ID", "UNIPROT_ID")
# Remove matches to several genes, select just columns of interest
alias_results %>%
dplyr::select(matches("ID|HGNC"), name) %>% # Select ID and HGNC columns
unique() %>% # Collapse results with same ID from different queries
# Collapse multi-subunit entries, convert "NA" to NA
dplyr::summarise(dplyr::across(all_of(id_cols), ~toString(unique(.x)))) %>%
dplyr::mutate(dplyr::across(all_of(id_cols), ~na_if(.x, "NA")))
#> # A tibble: 2,608 × 5
#> ID HGNC_ID HGNC_SYMBOL ENSEMBL_ID UNIPROT_ID
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2B4__Wu_2021 HGNC:18171 CD244 ENSG00000122223 Q9BZW8
#> 2 4.1BB__Wu_2021 HGNC:11924 TNFRSF9 ENSG00000049249 Q07011
#> 3 4.1BBL__Wu_2021 HGNC:11939 TNFSF9 ENSG00000125657 P41273
#> 4 Annexin V__Kotliarov_2020 HGNC:543 ANXA5 ENSG00000164111 P08758
#> 5 B220 (CD45R)__Hao_2021 HGNC:9666 PTPRC NA NA
#> 6 B220 (CD45R)__Mimitou_2019 HGNC:9666 PTPRC NA NA
#> 7 B7-H4__Hao_2021 HGNC:28873 VTCN1 ENSG00000134258 Q7Z7D3
#> 8 B7-H4__Liu_2021 HGNC:28873 VTCN1 ENSG00000134258 Q7Z7D3
#> 9 B7-H4__Mimitou_2021 HGNC:28873 VTCN1 ENSG00000134258 Q7Z7D3
#> 10 B7-H4__Qian_2020 HGNC:28873 VTCN1 ENSG00000134258 Q7Z7D3
#> # … with 2,598 more rows
nrow(alias_results)
#> [1] 2900
citeseq <- citeseq %>%
dplyr::left_join(alias_results, by = "ID") %>%
dplyr::relocate(ID, Antigen, Cat_Number, HGNC_ID) %>%
unique()
head(citeseq)
#> # A tibble: 6 × 25
#> # Groups: ID [6]
#> ID Antigen Cat_N…¹ HGNC_ID Study Clone Oligo…² Total…³ Vendor Control RRID
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 2B4_… 2B4 329527 HGNC:1… Wu_2… C1.7 0189 A BioLe… FALSE AB_2…
#> 2 4.1B… 4.1BB 309835 HGNC:1… Wu_2… 4B4-1 0355 A BioLe… FALSE AB_2…
#> 3 4.1B… 4.1BBL 311509 HGNC:1… Wu_2… 5F4 0022 A BioLe… FALSE AB_2…
#> 4 Anne… Annexi… custom… HGNC:5… Kotl… NA 0025 A BioLe… FALSE NA
#> 5 anti… anti-c… NA NA Liu_… 12.1 1055 C BioLe… FALSE NA
#> 6 anti… anti-c… NA NA Step… 12.1 NA NA BioLe… FALSE NA
#> # … with 14 more variables: Lot <chr>, Custom_Antibody <lgl>, name <chr>,
#> # value <chr>, ENSEMBL_ID <chr>, UNIPROT_ID <chr>, HGNC_SYMBOL <chr>,
#> # ENTREZ_ID <chr>, BIOTYPE <chr>, symbol_type <chr>, ALT_ID <chr>,
#> # SOURCE <chr>, n_matches <int>, n_ids <int>, and abbreviated variable names
#> # ¹Cat_Number, ²Oligo_ID, ³TotalSeq_Cat
Filling missing information
Filling using the TotalSeq dataset
Another approach to annotating antibodies is to use the totalseq data. We are not confident about the assignment of antibodies to Ensembl identifiers. However, this may not be a problem if the aim is simply to standardise names between datasets or to a common reference.
The function searchTotalseq by default matches sequentially on the columns Cat_Number (catalogue number), Antigen, then Clone. The matching columns may be configured, but column names in the query data set should match those in the totalseq data set.
table(is.na(citeseq$HGNC_ID))
#>
#> FALSE TRUE
#> 2852 101
cs <- citeseq %>%
searchTotalseq()
#> Joining, by = c("ID", "Antigen", "Cat_Number", "HGNC_ID", "Study", "Clone",
#> "Oligo_ID", "TotalSeq_Cat", "Vendor", "Control", "RRID", "Lot",
#> "Custom_Antibody", "name", "value", "ENSEMBL_ID", "UNIPROT_ID", "HGNC_SYMBOL",
#> "ENTREZ_ID", "BIOTYPE", "symbol_type", "ALT_ID", "SOURCE", "n_matches",
#> "n_ids")
table(is.na(cs$HGNC_ID))
#>
#> FALSE TRUE
#> 2895 58
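Sequential matching can be sketched in base R as below. This is a simplified illustration and not the implementation of searchTotalseq: match_sequential is a hypothetical helper, the reference rows reuse values from the totalseq output shown earlier, and the query rows (including the clone "XYZ") are made up.

```r
# Toy reference with values taken from the totalseq glimpse() output
reference <- data.frame(
  Cat_Number = c("305239", "305443"),
  Antigen    = c("CD80", "CD86"),
  Clone      = c("2D10", "IT2.2"),
  HGNC_ID    = c("HGNC:1700", "HGNC:1705"),
  stringsAsFactors = FALSE
)

# Made-up query rows, each matchable through a different column (or not at all)
query <- data.frame(
  Cat_Number = c("305239", NA, NA, NA),
  Antigen    = c("B7-1", "CD86", "B7-2", "CD4"),
  Clone      = c("2D10", NA, "IT2.2", "XYZ"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try each column in turn, only for rows still unmatched
match_sequential <- function(query, reference,
                             by = c("Cat_Number", "Antigen", "Clone"),
                             id_col = "HGNC_ID") {
  query[[id_col]] <- NA_character_
  for (col in by) {
    todo <- is.na(query[[id_col]]) & !is.na(query[[col]])
    idx <- match(query[[col]][todo], reference[[col]])
    query[[id_col]][todo] <- reference[[id_col]][idx]
  }
  query
}

result <- match_sequential(query, reference)
result
```

Note that match returns only the first matching reference row, so this sketch assumes that each matching column is unique in the reference.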
We can also use the totalseq table to annotate antibodies that could not be annotated using the HGNC data. This could also be done as a first matching step, as above. Here, patching with the totalseq table reduces the number of entries without an HGNC ID from 101 to 80.
count_missing <- function(df){
dplyr::filter(df, is.na(HGNC_ID)) %>% nrow()
}
data(totalseq)
ts <- totalseq %>%
dplyr::select(any_of(colnames(citeseq)))
missing <- count_missing(citeseq)
# Fill in IDs where Antigen, Oligo, Clone and TotalSeq category match
citeseq <- citeseq %>%
dplyr::rows_patch(ts,
by = c("Antigen", "Oligo_ID", "Clone", "TotalSeq_Cat"),
unmatched = "ignore")
missing_after_ts <- count_missing(citeseq)
# Missing before filling:
missing
#> [1] 101
# Still missing after filling with TotalSeq:
missing_after_ts
#> [1] 80
Inspecting groups before filling
When filling in information from a reference data set, it can be useful to look at entries where, for example, annotations are inconsistent. showGroups is a function that allows users to interactively print group(s) from a grouped data.frame. We give a (non-interactive) example below.
Filling NAs using a reference data set
The function fillByGroup is used to fill in NAs in a grouped data.frame. It differs from tidyr::fill in its treatment of inconsistent values. Whereas tidyr::fill will fill using the first value, fillByGroup offers the option to fill in the most frequent value. This can be useful when filling antibody IDs given the antibody name and clone or catalogue number.
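The difference between the two filling rules can be shown with a base R sketch. It is not the implementation of fillByGroup or tidyr::fill: fill_by, mode_value and first_value are hypothetical helpers, and the group labels and IDs are made up.

```r
# Most frequent non-NA value (ties resolved by sorted order of the values)
mode_value <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  names(which.max(table(x)))
}

# First non-NA value
first_value <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else x[1]
}

# Hypothetical helper: fill NAs in x within groups using the rule given by FUN
fill_by <- function(x, group, FUN) {
  filler <- ave(x, group, FUN = FUN)
  ifelse(is.na(x), filler, x)
}

cat_number <- c("A", "A", "A", "A", "B")
hgnc_id    <- c("HGNC:X", NA, "HGNC:Y", "HGNC:Y", NA)

filled_mode  <- fill_by(hgnc_id, cat_number, mode_value)
filled_first <- fill_by(hgnc_id, cat_number, first_value)
filled_mode
filled_first
```

In group "A" the missing entry becomes "HGNC:Y" under the mode rule but "HGNC:X" under the first-value rule, while group "B" stays NA because it has no value to fill from.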
To fill using a reference data set, the strategy used is either to add a temporary ID column, join the two data.frames, fill and separate again using the ID column, or use dplyr::rows_patch. We will demonstrate the latter approach below.
# Before filling, check how many antibodies are still unmatched.
original_nmatched <- count_missing(citeseq)
# Select some data to demonstrate filling:
# Get entries sharing the same catalogue number,
# where not every entry has a match
fill_demo <- citeseq %>%
dplyr::group_by(Cat_Number) %>%
dplyr::arrange(Cat_Number) %>%
dplyr::filter(!is.na(Cat_Number),
any(is.na(HGNC_ID)),
! all(is.na(HGNC_ID)))
AbNames::showGroups(fill_demo, interactive = FALSE)
#> Group 1 of 21: 3 rows
#> ID Antigen Cat_Number HGNC_ID
#> 1 CD45R_B220__PomboAntunes_2021 CD45R_B220 103263 HGNC:9666
#> 2 CD45R-B220__PomboAntunes_2021 CD45R-B220 103263 <NA>
#> 3 CD45R/B220__Qian_2020 CD45R/B220 103263 HGNC:9666
#> Study Clone Oligo_ID TotalSeq_Cat Vendor Control RRID
#> 1 PomboAntunes_2021 RA3-6B2 0103 A BioLegend FALSE AB_2734158
#> 2 PomboAntunes_2021 RA3-6B2 0103 A BioLegend FALSE AB_2734158
#> 3 Qian_2020 RA3-6B2 0103 A BioLegend FALSE <NA>
#> Lot Custom_Antibody name value ENSEMBL_ID UNIPROT_ID HGNC_SYMBOL
#> 1 <NA> NA Antigen CD45R <NA> <NA> PTPRC
#> 2 <NA> NA <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> NA Antigen CD45R ENSG00000081237 <NA> PTPRC
#> ENTREZ_ID BIOTYPE symbol_type ALT_ID SOURCE n_matches n_ids
#> 1 <NA> <NA> <NA> PTPRC/iso:CD45R MANUAL_LOOKUP 1 1
#> 2 <NA> <NA> <NA> <NA> <NA> NA NA
#> 3 <NA> <NA> <NA> PTPRC/iso:CD45R MANUAL_LOOKUP 1 1
In the above example, we can see that for antibodies with catalogue number “103263”, a match was found if the antibody was named “CD45R_B220” or “CD45R/B220” but not if it was named “CD45R-B220”. We can fill in the missing information using fillByGroup:
fill_demo <- AbNames::fillByGroup(fill_demo, "Cat_Number",
fill = c("HGNC_ID", "TotalSeq_Cat", "Vendor",
"ENSEMBL_ID", "UNIPROT_ID"),
multiple = "mode") %>%
dplyr::group_by(Cat_Number) # Re-group as fillByGroup ungroups
# Print out the first group again
AbNames::showGroups(fill_demo, interactive = FALSE)
#> Group 1 of 21: 3 rows
#> ID Antigen Cat_Number HGNC_ID
#> 1 CD45R_B220__PomboAntunes_2021 CD45R_B220 103263 HGNC:9666
#> 2 CD45R-B220__PomboAntunes_2021 CD45R-B220 103263 HGNC:9666
#> 3 CD45R/B220__Qian_2020 CD45R/B220 103263 HGNC:9666
#> Study Clone Oligo_ID TotalSeq_Cat Vendor Control RRID
#> 1 PomboAntunes_2021 RA3-6B2 0103 A BioLegend FALSE AB_2734158
#> 2 PomboAntunes_2021 RA3-6B2 0103 A BioLegend FALSE AB_2734158
#> 3 Qian_2020 RA3-6B2 0103 A BioLegend FALSE <NA>
#> Lot Custom_Antibody name value ENSEMBL_ID UNIPROT_ID HGNC_SYMBOL
#> 1 <NA> NA Antigen CD45R ENSG00000081237 <NA> PTPRC
#> 2 <NA> NA <NA> <NA> ENSG00000081237 <NA> <NA>
#> 3 <NA> NA Antigen CD45R ENSG00000081237 <NA> PTPRC
#> ENTREZ_ID BIOTYPE symbol_type ALT_ID SOURCE n_matches n_ids
#> 1 <NA> <NA> <NA> PTPRC/iso:CD45R MANUAL_LOOKUP 1 1
#> 2 <NA> <NA> <NA> <NA> <NA> NA NA
#> 3 <NA> <NA> <NA> PTPRC/iso:CD45R MANUAL_LOOKUP 1 1
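In the above call to fillByGroup, we set multiple = “mode”, so that when the values within a group are inconsistent, the most frequent value is used for filling.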
Now we will fill the gene information in the citeseq data similarly, and check how many antibodies have not been matched.
# We fill by grouping by catalogue number:
citeseq <- citeseq %>%
AbNames::fillByGroup("Cat_Number", multiple = "mode",
fill = c("HGNC_ID", "TotalSeq_Cat", "Vendor",
"ENSEMBL_ID", "UNIPROT_ID"))
nmatched_after_fill <- count_missing(citeseq)
print("Before filling:")
#> [1] "Before filling:"
original_nmatched
#> [1] 80
print("After filling:")
#> [1] "After filling:"
nmatched_after_fill
#> [1] 37
Notes about the gene aliases data set
The gene aliases data set is based on the annotation from the HUGO Gene Nomenclature Committee (HGNC). Annotation databases such as Ensembl and Entrez perform mapping between gene models independently, using different criteria. This means that they do not always agree on mappings between identifiers. In the gene aliases data set the HGNC mappings are given, and were used when joining data from other sources. AbNames is focused on matching names more than identifiers, so we considered that the official naming organisation should be our reference. This also means that using an Ensembl, Entrez or UniProt ID to search for gene aliases may not work.
We found that the HGNC mappings between HGNC and Ensembl gene IDs almost always agreed with those from Ensembl. The HGNC uses Swiss-Prot UniProt IDs. When joining data from other sources, we tried to use two points of agreement, e.g. agreement between one type of ID and the official symbol, or agreement between two types of ID but disagreement about the official symbol.
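The "two points of agreement" rule can be sketched in base R as follows. This is one possible reading of the rule, not the code in data-raw: count_agreement is a hypothetical helper and all symbols and identifiers below are placeholders.

```r
# Toy reference (placeholder identifiers)
ref <- data.frame(
  HGNC_SYMBOL = c("GENE1", "GENE2"),
  ENSEMBL_ID  = c("ENS_1", "ENS_2"),
  ENTREZ_ID   = c("1", "2"),
  stringsAsFactors = FALSE
)

# Toy entries from another source
other <- data.frame(
  HGNC_SYMBOL = c("GENE1", "GENEX", "GENE2"),
  ENSEMBL_ID  = c("ENS_1", "ENS_2", "ENS_9"),
  ENTREZ_ID   = c("7", "2", "8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the largest number of fields (symbol, Ensembl ID, Entrez ID)
# on which one row of "other" agrees with any single row of the reference
count_agreement <- function(row, ref) {
  agree <- cbind(ref$HGNC_SYMBOL == row$HGNC_SYMBOL,
                 ref$ENSEMBL_ID == row$ENSEMBL_ID,
                 ref$ENTREZ_ID == row$ENTREZ_ID)
  max(rowSums(agree, na.rm = TRUE))
}

other$n_agree <- vapply(seq_len(nrow(other)),
                        function(i) count_agreement(other[i, ], ref),
                        numeric(1))

# Accept entries that agree with the reference on at least two fields
other$accepted <- other$n_agree >= 2
other
```

The first entry is accepted on symbol plus Ensembl ID, the second on two IDs despite a different symbol, and the third is rejected because only the symbol agrees.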
The code used for creating all of the data sets is available in data-raw. At the moment, we do not have a main file for regenerating the data sets - this is on our to do list!
TO DO
* An example of (interactively) deciding what to do if groups are inconsistent.
* Function to check new annotations against previous
* Isotype controls
  + Add control column to isotype controls to prevent false match to mouse IgG2a
* CD25 (good example, multiple genes), CD270, CD279 matched in HGNC?
* CD279 PD-1 - example where only some are matched
* Fix Hao Clone CD25 - Is it custom? No Oligo or Cat_Number for this one
* Generate camel case regexp?
* Unintuitive to use quoted for splitUnnest?
* Redocument data sets
* HAO CD45 should have catalogue numbers!
* union join back gene ID / get info by antigen
* Put RRID into CITEseq
* Export and demonstrate group_by_any? e.g. citeseq <- citeseq %>% group_by_any(c("Antigen", "Cat_Number")); showGroups(citeseq, 2, interactive = FALSE); ungroup(citeseq)
* Add tag for non-human tagging antibodies
* Problems:
  + HGNC_SYMBOL WRONG FOR CD158b (KIR2DL2/L3, NKAT2)? (NKAT2 = only one gene)
  + Triana RRID same for CD235a and CD235a - match via combination of RRID and Antigen
  + Triana group 5… - matching because of NA?
* Update BIOMART entrez ids to match entrez / HGNC??
* Check that UNIPROT IDs are valid (not all from CSPA are up to date)
* Change name of data gene_aliases to just aliases
* Rename “value” to “alias” in aliases table
* Final check for ambiguous aliases after adding proteins
* HGNC: TRAV10 = TCRAV24S1, NCBI:TRAV24 = TCRAV24S1
* Be careful of case when removing ambiguous aliases - Met SLTM / MET
* Hao CD25 - same HGNC ID reported twice HGNC:6008|HGNC:6008 and CD45RA
* KLG / MAFA
* Keep source in ID result table?
* Is TCRg clone B1 = TCRg/d?
* across query columns - na_if equal to antigen and not is antigen
* Add to vignette, how to check results in matchToCITEseq
* translit characters in query table
* find replacement dataset for diamonds in test
* make a pipeline for regenerating gene_aliases
* From gene_aliases, remove entries where everything is identical except symbol_type - prefer alias to alias name, previous_symbol to previous_name etc - cellmarker %>% unique()
* cellmarker Alkaline phosphatase maps to 2 IDs
* symbol_type is NA
* biomart gives different entrez ids?
* CD1b has two different HGNC_IDs
* SIGLEC5
* To do: what is the point of “name” in query table?
* why two rows for CD124|IL4RA in alias_results - aggregate source
* export left_join_any
* make a wrapper for group_by_any which ignores clone duplicates
* when study includes same antigen multiple times, give an underscore # suffix?
* Convenience function - find alternative names for an antibody
* To do: indicate cocktail membership
* Should the ALT_ID be the symbol not the ID to allow easier matching?
* Add checks in makeQueryTable, e.g. for brackets?
Session Info
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.10 AbNames_0.2.0 BiocStyle_2.24.0
#>
#> loaded via a namespace (and not attached):
#> [1] bslib_0.4.0 compiler_4.2.1 pillar_1.8.1
#> [4] BiocManager_1.30.18 jquerylib_0.1.4 tools_4.2.1
#> [7] digest_0.6.29 jsonlite_1.8.0 evaluate_0.16
#> [10] memoise_2.0.1 lifecycle_1.0.2 tibble_3.1.8
#> [13] pkgconfig_2.0.3 rlang_1.0.5 DBI_1.1.3
#> [16] cli_3.4.0 yaml_2.3.5 pkgdown_2.0.6
#> [19] xfun_0.33 fastmap_1.1.0 stringr_1.4.1
#> [22] knitr_1.40 desc_1.4.2 generics_0.1.3
#> [25] fs_1.5.2 sass_0.4.2 vctrs_0.4.1
#> [28] systemfonts_1.0.4 tidyselect_1.1.2 rprojroot_2.0.3
#> [31] glue_1.6.2 R6_2.5.1 textshaping_0.3.6
#> [34] fansi_1.0.3 rmarkdown_2.16 bookdown_0.29
#> [37] tidyr_1.2.1 purrr_0.3.4 magrittr_2.0.3
#> [40] ellipsis_0.3.2 htmltools_0.5.3 assertthat_0.2.1
#> [43] ragg_1.2.2 utf8_1.2.2 stringi_1.7.8
#> [46] cachem_1.0.6