Vaccination is one of our most effective weapons against disease. Vaccines are currently being designed for everything from coronaviruses and herpes to cocaine addiction and cockroach allergies. The emerging field of computational immunology is speeding up the discovery of vaccine candidates for COVID-19 and many other diseases; many open-source programs are available for this purpose.

The origins of vaccination are murky. The first written records of inoculation—or induction of immunity—date back to the 1500s, and suggest that the practice was likely first used to combat smallpox (possibly in China, Africa, India, or Turkey).

One pioneering method involved sampling material from a smallpox sore and scratching it into someone to vaccinate them. Another early inoculation strategy was to grind up smallpox scabs and blow the ground-up scabs up a person’s nose—hopefully after giving them a little warning. While this all sounds really uncomfortable, it probably seemed like a small price to pay to avoid a disease that claimed about 300 million lives during the 20th century alone.

The first bona fide vaccine against an infectious disease was introduced in 1796, when a British doctor named Edward Jenner found that infection with the relatively mild cowpox virus could make people immune to smallpox.

To date, vaccines have prevented many, many people from dying and getting sick. As an added bonus, they are now administered through far less horrifying methods!

The dramatic impact of vaccines. For all the diseases listed, the number of cases has dropped by over 90%.

Before diving into the ocean of computational immunology, it is wise to dip a toe in the pond of regular immunology—your immune system! This is really interesting stuff, but if you haven’t studied it before, it can seem like a lot of jargon. Fear not! If you find yourself confused or frustrated, skip ahead to the section “A Whirlwind Tour of the Immune System.”  A time-saving list of terms and definitions is also included at the end of the article. These terms appear in bold in the text.  

Computational immunology and vaccine development

Vaccines are an indispensable tool for promoting human and animal health. However, safe and effective vaccines  are still needed for many diseases; the recent coronavirus pandemic has brought this issue back into the spotlight.

High throughput (or next generation) sequencing methods allow entire genomes to be sequenced at once, providing an unprecedented opportunity to discover new vaccine candidates. However, it is too expensive and time-consuming to vet every possible vaccine candidate in animal models. The field of computational immunology is rapidly emerging as a solution to this big data problem.

Reverse vaccinology

Traditional, forward vaccinology involves vaccinating animals using  weakened or altered pathogens, then exposing the animals to the pathogen and measuring the effectiveness of the vaccine.

Reverse vaccinology is a prominent computational immunology method that instead begins with the complete genome sequence of a pathogen. The pathogen genome is analyzed for proteins with features that make them more likely to cause an immune response. The most promising candidates can then be vetted using traditional wet lab techniques.

One of the unique things about reverse vaccinology is that the first application of the method actually led to the development of a licensed vaccine. A 2000 study authored by someone with the last name Pizza[1] identified and validated 28 proteins as vaccine candidates for Group B meningococcus (MenB), the primary cause of sepsis and meningitis in children and young adults. Five of these vaccine candidates were used to formulate the Bexsero vaccine, which has been licensed in the United States and Europe. In under 18 months, reverse vaccinology helped identify more vaccine candidates for MenB than conventional methods uncovered in 40 years.

Today, many open-source reverse vaccinology programs are available. These programs are distinguished by the algorithms they use and the type of data they accept as inputs.

Many reverse vaccinology programs use rule-based filtering and/or machine learning classification algorithms.
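To make the idea of rule-based filtering concrete, here is a minimal sketch in Python. The protein names, annotation values, and cutoffs are all hypothetical and chosen only for illustration; real tools define their own rules and thresholds.

```python
from dataclasses import dataclass


@dataclass
class ProteinAnnotation:
    name: str
    adhesin_probability: float  # 0..1, as reported by some adhesin predictor
    transmembrane_helices: int  # predicted number of transmembrane helices
    similar_to_host: bool       # True if the protein resembles a host protein


# Hypothetical cutoffs, for illustration only.
MIN_ADHESIN_PROBABILITY = 0.5
MAX_TRANSMEMBRANE_HELICES = 1


def passes_rules(protein):
    """Return True if the protein satisfies every filtering rule."""
    return (
        protein.adhesin_probability >= MIN_ADHESIN_PROBABILITY
        and protein.transmembrane_helices <= MAX_TRANSMEMBRANE_HELICES
        and not protein.similar_to_host
    )


def filter_candidates(proteins):
    """Keep the names of proteins that pass all rules."""
    return [p.name for p in proteins if passes_rules(p)]


# Made-up annotations for four made-up proteins.
proteins = [
    ProteinAnnotation("protein_A", 0.82, 0, False),
    ProteinAnnotation("protein_B", 0.91, 4, False),  # too many helices
    ProteinAnnotation("protein_C", 0.30, 0, False),  # unlikely adhesin
    ProteinAnnotation("protein_D", 0.75, 1, True),   # resembles host protein
]
candidates = filter_candidates(proteins)
print(candidates)
```

A machine learning classifier replaces the hand-written rules with a model that learns from labeled examples which combinations of features matter.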

Using reverse vaccinology to identify coronavirus vaccine candidates

Background: Of the seven known human coronaviruses, four cause only mild symptoms. The other three, more virulent strains are SARS-CoV, MERS-CoV, and SARS-CoV-2, the virus responsible for the COVID-19 pandemic.

Coronaviruses are positive-sense, single-stranded RNA viruses, and have the largest known genomes of any RNA virus. The coronavirus genome is packed within the nucleocapsid protein (N) and surrounded by the membrane protein (M), envelope protein (E), and spike protein (S).


Two main strategies have been used to design coronavirus vaccines:

  • Whole virus vaccines, including inactivated and live attenuated vaccines. Live attenuated vaccines use a weakened form of the virus, while inactivated vaccines use virus particles that have been killed (for example, with heat or chemicals) so they can no longer replicate.
  • Genetically engineered vaccine antigens (antigens are molecules on a pathogen that cause an immune response), which often target specific coronavirus proteins. Coronavirus antigens, such as the S, N, and M proteins, have been used in recombinant DNA vaccines and viral vector vaccines.

While these vaccines offer some protection against SARS and MERS in animal models, they may not offer complete protection, which we can likely all agree is a goal. There may also be safety concerns, particularly for vulnerable groups[2]. As such, novel vaccine strategies are needed. Enter reverse vaccinology.

Dr. Yongqun He and other researchers at the University of Michigan recently used Vaxign (a reverse vaccinology tool) and Vaxign-ML (a machine learning tool) to predict novel targets for a COVID-19 vaccine.

The Vaxign reverse vaccinology pipeline

Vaxign[3] is the first web-based reverse vaccinology program. It combines internally-developed algorithms with several open-source tools optimized to predict vaccine targets. To integrate these diverse tools, a MySQL relational database stores output data from one program and uses SQL query scripts to process the data as input into another program.
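The database handoff described above can be sketched roughly as follows. This is not Vaxign's schema or code: the table names, column names, values, and cutoffs are invented, and Python's built-in sqlite3 module stands in for MySQL so the example runs without a database server.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")

# Hypothetical tables: each holds the output of one prediction tool.
conn.execute(
    "CREATE TABLE adhesin_results ("
    "protein_id TEXT PRIMARY KEY, adhesin_probability REAL)"
)
conn.execute(
    "CREATE TABLE tm_results ("
    "protein_id TEXT PRIMARY KEY, helix_count INTEGER)"
)

# Store made-up output from two upstream tools.
conn.executemany(
    "INSERT INTO adhesin_results VALUES (?, ?)",
    [("p1", 0.9), ("p2", 0.2), ("p3", 0.7)],
)
conn.executemany(
    "INSERT INTO tm_results VALUES (?, ?)",
    [("p1", 0), ("p2", 0), ("p3", 5)],
)

# A query script combines both outputs and selects the proteins
# that should be passed on to the next program in the pipeline.
query = """
SELECT a.protein_id
FROM adhesin_results AS a
JOIN tm_results AS t ON t.protein_id = a.protein_id
WHERE a.adhesin_probability >= ? AND t.helix_count <= ?
ORDER BY a.protein_id
"""
next_stage_input = [row[0] for row in conn.execute(query, (0.5, 1))]
conn.close()
print(next_stage_input)
```

Keeping intermediate results in tables means each tool only needs to read and write rows, rather than understand the output format of every other tool.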

To search for promising candidates for SARS-CoV-2 vaccines, Dr. He's team input the full proteomes (protein sequences) of the seven known human coronavirus strains into the Vaxign reverse vaccinology pipeline, which is outlined in the table below.



Vaxign-ML outputs a protegenicity score—which is basically a prediction of how well a vaccine candidate induces protective immunity.

The Vaxign-ML model was built by training five supervised machine learning (ML) classification algorithms:

  1. Logistic regression
  2. Support vector machine (SVM)
  3. K-nearest neighbor (KNN)
  4. Random forest
  5. Extreme gradient boosting (XGB)

The best-performing algorithm (XGB) was used to calculate a protegenicity score for each SARS-CoV-2 protein.

Training Data: The training data included antigens that were protective in at least one animal experiment. Specifically, the positive samples included 397 bacterial and 178 viral protective antigens (PAgs) from the Protegen database (after removing proteins with over 30% sequence identity). The negative samples were extracted from UniProt proteomes based on sequence dissimilarity to the PAgs.

Data Annotation: Each protein sequence in the resulting training data set was annotated with two categories of features: biological and physicochemical. In total, 509 features were annotated for each protein sequence.

Biological features included adhesin probability[4], transmembrane helices[5], and immunogenicity[6] (the ability to induce an adaptive immune response).

Examples of physicochemical features include charge, polarity, and hydrophobicity; these features were predicted using Propy.
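To give a feel for what a physicochemical feature looks like, the sketch below computes two simple descriptors in plain Python. It does not use Propy and does not reproduce the study's feature set: it averages the widely used Kyte-Doolittle hydropathy scale and estimates net charge by counting charged residues (ignoring histidine and the chain termini). The example peptide is made up.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE_RESIDUES = set("KR")  # lysine, arginine
NEGATIVE_RESIDUES = set("DE")  # aspartate, glutamate


def mean_hydropathy(sequence):
    """Average Kyte-Doolittle value; higher means more hydrophobic."""
    seq = sequence.upper()
    if not seq or set(seq) - set(KYTE_DOOLITTLE):
        raise ValueError("sequence must contain only standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def approximate_net_charge(sequence):
    """Crude net charge: (K + R) minus (D + E)."""
    seq = sequence.upper()
    positive = sum(aa in POSITIVE_RESIDUES for aa in seq)
    negative = sum(aa in NEGATIVE_RESIDUES for aa in seq)
    return positive - negative


def describe(sequence):
    """Bundle the descriptors into a feature dictionary."""
    return {
        "length": len(sequence),
        "mean_hydropathy": mean_hydropathy(sequence),
        "approx_net_charge": approximate_net_charge(sequence),
    }


features = describe("MKKLLDEV")  # made-up peptide
print(features)
```

A feature vector for machine learning is essentially a long list of numbers like these, computed the same way for every protein.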

Comparison of Algorithm Performance:  The annotated protein database was used to train the five ML classification algorithms. The performance of all five algorithms was evaluated using nested five-fold cross-validation. The best-performing model (XGB) was named Vaxign-ML and used to predict protegenicity scores for all SARS-CoV-2 proteins. A protegenicity score greater than 90 indicates a strong vaccine candidate.
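Nested cross-validation can be sketched in a few lines of plain Python. The example below is an illustration under several assumptions: it uses a toy one-dimensional data set generated at random, a bare-bones k-nearest neighbor classifier as the only model, and the number of neighbors as the only setting being tuned. None of these choices come from the study. What matters is the structure: an inner loop picks the setting using training data only, and an outer loop scores that choice on data it has never seen.

```python
import random


def k_folds(indices, n_folds):
    """Split a list of indices into n_folds interleaved folds."""
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def cv_accuracy(data, indices, k, n_folds):
    """Mean accuracy of k-NN over n_folds folds of the given indices."""
    folds = k_folds(indices, n_folds)
    scores = []
    for held_out, fold in enumerate(folds):
        fit = [data[j] for n, f in enumerate(folds) if n != held_out for j in f]
        val = [data[j] for j in fold]
        scores.append(accuracy(fit, val, k))
    return sum(scores) / len(scores)


def nested_cv(data, candidate_ks=(1, 3, 5), n_folds=5):
    """Outer loop estimates performance; inner loop tunes k."""
    outer = k_folds(list(range(len(data))), n_folds)
    outer_scores = []
    for held_out, test_idx in enumerate(outer):
        train_idx = [j for n, f in enumerate(outer) if n != held_out for j in f]
        # Inner cross-validation sees only the outer training indices.
        best_k = max(
            candidate_ks,
            key=lambda k: cv_accuracy(data, train_idx, k, n_folds),
        )
        train = [data[j] for j in train_idx]
        test = [data[j] for j in test_idx]
        outer_scores.append(accuracy(train, test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Synthetic data: (feature, label) pairs from two well-separated groups.
rng = random.Random(0)
data = [(rng.gauss(0, 1), 0) for _ in range(50)]
data += [(rng.gauss(5, 1), 1) for _ in range(50)]
rng.shuffle(data)

score = nested_cv(data)
print(f"nested CV accuracy: {score:.2f}")
```

Because the held-out fold in the outer loop plays no part in tuning, the resulting estimate is less optimistic than one produced by tuning and testing on the same folds.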

Vaxign-ML Output

Protegenicity score = (cl + 0.5fi)/N * 100

Here, cl is the count of all the scores from the Vaxign-ML prediction that are lower than the score of interest; fi is the frequency of the score of interest;  N is the number of samples in the original data.
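The formula above is a percentile rank, and it can be written as a short Python function. In this sketch, reference_scores is assumed to be the list of raw prediction scores for the original data, and the numbers in the example are invented.

```python
def protegenicity_score(score, reference_scores):
    """Percentile rank of `score` within `reference_scores`.

    cl: how many reference scores are lower than the score of interest
    fi: how many reference scores equal the score of interest
    N:  total number of reference scores
    """
    n = len(reference_scores)
    if n == 0:
        raise ValueError("reference_scores must not be empty")
    cl = sum(1 for s in reference_scores if s < score)
    fi = sum(1 for s in reference_scores if s == score)
    return (cl + 0.5 * fi) / n * 100


# Invented raw scores for five reference samples.
reference = [0.1, 0.2, 0.2, 0.5, 0.9]

for raw in (0.2, 0.9):
    print(raw, protegenicity_score(raw, reference))
```

For the raw score 0.2, one reference score is lower (cl = 1) and two are equal (fi = 2), so the result is (1 + 0.5 × 2) / 5 × 100, or 40. A raw score that beats most of the reference data lands near 100, which is why high values flag strong candidates.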

Findings: Vaxign reverse vaccinology and Vaxign-ML

Vaxign reverse vaccinology identified six proteins as likely adhesins. One of these was the S protein, which is already being explored in several clinical trials (including at least two for SARS-CoV-2). The other five proteins—nsp3, 3CL-pro, nsp8, nsp9, and nsp10—are nonstructural, meaning they aren’t part of the virion (the complete virus particle).

The S protein had the highest protegenicity score (97.623), consistent with many other experiments that implicate this protein in coronavirus pathogenesis. Nsp3 had the second-highest protegenicity score (95.283), and is not currently being investigated in clinical trials.

Dr. He's study highlighted nonstructural proteins, such as nsp3, as promising coronavirus vaccine candidates. These findings need to be tested in additional experiments, such as animal models. While structural proteins are commonly used as viral vaccine candidates, nonstructural proteins also correlate with vaccine protection. The researchers also propose a cocktail vaccine containing both S and nsp proteins. Cocktail vaccines include more than one antigen, and can improve vaccine safety and efficacy.

A whirlwind tour of the immune system

If all this talk about pathogens, antigens, epitopes, paratopes, and antibodies has you feeling a little ill, you’ve come to the right place. Welcome to the immune system.

The immune system is your defense against pathogens. A pathogen is something that causes disease, but the term is typically used to refer to infectious things—like viruses, bacteria[7], and fungi.

Most vaccines work by causing certain white blood cells to produce antibodies. Antibodies—also called immunoglobulins—are Y-shaped proteins that tag pathogens, allowing other immune cells to find and destroy them. Sometimes, antibodies can directly neutralize a pathogen by interfering with its ability to enter cells, replicate, or carry out other functions.

Antibodies recognize pathogens by binding to molecules called antigens (“antibody generators”), often found on the surface of the pathogen. Simply put, antigens cause the immune system to respond.

The portion of the antigen that binds to the antibody is called the epitope, while the part of the antibody that binds to the antigen is called the paratope.


For vertebrates—those of us with spines—the immune system consists of two primary systems: innate immunity and adaptive immunity.

Innate immunity

The innate immune system is the body’s first line of defense against a pathogen, and its primary purpose is to keep pathogens outside the body. Innate immunity is non-specific. The innate immune system can respond very quickly to a general threat, but does not identify or “remember” a pathogen.

The innate immune system includes physical barriers (e.g., the skin), bodily secretions (e.g., mucus, bile, saliva, sweat, and tears), and general immune responses (e.g., inflammation and allergic reactions). Many types of white blood cells participate in innate immune responses.


Adaptive immunity

When innate immune defenses are not enough to keep a pathogen outside the body, adaptive immunity comes into play.

Unlike the innate immune system, the adaptive immune system responds to specific—rather than general—threats and “learns” through experience. Adaptive immunity is more of a long-game immune strategy.

Two types of white blood cells—B cells and T cells—are the stars of the adaptive immune system.


There are two primary types of adaptive immunity: humoral immunity and cell-mediated immunity.

Humoral Immunity

When pathogens have entered bodily fluids (i.e., humors), but have not yet infiltrated cells, they must deal with humoral immunity. Humoral immunity depends on antibodies.

Antibodies are produced by B cells. One of the fascinating things about B cells is that each B cell creates only one species of antibody, out of billions of possible species. This characteristic allows the immune system to respond to a highly versatile range of threats.


B cells share the same DNA, so how can they produce distinct antibodies? Antibodies have a variable portion that differs among antibody species. The variable portion can take many forms, depending on the outcome of genetic recombination during cellular development. Recombination essentially shuffles the section of the DNA that encodes the variable portion of the antibody, allowing for many possible antibody species to be created.

For a B cell to be activated (allowing it to clone itself), the paratope of one of its membrane-bound antibodies must bind to the epitope of a “matching” antigen. Additionally, B cells typically must be stimulated by another type of immune cell—called a helper T cell—to be activated.

Cell-mediated Immunity

When a pathogen has infiltrated cells, cell-mediated immunity kicks in. Cell-mediated immunity does not involve antibodies, but rather the activation of T cells and other immune cells in response to an antigen.


Although most vaccines work by causing B cells to create antibodies, T cells also play a key role in some vaccines. Similar to the antibodies on B cells, the T cell receptors (TCRs) on T cells have variable portions.

In addition to TCRs, T cells also have CD4 or CD8 proteins on their surfaces. Most CD4+ T cells are helper T cells, while most CD8+ T cells are cytotoxic T cells.

Antigen-Presenting Cells

It is necessary to introduce another detail here. B cells, along with macrophages and dendritic cells, are antigen-presenting cells (APCs). In other words, they display fragments of a pathogen on their surfaces, so the fragments can be recognized by other cells.

When a membrane-bound B cell antibody binds to an antigen, the B cell devours and breaks down the antigen and attached antibody. Within the B cell, fragments of the antigen are attached to special proteins called major histocompatibility complexes (MHCs). The MHC and antigen fragment are presented on the surface of the B cell.

Two types of MHCs are presented by APCs: MHC I and MHC II. B cells (along with dendritic cells and macrophages) display MHC II, which can be recognized by helper T cells to complete B cell activation.

While MHC II attracts helper T cells, MHC I attracts killer (cytotoxic) T cells. MHC I is expressed by all cells with a nucleus (in humans, this includes basically all cells except red blood cells). Any cell—even a cancerous cell—can present some of its proteins with MHC I on its surface. This display can indicate to killer T cells that a cell is damaged and needs to be destroyed.

Closing remarks

Emerging infectious diseases, vulnerable populations, and the need for personalized vaccines present serious public health challenges. However, advances in sequencing technologies have provided a wealth of data about pathogens, creating many new opportunities to solve these problems.

Reverse vaccinology and other computational methods have already led to the development of novel vaccines. These techniques are quickly making their mark in the public health arena, and will likely speed up the process of vaccine discovery.

Are you using computer science to fight disease? Is there a particular tool or algorithm you would like to see explained or demonstrated? Let me know at


[1] I haven’t been this jealous of a researcher’s name since I came across Lord Brain.
[2] And definitely for ferrets, in which one of these vaccines was linked to increased liver pathology.
[3] Vaxign source code for version 2.0 (beta) can be found here. The Vaxign reverse vaccinology web interface is available here.
[4] Predicted using SPAAN (neural networks).
[5] Predicted using  PSORTB (Hidden Markov Model-based method HMMTOP).
[6] Predicted using the Vaxitop IEDB consensus method (Position Specific Scoring Matrix).
[7] Specifically, harmful bacteria.  


Bassaganya-Riera, Josep, and Raquel Hontecillas. “Introduction to Computational Immunology.” Computational Immunology, 2016, 1–8.

Boylston, Arthur. “The Origins of Inoculation.” Journal of the Royal Society of Medicine 105, no. 7 (2012): 309–13.

Clark, Mary Ann, Jung Ho Choi, and Matthew M. Douglas. “42.2 Adaptive Immune Response.” In Openstax Biology 2e. Houston, TX: OpenStax, Rice University, 2018.

He, Yongqun, Zuoshuang Xiang, and Harry L. T. Mobley. “Vaxign: The First Web-Based Vaccine Design Program for Reverse Vaccinology and Applications for Vaccine Development.” Journal of Biomedicine and Biotechnology 2010 (2010): 1–15.

“Immune System.” Khan Academy. Khan Academy, n.d.

María, Ribas‐Aparicio Rosa, Castelán‐Vega Juan Arturo, Jiménez‐Alberto Alicia, Monterrubio‐López Gloria Paulina, and Aparicio‐Ozores Gerardo. “The Impact of Bioinformatics on Vaccine Design and Development.” Vaccines, June 2017.

Oli, Angus Nnamdi, Wilson Okechukwu Obialor, Martins Ositadimma Ifeanyichukwu, Damian Chukwu Odimegwu, Jude Nnaemeka Okoyeh, George Ogonna Emechebe, Samson Adedeji Adejumo, and Gordon C Ibeanu. “Immunoinformatics and Vaccine Development: An Overview.” ImmunoTargets and Therapy 9 (2020): 13–30.

Ong, Edison, Haihe Wang, Mei U Wong, Meenakshi Seetharaman, Ninotchka Valdez, and Yongqun He. “Vaxign-ML: Supervised Machine Learning Reverse Vaccinology Model for Improved Prediction of Bacterial Protective Antigens.” Bioinformatics, 2020.

Ong, Edison, Mei U Wong, Anthony Huffman, and Yongqun He. “COVID-19 Coronavirus Vaccine Design Using Reverse Vaccinology and Machine Learning,” 2020.

Orenstein, Walter A., and Rafi Ahmed. “Simply Put: Vaccination Saves Lives.” Proceedings of the National Academy of Sciences 114, no. 16 (October 2017): 4031–33.

Key Terms

If you decide to explore computational immunology, these are some terms you will likely encounter.

Adaptive Immunity: acquired, antigen-specific immune responses; one of two main vertebrate immune strategies
Adhesin: proteins or appendages on the surface of a cell that allow the cell to attach to other cells or surfaces
Antibodies: specialized, Y-shaped proteins that neutralize or tag a pathogen for destruction by binding to its antigen
Antigens: molecules found on pathogens that stimulate an immune response
Antigenicity: the ability of a substance to cause specific antibodies to be produced
Antigen-presenting cells (APCs): cells that display antigens attached to major histocompatibility complexes (MHCs) on their surfaces; also called accessory cells
B cells: white blood cells that are produced and mature in bone marrow, and contribute to adaptive immunity by producing antibodies and aiding in immunological memory
Cell-mediated immunity: adaptive immune defense, in which foreign cells are destroyed by T cells
Chemokines: a type of cytokine that infected cells release to trigger an immune response and alert neighboring cells
Complement system: part of the immune system that enhances the ability of antibodies and phagocytes to fight pathogens
Cytokines: molecules that cells use to communicate (e.g., to initiate an immune response or trigger cell movement)
Epitope: the portion of an antigen that attaches to an antibody
Etiological: causing or contributing to disease
High throughput sequencing (HTS): scalable genome sequencing techniques that allow entire genomes to be sequenced at once; also called next-generation sequencing (NGS)
HLAs: human leukocyte antigens; genes that encode the major histocompatibility complex (MHC) proteins in humans; HLA-A, HLA-B, and HLA-C correspond to MHC I; HLA-DP, HLA-DM, HLA-DO, HLA-DQ, and HLA-DR correspond to MHC II
Helper T cells: adaptive immune cells that help activate B cells and cytotoxic T cells
Humoral immunity: an adaptive immune defense that depends on antibodies
IgG: the most common type of antibody, created and released by plasma B cells
Immunogenicity: the ability of a substance to induce a humoral (antibody-dependent) or cell-mediated immune response
Innate Immunity: non-specific, fast-acting immune defense mechanisms; one of two main vertebrate immune strategies
In silico: describes research carried out through computational modeling
In vitro: describes research occurring outside of a living organism, for example, in a test tube or culture dish
In vivo: describes research occurring inside of a living organism
MHC I: a class of major histocompatibility complex (MHC) molecules, found on the surface of all cells with a nucleus, that display peptide fragments of an antigen to cytotoxic T cells in order to trigger an immune response; one of two major classes of major histocompatibility complex
MHC II: a class of major histocompatibility complex (MHC) molecules, typically found only on professional antigen-presenting cells and a few other cell types, that interact with immune cells (like helper T cells) to elicit an immune response; one of two major classes of major histocompatibility complex
Naive B and T cells: B and T cells that have never been activated and are not memory cells or effector cells
Nonstructural protein: A protein that is encoded by the genome of a virus, but is not part of the viral particle
PAgs: protective antigens
Paratope: the portion of an antibody that binds to an antigen
Pathogens: disease-causing agents, typically infectious ones such as viruses, harmful bacteria, and fungi
Peptide: molecules consisting of between two and fifty amino acids, distinguished from proteins by their smaller size and less defined structure
Phagocytes: “cell-eaters”; cells that can engulf small cells and particles
Physicochemical properties: physical and chemical properties of a substance
Professional antigen-presenting cells: macrophages, B cells, and dendritic cells; cells that present antigens to helper T cells
Protegenicity: protective antigenicity; the ability of a substance to elicit a protective immune response by stimulating the production of specific antibodies
Proteome: the complete set of proteins that can be expressed by a genome, cell, tissue, or organism at a given time
Recombinant vaccines: vaccines produced via recombinant DNA technology, in which the DNA encoding an antigen is inserted, expressed, and purified in bacterial or mammalian cells
Reverse vaccinology: a vaccinology method that uses bioinformatics approaches to screen pathogen genomes for optimal vaccine targets
Sequence conservation: the presence of similar or identical DNA, RNA or protein sequences across species or within a genome
Sequence motif: a pattern of nucleotides or amino acids in a sequence that has a specific function
Subcellular localization: a prediction of where a protein is located in a cell
Supertype: a group of HLA alleles that bind to a similar set of peptides
T cells: white blood cells that are produced in bone marrow and mature in the thymus; helper T cells assist B cells, while killer (or cytotoxic) T cells directly kill infected cells
T cell receptors (TCRs): proteins found on T cells that recognize antigen fragments bound to major histocompatibility complexes (MHCs)
Transmembrane domain: a region of a protein that crosses the cell membrane
Vaccine: a killed or weakened pathogen (or pathogen fragment) that elicits an immune response when introduced into the body
Viral vector vaccines: vaccines that use a modified, harmless virus (the vector) to deliver genetic material encoding an antigen from the target pathogen, provoking an immune response to that antigen
Virion: a complete viral particle, consisting of a DNA or RNA core and a capsid
Virus: a particle—classified as non-living—that contains protein, as well as DNA or RNA, and infects living cells  

