Biome Informatics

Train of Thought: Models of Bio-Computation

2021-04-05T00:00:00-07:00

This morning I read Scott Aaronson’s post musing on the computational expressiveness of his son’s toy train set. This immediately triggered for me a flood of memories about my undergraduate days at Cal, where Scott and I overlapped for a few years (while he was younger than me, he was already in grad school). While there, I had the great fortune of being a researcher in Adam Arkin’s computational biology lab, where I could soak up mind-expanding ideas from the grad students, post-docs, and staff researchers about biological computing¹, and computing about biology.

One area of interest was in determining what kinds of computation could be performed by cells or biochemical reaction networks²,³. While it has been shown that some cells have NAND-logic-like behavior⁴, it seemed that this was limiting, because there are only so many genes in a cell with that kind of behavior, and there are many other tricks up the sleeve of cells and DNA. It struck me that all of the string manipulation tricks that a genome can perform might provide a more expressive form of computation. While I did find one paper⁵ that tackled that directly, I thought that an analogous problem must have gathered much more attention, and thus there might be a richer vein of material to mine there.

I thought that the various proteins like RNA polymerase that process along the strands of DNA in a cell might be viewed as a train moving along a train track, where “stops” would be analogous to transcribing a gene⁶. I thought that all sorts of interesting modifications can happen to the “track”, like inversions. I dug for references online, and I came across this gem, which uses the common elements of train track switching to explore the universal computation capabilities of sufficiently designed train tracks: “Train Sets” by Adam Chalcraft and Michael Greene.

Alas, I lacked the training (no pun intended) to soak up all of this information and do anything significant with it. To this day I remain fascinated with understanding how the cell meshes “analog” computation, in the form of biochemical signal processing⁷, with the “digital” computation expressed in the nucleotide sequence of the genome.

To track down the link above, all I had was a dusty print-out that I had miraculously saved from my undergraduate days, with no further bibliographic information. Searching for the title and authors, I came across some related pages/articles:

Hopefully someone will find this assemblage of links amusing, or useful.

Lack of COVID-19 risk mitigation at SF GoodWill

2020-08-07T00:00:00-07:00

I don’t normally like complaining on the Internet, but I felt that this situation was particularly egregious, and risks people’s health needlessly.

This morning, after months of putting off donating unwanted home goods, I drove to the GoodWill drop-off center on Wisconsin Street in San Francisco. There was a “supervisor” who waved me in. Despite there being only one car in the parking lot, he had me park right behind that car. It seemed too close, but I figured that I could keep 6 ft. of separation as long as the people in front of me didn’t come near my trunk, where I would be unloading ~10 bags. Sensing that this would not be a fast drop-off, I donned my N95, and put a bandanna over it as well to cover the exhaust valve.

The first thing that I noticed was that, while having masks, many of the workers, including the supervisor, didn’t cover their nose with the mask, which defeats the purpose of the mask. Also, the workers in blue shirts weren’t keeping 6 feet of separation between themselves.

Secondly, I noticed that they had placed about 20 huge bins, ten on a side, on the two sides of the parking lot. So instead of a quick drop-off, now they were asking people to unpack all of their bags, and then run around the parking lot trying to figure out where to put things. The signs were very small, so you had to walk around to figure out what each one said. This seemed to take what, from a donor’s perspective, would be a low-risk quick drop-off of goods, to something that would require you to spend a fair amount of time unpacking your bags, and sorting by walking around the parking lot, which was quickly filling up behind me with more cars. It would be increasingly hard to maneuver around the bins and cars while maintaining 6 feet of separation between the workers and other donors milling about.

While I was asking the supervisor (the one with bare nostrils) about what goes where, I put my hand up to my ear as in “I can’t really hear you”, because there was a construction project right behind him, with a jack-hammer going off. He decided to come a bit closer to me, within 6 feet. So I backed up instead of telling him to back up, to try to avoid any misunderstandings. He instead took offense, saying, “Do you have a problem?”. I replied back to him, “No, it’s just that you’re not maintaining six feet of space.” He then replied, “Well, then you came to the wrong place, because here it’s real close.” By that point, there were five cars in the parking lot, all parked almost bumper-to-bumper.

At that point I said, “… then I’m leaving”; I no longer felt comfortable in that situation. While I want to support GoodWill’s mission, what they were doing there was destructive and harmful to their donors and to their workers.

This experience has given me little hope that we can keep the economy open. The SF Chronicle recently published an article saying that a study found that in a random group of 25 people in SF, there was a 34% chance that one of them has a SARS-CoV-2 infection. If businesses are allowing people to aggregate in large numbers without proper mask-wearing (i.e., covering your mouth and nose), and not caring about maintaining 6 feet of separation, then the spread of the pandemic will not be curtailed. I hope by writing this up I can bring this to the attention of GoodWill and the City of San Francisco, because my experience today was unacceptable from a public health perspective.

Design Flaws in COVID-19 Primers from Multiple International Labs

2020-04-05T00:00:00-07:00

Summary

Not only do the CDC COVID-19 primer-probe sets have design flaws, but other publicly-available primer-probe sets from government labs around the globe have their own flaws as well.

Methodology

I sought to perform detailed primer design analysis on all available COVID-19 primer-probe sets to allow for a fair comparison of whether such flaws are widespread, or particular to the CDC’s designs. I collated a set of primer-probe sets as found in a pre-print by Jung et al. and Corman et al. into a FASTA file available here. I applied the same analysis approach as described in my original post regarding COVID-19 design flaws unless otherwise noted. One complication I encountered with the expanded set of primers was the presence of ambiguous bases in some oligos, sometimes more than one per oligo. This is an issue because Primer3 cannot analyze such sequences. I handled it by expanding each such sequence into all implied combinations of unambiguous bases, and analyzing each expanded sequence with Primer3. In the results table, you will note the ID sub-field of “subset”: “subset-0” means that the original sequence was used as-is; “subset-N” for N=1,2,3,… denotes the sequences enumerated from an original sequence with ambiguous bases.
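The expansion of ambiguous bases can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names, the example oligo, and the exact way “subset” IDs are attached are assumptions; only the standard IUPAC nucleotide codes are taken as given.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguous(oligo):
    """Return every unambiguous sequence implied by an oligo with IUPAC codes."""
    choices = [IUPAC[base] for base in oligo.upper()]
    return ["".join(combo) for combo in product(*choices)]


def label_subsets(name, oligo):
    """Attach 'subset-N' IDs (hypothetical naming, modelled on the results table)."""
    expanded = expand_ambiguous(oligo)
    if len(expanded) == 1:
        # No ambiguous bases: the original sequence is used as-is.
        return [(f"{name}-subset-0", expanded[0])]
    return [(f"{name}-subset-{i}", seq) for i, seq in enumerate(expanded, start=1)]


if __name__ == "__main__":
    # Hypothetical oligo with two ambiguous positions (R = A/G, Y = C/T).
    for seq_id, seq in label_subsets("demo", "ACRTYG"):
        print(seq_id, seq)
```

An oligo with k two-fold ambiguous positions expands into 2^k sequences, which is why a single degenerate oligo can account for 16 or 32 rows in the results.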

Results

The comparison table of all primer-probe sets versus a full set of tests for the forward primer, reverse primer, the hybridization probe, and combinations thereof is available on the COVID-19 Primer Data BitBucket repository. In the CSV file, values surrounded by “**” indicate a value that is above or below a quality control threshold (i.e., a design flaw).
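For readers who would rather scan the CSV programmatically, here is a small sketch that lists the cells marked with “**”. The column names and values in the example are hypothetical; only the “**” convention comes from the report itself.

```python
import csv
import io


def flagged_cells(csv_text):
    """Yield (row_number, column_name, value) for cells wrapped in '**'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # skip missing or surplus fields
            value = value.strip()
            if len(value) > 4 and value.startswith("**") and value.endswith("**"):
                yield row_number, column, value[2:-2]


if __name__ == "__main__":
    # Hypothetical excerpt; the real report has different columns.
    sample = (
        "id,probe_hairpin_tm,probe_minus_primer_tm\n"
        "set-A,**55.1**,7.2\n"
        "set-B,30.0,**3.4**\n"
    )
    for row_number, column, value in flagged_cells(sample):
        print(f"row {row_number}: {column} = {value}")
```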

In the next sections I will describe design flaws detected, organized by the origin of the primer-probe sets (in no particular order):

CDC

In my initial analysis of the CDC primers, I restricted myself to just the primers targeting the SARS-CoV-2 genome. In this latest analysis, I included the control primer-probe set targeting the RdRP gene:

The forward primer has five out of five of the 3’ bases as G/C (GC clamp)

China

China-N: A high-temperature hairpin loop in the reverse primer
China-N: A high-temperature hairpin loop in the hybridization probe
China-N: The hybridization probe’s Tm is not at least 6 degrees Celsius higher than the primers’ Tm
ORF1ab: The hybridization probe has a high-temperature hairpin loop

EU-Drosten

EU-Drosten-N’s reverse primer’s 3’ end GC clamp has four G/C bases in the last five base positions
EU-Drosten-N’s hybridization probe has a high-temperature hairpin loop
EU-Drosten-N’s probe Tm is not at least 6 degrees Celsius higher than the primers’ Tm
EU-Drosten-RdRP-P1 and EU-Drosten-RdRP-P2: forward primer has five out of five 3’ bases as G/C (GC clamp)
EU-Drosten-RdRP-P1 probe: 20 out of 32 sequences have high-temperature hairpin loops
EU-Drosten-RdRP-P1: 28 out of 32 explicit primer-probe sets have less than 6 degrees Celsius difference between probe Tm and primer Tm
EU-Drosten-RdRP-P2 probe: has a very high hairpin loop Tm of 79.14 degrees Celsius
EU-Drosten-RdRP-P2: explicit primer-probe sets have less than 6 degrees Celsius difference between probe Tm and primer Tm

Hong Kong

HKU-ORF-1b forward primer: 8 out of 16 explicit sequences have high hairpin loop Tm
HKU-N forward primer: high hairpin loop Tm
HKU-ORF-1b reverse primer: four out of five 3’ bases are G/C (GC clamp)
HKU-N probe: First 5’ base is G (interferes with fluorescence)
HKU-ORF-1b probe: 8 out of 16 explicit sequences have high temperature hairpin loops
HKU-N probe: high temperature hairpin loop
HKU-ORF-1b & HKU-N: all explicit primer-probe sets have less than 6 degrees Celsius difference between probe Tm and primer Tm

Japan

Forward primer has high-temperature hairpin loop

Thailand

The probe Tm is not 6 degrees Celsius higher than the primers’ Tm

Discussion

A thorough check for common design flaws among an international set of primer-probe sets shows that many other countries have issues similar to those of the CDC primer-probe sets. While there is no guarantee that a theoretically-optimal primer-probe set will perform well in vitro (and, conversely, there are many “flawed” primer-probe sets in various applications that have been observed to work well in practice), it is inefficient to ignore these common pitfalls. This is because it only takes seconds for tools like Primer3 to screen these primer-probe sets for common flaws, while it takes significantly more time and resources to perform similar tests at the wet bench. Especially bearing in mind that these government labs are racing against time to develop efficient and accurate primer-probe sets at scale for testing entire populations against a rapidly-spreading pandemic, there should be no tolerance for testing theoretically-flawed primer-probe designs when there are others free of such flaws available for experimental validation. While not necessarily optimal (and definitely not validated in vitro), I was able to generate such primer-probe sets targeting SARS-CoV-2 without these design flaws. Furthermore, all of the oligos in the primer-probe sets that I generated are unambiguous; better not to dilute one’s oligos needlessly with degenerate primer design. It is my hope that going forward there can be standardization around the primer design process, to allow these design flaws to be avoided systematically by government labs worldwide.

Updated set of COVID-19 Candidate Primers

2020-03-11T00:00:00-07:00

In the wake of my previous blog post, I have received numerous bits of advice on how to make my set of COVID-19 candidate primers better. Today I am releasing my second version of the candidate primers. Here is a summary of the updates:

All data resources are being updated on a Git repository on BitBucket, allowing easy sharing among bioinformaticists. Feel free to fork & send me pull requests to make these resources better.
Latest spreadsheet report (version 2) with recommended candidate primers highlighted in green (marginal primers highlighted in yellow).
Added many additional columns to the report to help in candidate primer-probe set selection, including %GC of the oligos, the melting temperature of the oligos, and amplicon %GC and length (relative to the reference RefSeq COVID-19 complete genome). Column ‘U’ in particular reports the difference between the probe melting temperature and the higher of the forward and reverse primer melting temperatures. You want this difference to be larger than 6 degrees Celsius, to allow for efficient probe hybridization. I sort the report first on the F1-score (column ‘M’, descending) and then on the probe-primers melting temperature difference (column ‘U’, descending), to make it easy to find the best primer-probe sets.
Documentation on the meaning of the columns in the spreadsheet report.
Added the Primer3 parameter file, so that everyone can see the choice of parameter settings used for generating these primers. Of course there’s no one-size-fits-all, as different kits will be run on different machines, and thus there will be a need to modify the parameters to fit the given protocol. I welcome feedback on how I can create additional parameter files for various protocols.
A FASTA file of COVID-19 primer-probe sets that are currently deployed in testing kits.
Fixing of the NCBI Blast database sequence identifier issue (see original post)
Improvements to the amplicon-calling code and other fixes to improve performance
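
The probe-primer melting temperature difference and the sort order described above can be sketched as follows. This is an illustration only: the field names and all of the numbers are hypothetical and do not come from the report.

```python
def tm_margin(primer_set):
    """Probe Tm minus the higher of the two primer Tms, in degrees Celsius."""
    return primer_set["probe_tm"] - max(
        primer_set["forward_tm"], primer_set["reverse_tm"]
    )


def passes_margin(primer_set, minimum=6.0):
    """True if the probe Tm exceeds both primer Tms by more than `minimum`."""
    return tm_margin(primer_set) > minimum


def rank_primer_sets(primer_sets):
    """Sort by F1-score, then by Tm margin, both descending."""
    return sorted(primer_sets, key=lambda s: (-s["f1"], -tm_margin(s)))


# Hypothetical rows; none of these values are taken from the spreadsheet.
EXAMPLE_SETS = [
    {"id": "set-B", "f1": 1.0, "forward_tm": 59.0, "reverse_tm": 60.0, "probe_tm": 64.0},
    {"id": "set-C", "f1": 0.9, "forward_tm": 58.0, "reverse_tm": 59.0, "probe_tm": 70.0},
    {"id": "set-A", "f1": 1.0, "forward_tm": 60.0, "reverse_tm": 61.0, "probe_tm": 68.5},
]

if __name__ == "__main__":
    for s in rank_primer_sets(EXAMPLE_SETS):
        print(s["id"], s["f1"], tm_margin(s), passes_margin(s))
```

In this made-up example, set-A and set-B tie on F1-score, so the Tm margin breaks the tie in favour of set-A; set-B would still be listed ahead of set-C even though its margin falls short of 6 degrees.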

I hope this is another step towards making this work more useful and accessible. Please let me know if you have any technical questions or bug reports (code bugs or documentation bugs) by creating an issue in the Issue Tracker. For all else, you can find me on Twitter or email me (see bottom of page).

Technical Problems with Existing CDC COVID-19 Primers, and an Improved Set of Primers

2020-03-03T00:00:00-08:00

Summary

In this post I review technical problems with the CDC COVID-19 primers and I describe how I generated a new set of primers and probes without those problems.

Note that this was based on available outbreak whole-genome sequence data obtained from the NCBI Blast NT database, downloaded from NCBI on 2020-03-02. I will try to re-run this pipeline every few days to update the set of candidate primers in light of newly-available sequence data. Please check these primers yourself before using them as the basis of any diagnostic kit.

This blog post describes technical problems with some of the COVID-19 primer-probe sets that are being promoted by the CDC to diagnose cases of COVID-19 infections. Several of these primers have dimerization and hairpin loop issues, among others. Here, I describe a bioinformatic pipeline to design better candidate primers that pass stringent design criteria. Using this pipeline, I generated a new set of primers and probes without the aforementioned technical issues of the CDC COVID-19 primer probe sets, and tested their in silico precision and recall using the most recent set of COVID-19 complete genomes from NCBI. The ten best primer-probe pairs have perfect classification performance (recall, precision, and F1-score all 1.0). I provide these candidate primers for free, under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

My hope is that this blog post will help start a conversation about the design process for these critical diagnostic primers. Starting with these primers, and with input from other scientists, we can develop a set of candidate primers for COVID-19 that could be used by health agencies and private diagnostic kit manufacturers alike to not only generate more diagnostic kits, but better ones.

Primer Design Issues

As the COVID-19 pandemic spread from country to country over the past month, I felt that I should try to help in some way. I had recently been working on primer design as part of my consultancy, and I thought that perhaps I could analyze the primers developed by the CDC and others, in part to see how their primers stack up against the ones that I had developed for a different project.

As others have noted, there are flaws in the COVID-19 diagnostic primers from the CDC. They released four primer-probe sets: three that target the N gene, which encodes the nucleocapsid phosphoprotein, and one that targets the human RNase P gene (as part of laboratory controls).

Using Primer3, a respected software program for predicting performant primer probe sets using thermodynamic calculations, I analyzed the three N gene primer probe sets. I found the following problems with each set:

N1

Here is primer probe set N1, shown in Boulder format (as used by Primer3):

SEQUENCE_PRIMER=GACCCCAAAATCAGCGAAAT
SEQUENCE_INTERNAL_OLIGO=ACCCCGCATTACGTTTGGTGGACC
SEQUENCE_PRIMER_REVCOMP=TCTGGTTACTGCCAGTTGAATCTG
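
For anyone who wants to reproduce this, a record like the one above can be assembled with a few lines of Python. This is only a sketch of the record layout: the helper name is hypothetical, and a real Primer3 run would need further settings (and a template sequence) that are not shown in this post. In the Boulder format, each record is terminated by a line containing a single “=”.

```python
def boulder_record(seq_id, forward, probe, reverse):
    """Format one primer-probe set as a Boulder-format record (sketch only)."""
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_PRIMER={forward}",
        f"SEQUENCE_INTERNAL_OLIGO={probe}",
        f"SEQUENCE_PRIMER_REVCOMP={reverse}",
        "=",  # a lone '=' marks the end of the record
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        boulder_record(
            "CDC-N1-COVID-19",
            "GACCCCAAAATCAGCGAAAT",
            "ACCCCGCATTACGTTTGGTGGACC",
            "TCTGGTTACTGCCAGTTGAATCTG",
        ),
        end="",
    )
```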

Here is the formatted analysis summary from Primer3 for N1 (note that in the terminology of Primer3, the forward primer is termed “Left”, the reverse primer is termed “Right”, and the hybridization probe is termed “Internal Oligo”):

OLIGO            start  len      tm     gc%  any_th  3'_th hairpin seq
LEFT PRIMER      28286   20   58.32   45.00    0.00   0.00    0.00 GACCCCAAAATCAGCGAAAT
RIGHT PRIMER     28357   24   62.50   45.83    2.43   0.00   52.19 TCTGGTTACTGCCAGTTGAATCTG
INTERNAL OLIGO   28308   24   69.19   58.33   13.48   0.00   42.14 ACCCCGCATTACGTTTGGTGGACC
SEQUENCE SIZE: 29902
INCLUDED REGION SIZE: 29902

PRODUCT SIZE: 72, PAIR ANY_TH COMPL: 0.00, PAIR 3'_TH COMPL: 0.00

The reverse primer was rejected by Primer3 due to a predicted hairpin loop. These loop formations can cause problems during PCR, leading to lower amplification efficiency. The Primer3 output for the hairpin is as follows:

SEQUENCE_ID=CDC-N1-COVID-19
Reverse primer:
Tm: 52.2°C  dG: -1596 cal/mol  dH: -34200 cal/mol  dS: -105 cal/mol*K
        5' TCTGGTTA*
            ||||   |
3' GTCTAAGTTGACCGTC*

Primer3 also found further problems with the primer-probe set, although at melting temperatures below its default cutoff. Namely, the reverse primer and the hybridization probe can each form a self-dimer, and the hybridization probe can itself form a hairpin loop:

Reverse primer self-dimer:

Tm: 2.4°C  dG: -5495 cal/mol  dH: -45600 cal/mol  dS: -129 cal/mol*K
 5' TCTGGTTACTGCCAGTTGAATCTG 3'
           |||| ||||
3' GTCTAAGTTGACCGTCATTGGTCT 5'

Hybridization probe self-dimer:

Tm: 13.5°C  dG: -4045 cal/mol  dH: -87400 cal/mol  dS: -269 cal/mol*K
5' ACCCCGCATTACGTTTGGTGGACC 3'
      || |   ||||   | ||
3' CCAGGTGGTTTGCATTACGCCCCA 5'

Hybridization probe 3’ hairpin loop:

Tm: 42.1°C  dG: -274 cal/mol  dH: -16800 cal/mol  dS: -53 cal/mol*K
       5' ACCCCGCA*
              ||  T
3' CCAGGTGGTTTGCAT*

N2

Here is set N2:

SEQUENCE_PRIMER=TTACAAACATTGGCCGCAAA
SEQUENCE_INTERNAL_OLIGO=ACAATTTGCCCCCAGCGCTTCAG
SEQUENCE_PRIMER_REVCOMP=GCGCGACATTCCGAAGAA

Just from inspection, you can see that there is an issue with a poly-X run of five ‘C’s in the hybridization probe. Poly-X runs of five bases or longer are known to cause problems with non-specific priming (possibly leading to false positive readings), and are normally avoided.
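
This kind of inspection is easy to automate. Below is a minimal sketch (the function name is my own, not part of Primer3) that reports the longest single-base run in an oligo:

```python
from itertools import groupby


def longest_run(seq):
    """Return (base, length) of the longest single-base run in seq."""
    best_base, best_len = "", 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length > best_len:
            best_base, best_len = base, length
    return best_base, best_len


if __name__ == "__main__":
    n2_probe = "ACAATTTGCCCCCAGCGCTTCAG"
    print(longest_run(n2_probe))  # ('C', 5)
```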

Here is the formatted analysis summary from Primer3 for N2:

OLIGO            start  len      tm     gc%  any_th  3'_th hairpin seq
LEFT PRIMER      29163   20   58.75   40.00    0.00   0.00   34.76 TTACAAACATTGGCCGCAAA
RIGHT PRIMER     29229   18   60.10   55.56   12.76   0.00   36.75 GCGCGACATTCCGAAGAA
INTERNAL OLIGO   29187   23   68.15   56.52    8.63   0.00   49.09 ACAATTTGCCCCCAGCGCTTCAG
SEQUENCE SIZE: 29902
INCLUDED REGION SIZE: 29902

PRODUCT SIZE: 67, PAIR ANY_TH COMPL: 0.00, PAIR 3'_TH COMPL: 0.00

Primer3 found self-dimer and hairpin loop issues with all of the oligos in the N2 primer-probe set, though at lower temperatures.

N3

Here is set N3:

SEQUENCE_PRIMER=GGGAGCCTTGAATACACCAAAA
SEQUENCE_INTERNAL_OLIGO=ATCACATTGGCACCCGCAATCCTG
SEQUENCE_PRIMER_REVCOMP=TGTAGCACGATTGCAGCATTG

The forward primer, while having a C base in the last five base positions, has it at the 5’ end of that region, with a poly-A tail following. While single-base runs of up to four bases are normally tolerated, it’s worrisome to place one at the 3’ end. The reverse primer also had a predicted hairpin loop:

SEQUENCE_ID=CDC-N3-COVID-19
Reverse primer:
Tm: 53.8°C  dG: -1210 cal/mol  dH: -23500 cal/mol  dS: -72 cal/mol*K
   5' TGTAGCACG*
          |||  |
3' GTTACGACGTTA*

Here is the formatted analysis summary from Primer3 for N3:

OLIGO            start  len      tm     gc%  any_th  3'_th hairpin seq
LEFT PRIMER      28680   22   60.53   45.45    0.00   0.00    0.00 GGGAGCCTTGAATACACCAAAA
RIGHT PRIMER     28751   21   61.50   47.62    0.00   0.00   53.84 TGTAGCACGATTGCAGCATTG
INTERNAL OLIGO   28703   24   67.77   54.17    0.00   0.00   41.98 ATCACATTGGCACCCGCAATCCTG
SEQUENCE SIZE: 29902
INCLUDED REGION SIZE: 29902

PRODUCT SIZE: 72, PAIR ANY_TH COMPL: 0.00, PAIR 3'_TH COMPL: 0.00

Developing New Candidate Primers

Primer Design Criteria

In order to design better-performing COVID-19 primers, I developed a bioinformatic pipeline that generates primers using the following criteria:

Target regions conserved perfectly among all COVID complete genomes (41 downloaded on 2020-03-02; check NCBI Nucleotide for the current list)
Poly-X runs longer than four bases are not allowed
Check that hybridization probes do not start with G (avoid quenching)
Check for GC clamp on the forward and reverse primers (3’ end has one or two G’s or C’s in the last five bases)
GC% range between 40 and 60, with an optimum at 50
Primer and probe size range from 18 to 27 bases
Amplicon size from 75 to 200 bases
Disallow any hairpins, self-dimers, or oligo-interactions at ANY temperature
Exclude regions that are identical to the four known common cold-causing, human-associated coronaviruses (229E, NL63, OC43, and HKU1)
Primers designed to cover the viral RNA polymerase gene, as it tends to be highly conserved within an RNA viral species, and different from the RNA polymerase sequence of different RNA viral species (as per Tom Slezak, former head of biodefense program at LLNL, private communication)
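
Several of the sequence-composition criteria above (oligo length, GC%, the GC clamp, and the probe’s 5’ base) are simple enough to check directly. The sketch below is an illustration with hypothetical function names; it is not the pipeline itself, which relies on Primer3 for these checks and for the thermodynamic ones.

```python
def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_clamp_count(primer):
    """Number of G/C bases among the last five (3') bases of a primer."""
    return sum(base in "GC" for base in primer.upper()[-5:])


def check_oligo(oligo):
    """Checks shared by primers and probes: length and GC%."""
    problems = []
    if not 18 <= len(oligo) <= 27:
        problems.append("length outside 18-27 bases")
    if not 40.0 <= gc_percent(oligo) <= 60.0:
        problems.append("GC% outside 40-60")
    return problems


def check_primer(primer):
    """Shared checks plus the GC clamp rule (one or two G/C in the last five)."""
    problems = check_oligo(primer)
    if gc_clamp_count(primer) not in (1, 2):
        problems.append("GC clamp: need one or two G/C in the last five bases")
    return problems


def check_probe(probe):
    """Shared checks plus: a probe should not start with G (quenching)."""
    problems = check_oligo(probe)
    if probe.upper().startswith("G"):
        problems.append("probe starts with G")
    return problems


if __name__ == "__main__":
    # CDC N1 forward primer and probe, as listed earlier in this post.
    print(gc_percent("GACCCCAAAATCAGCGAAAT"))       # 45.0
    print(check_primer("GACCCCAAAATCAGCGAAAT"))     # []
    print(check_probe("ACCCCGCATTACGTTTGGTGGACC"))  # []
```

Note that the CDC N1 oligos pass these composition checks; the problems reported for them earlier are hairpin and dimer predictions, which need a thermodynamic model.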

The New Candidate Primers Files

The set of candidate primers (last generated on 2020-03-11) can be downloaded here in XLSX spreadsheet format, and here as a comma-separated value (“CSV”) file. These are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. I recommend focusing on the top ten primer-probe sets, as they have the greatest performance (based on the F1-score).

Below find selected details about how the primers were designed and validated using an ‘e-PCR’ approach.

Primer Performance Validation in silico

Even if you design your primers using recommended settings with trusted software like Primer3, there’s still the chance that you’ll have off-target binding of the oligos to other stretches of DNA in your sample, or even off-target amplicons if a forward and reverse primer bind close enough to one another in the correct orientation. For that reason, I analyzed the in silico performance of the newly designed primers to determine their predicted recall and precision. To do so, I performed a sequence homology search using NCBI Blast+ (version 2.10.0) against the NT database (downloaded 2020-03-02). Complete genomes (whether prokaryotic or viral) where the forward and reverse primers, and the hybridization probe, all had gapless alignments with 90% or greater sequence identity, and in the proper orientation, were counted as a hit. I used all complete viral genomes in NT with NCBI Taxonomy Database identifier 2697049 (the identifier for the COVID-19 sequences) as the “gold standard” for assessing the performance of these primer probe sets as diagnostics for COVID-19 presence (i.e., these are the sequences that the primers should hit without fail).
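
To make the scoring concrete, here is a minimal sketch of how precision, recall, and F1-score follow from the set of genomes a primer-probe set hits and the gold-standard set. The function name and the genome identifiers are hypothetical; the formulas are the standard definitions.

```python
def classification_scores(hits, gold):
    """Return (precision, recall, f1) for hit genomes versus a gold standard."""
    hits, gold = set(hits), set(gold)
    true_pos = len(hits & gold)    # gold-standard genomes that were hit
    false_pos = len(hits - gold)   # hits outside the gold standard
    false_neg = len(gold - hits)   # gold-standard genomes that were missed
    precision = true_pos / (true_pos + false_pos) if hits else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical genome identifiers, for illustration only.
    gold = {"g1", "g2", "g3", "g4"}
    print(classification_scores({"g1", "g2", "g3", "other"}, gold))
    print(classification_scores(gold, gold))  # (1.0, 1.0, 1.0)
```

A primer-probe set that hits every gold-standard genome and nothing else scores 1.0 on all three measures; each missed genome lowers recall, and each off-target hit lowers precision.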

The spreadsheet shows that many of the newly-designed primers have five false negatives, but this is actually a bug in my current pipeline. It’s due to an obscure detail about how NCBI represents identical genomes in its Blast databases. Those genomes are all identified, just under a different identifier. So the top ten primer-probe pairs actually have perfect performance: precision, recall, and F1-score all 1.0.

Next Steps

Thank you for reading this far. My hope is that I can start up a conversation among scientists about how to design and validate candidate primer probes in silico, to allow for faster and more efficient wet-bench validation of the candidate primers. I hope this will lead to better, more accurate COVID-19 diagnostic kits being designed and successfully deployed around the country.

Please feel free to follow up with me via email (blog-at-me.tomeraltman.net) or Twitter (@tomeraltman). I look forward to constructive feedback about how to improve this analysis. I am working on releasing the source code for the bioinformatic pipeline, and writing up the methodology in more detail. Thanks again for your time, and thanks in advance for any help that you might provide.

TODOs:

Perform same analysis on WHO recommended COVID-19 primers
Release software pipeline code
Complete secondary structure code blocks for low-temperature entries
Fix false negative reporting bug due to NCBI Blast databases [FIXED: see update post]
Prepare technical manuscript detailing the methodology

Acknowledgments

My deepest gratitude to the following individuals for helping me by reviewing this post:

Updates

[2020-03-06] Incorporated suggestions and fixes from Nathan Walsh and Nora Callahan. Thanks!
[2020-03-11] Released latest version of COVID-19 candidate primers. Updated file links. See this update post for more information.
[2020-03-12] The formatted analysis summary for N3 was accidentally copied from N2. Fixed.

Hello, World!

2016-03-15T22:26:35-07:00

Welcome!

This is my personal blog, where I hope to share ideas and hacks that others might find helpful. You can find me around the web on GitHub, BitBucket, and LinkedIn. Thanks for stopping by!