<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.0.1">Jekyll</generator><link href="https://tomeraltman.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://tomeraltman.net/" rel="alternate" type="text/html" /><updated>2023-11-30T13:46:16-08:00</updated><id>https://tomeraltman.net/feed.xml</id><title type="html">Biome Informatics</title><subtitle>A blog for me to capture my thoughts and showcase my projects, mostly of a technical nature. Topics will be in the orbit of bioinformatics, computational biology, data science, and information security. </subtitle><entry><title type="html">Train of Thought: Models of Bio-Computation</title><link href="https://tomeraltman.net/2021/04/05/models-of-biological-computation.html" rel="alternate" type="text/html" title="Train of Thought: Models of Bio-Computation" /><published>2021-04-05T00:00:00-07:00</published><updated>2021-04-05T00:00:00-07:00</updated><id>https://tomeraltman.net/2021/04/05/models-of-biological-computation</id><content type="html" xml:base="https://tomeraltman.net/2021/04/05/models-of-biological-computation.html"><![CDATA[<p>This morning I read <a href="https://www.scottaaronson.com/blog/?p=5402">Scott Aaronson’s post musing on the computational
expressiveness of his son’s toy train
set</a>. This immediately
triggered for me a flood of memories about my undergraduate days at
Cal, where Scott and I overlapped for a few years (while he was
younger than me, he was already in grad school). While there, I had
the great fortune of being a researcher in <a href="https://biosciences.lbl.gov/profiles/adam-p-arkin/">Adam Arkin’s computational
biology lab</a>,
where I could soak up mind-expanding ideas from the grad students,
post-docs, and staff researchers about biological
computing<sup id="fnref:dnacompute" role="doc-noteref"><a href="#fn:dnacompute" class="footnote" rel="footnote">1</a></sup>, and computing about biology.</p>

<p>One area of interest was in determining what kinds of computation could
be performed by cells or biochemical reaction networks<sup id="fnref:turing" role="doc-noteref"><a href="#fn:turing" class="footnote" rel="footnote">2</a></sup>,<sup id="fnref:fsm" role="doc-noteref"><a href="#fn:fsm" class="footnote" rel="footnote">3</a></sup>. While it has
been shown that some cells have NAND-logic-like behavior<sup id="fnref:NAND" role="doc-noteref"><a href="#fn:NAND" class="footnote" rel="footnote">4</a></sup>, it seemed
that this was limiting, because there are only so many genes in a cell
with that kind of behavior, and there are many other tricks up the
sleeve of cells and DNA. It struck me that all of the string
manipulation tricks that a genome can perform might provide a more
expressive form of computation. While I did find one paper<sup id="fnref:splicing" role="doc-noteref"><a href="#fn:splicing" class="footnote" rel="footnote">5</a></sup> that
tackled that directly, I thought that an analogous problem must have
gathered much more attention, and thus there might be a richer vein of
material to mine there.</p>

<p>I thought that the various proteins like RNA polymerase that proceed
along the strands of DNA in a cell might be viewed as a train moving
along a train track, where “stops” would be analogous to transcribing
a gene<sup id="fnref:firecracker" role="doc-noteref"><a href="#fn:firecracker" class="footnote" rel="footnote">6</a></sup>. I thought that all sorts of interesting modifications can
happen to the “track”, like inversions. I dug for references online,
and I came across this gem, which uses the common elements of train
track switching to explore the universal computation capabilities of
sufficiently designed train tracks: <a href="http://www.monochrom.at/turingtrainterminal/Chalcraft.pdf">“Train Sets” by Adam Chalcraft and Michael
Greene</a>.</p>

<p>Alas, I lacked the training (no pun intended) to soak up all of this
information and do anything significant with it. To this day I am
fascinated with understanding how the cell meshes “analog” computation
in the form of biochemical signal processing<sup id="fnref:sigproc" role="doc-noteref"><a href="#fn:sigproc" class="footnote" rel="footnote">7</a></sup>, with the “digital”
computation as expressed in the nucleotide sequence of the
genome.</p>

<p>To find the above link, all I had to go on was a dusty print-out that I had
miraculously saved from my undergraduate days, with no further
bibliographic information. Searching for the title
and authors, I came across some related pages/articles:</p>

<ul>
  <li><a href="https://esolangs.org/wiki/Chalcraft-Greene_train_track_automaton">Chalcraft-Greene train track automaton</a></li>
  <li><a href="https://www.i-programmer.info/news/112-theory/12067-computing-with-trains-turings-trains.html">Computing With Trains - Turing’s Trains</a></li>
  <li><a href="http://bit-player.org/wp-content/extras/bph-publications/AmSci-2007-03-Hayes-trains.pdf">Article: Trains of Thought</a></li>
</ul>

<p>Hopefully someone will find this assemblage of links amusing, or
useful.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:dnacompute" role="doc-endnote">
      <p><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.9196">Recent Developments in DNA-Computing</a> <a href="#fnref:dnacompute" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:turing" role="doc-endnote">
      <p><a href="https://www.researchgate.net/publication/253809773_Chemical_Kinetics_is_Turing_Universal">Chemical Kinetics is Turing Universal</a> <a href="#fnref:turing" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:fsm" role="doc-endnote">
      <p><a href="https://www.researchgate.net/publication/11743809_Chemical_implementation_of_finite-state_machines">Chemical Implementation of Finite-State Machines</a> <a href="#fnref:fsm" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:NAND" role="doc-endnote">
      <p><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.4499">Cellular Gate Technology</a> <a href="#fnref:NAND" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:splicing" role="doc-endnote">
      <p><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7963">On the Power of Circular Splicing Systems and DNA Computability</a> <a href="#fnref:splicing" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:firecracker" role="doc-endnote">
      <p><a href="https://pubmed.ncbi.nlm.nih.gov/11719800/">Programmable and autonomous computing machine made of biomolecules</a> <a href="#fnref:firecracker" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:sigproc" role="doc-endnote">
      <p><a href="https://www.researchgate.net/publication/4033212_Motifs_and_modules_in_cellular_signal_processing_applications_to_microbial_stress_response_pathways">Motifs and Modules in Cellular Signal Processing</a> <a href="#fnref:sigproc" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[This morning I read Scott Aaronson’s post musing on the computational expressiveness of his son’s toy train set. This immediately triggered for me a flood of memories about my undergraduate days at Cal, where Scott and I overlapped for a few years (while he was younger than me, he was already in grad school). While there, I had the great fortune of being a researcher in Adam Arkin’s computational biology lab, where I could soak up mind-expanding ideas from the grad students, post-docs, and staff researchers about biological computing1, and computing about biology. Recent Developments in DNA-Computing &#8617;]]></summary></entry><entry><title type="html">Lack of COVID-19 risk mitigation at SF GoodWill</title><link href="https://tomeraltman.net/2020/08/07/goodwill-covid-risk.html" rel="alternate" type="text/html" title="Lack of COVID-19 risk mitigation at SF GoodWill" /><published>2020-08-07T00:00:00-07:00</published><updated>2020-08-07T00:00:00-07:00</updated><id>https://tomeraltman.net/2020/08/07/goodwill-covid-risk</id><content type="html" xml:base="https://tomeraltman.net/2020/08/07/goodwill-covid-risk.html"><![CDATA[<p>I don’t normally like complaining on the Internet, but I felt that
this situation was particularly egregious, and risks people’s health
needlessly.</p>

<p>This morning, after months of putting off donating unwanted home
goods, I drove to the GoodWill drop-off center on Wisconsin Street in
San Francisco. There was a “supervisor” who waved me in. Despite
there being only one car in the parking lot, he had me park <em>right</em>
behind that car. It seemed too close, but I figured that I could keep
6 ft. of separation as long as the people in front of me didn’t come
near my trunk, where I would be unloading ~10 bags. Sensing that this
would not be a fast drop-off, I donned my N95, and put a bandanna over
it as well to cover the exhaust valve.</p>

<p>The first thing that I noticed was that, while having masks, many of
the workers, including the supervisor, didn’t cover their nose with
the mask, which defeats the purpose of the mask. Also, the workers in
blue shirts weren’t keeping 6 feet of separation between themselves.</p>

<p>Secondly, I noticed that they had placed about 20 huge bins, ten on a
side, on the two sides of the parking lot. So instead of a quick
drop-off, now they were asking people to unpack all of their bags, and
then run around the parking lot trying to figure out where to put
things. The signs were very small, so you had to walk around to figure
out what each one said. This seemed to turn what, from a donor’s
perspective, would be a low-risk quick drop-off of goods, into something
that would require you to spend a fair amount of time unpacking your
bags, and sorting by walking around the parking lot, which was quickly
filling up behind me with more cars. It would be increasingly hard to
maneuver around the bins and cars while maintaining 6 feet of
separation between the workers and other donors milling about.</p>

<p>While I was asking the supervisor (the one with bare nostrils) about
what goes where, I put my hand up to my ear as in “I can’t really hear
you”, because there was a construction project right behind him, with
a jack-hammer going off. He decided to come a bit closer to me, within
6 feet. So I backed up instead of telling him to back up, to try to
avoid any misunderstandings. He instead took offense, saying, “Do you
have a problem?”. I replied back to him, “No, it’s just that you’re
not maintaining six feet of space.” He then replied, “Well, then you
came to the wrong place, because here it’s real close.” By that point,
there were five cars in the parking lot, all parked almost
bumper-to-bumper.</p>

<p>At that point I said, “… then I’m leaving”; I no longer felt
comfortable in that situation. While I want to support GoodWill’s
mission, what they were doing there was destructive and harmful to
their donors and to their workers.</p>

<p>This experience has given me little hope that we can keep the economy
open. The SF Chronicle recently published an article saying that a
study found that in a random group of 25 people in SF, there was a 34%
chance that one of them has a SARS-CoV-2 infection. If businesses are
allowing people to aggregate in large numbers without proper
mask-wearing (e.g., covering your mouth <em>and</em> nose), and not caring
about maintaining 6 feet of separation, then the spread of the
pandemic will not be curtailed. I hope by writing this up I can bring
this to the attention of GoodWill and the City of San Francisco,
because my experience today was unacceptable from a public health
perspective.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I don’t normally like complaining on the Internet, but I felt that this situation was particularly egregious, and risks people’s health needlessly.]]></summary></entry><entry><title type="html">Design Flaws in COVID-19 Primers from Multiple International Labs</title><link href="https://tomeraltman.net/2020/04/05/all-public-primers.html" rel="alternate" type="text/html" title="Design Flaws in COVID-19 Primers from Multiple International Labs" /><published>2020-04-05T00:00:00-07:00</published><updated>2020-04-05T00:00:00-07:00</updated><id>https://tomeraltman.net/2020/04/05/all-public-primers</id><content type="html" xml:base="https://tomeraltman.net/2020/04/05/all-public-primers.html"><![CDATA[<h2 id="summary">Summary</h2>

<p>Not only do the CDC COVID-19 primer-probe sets have design flaws, but
other publicly-available primer-probe sets from government labs around
the globe have their own flaws as well.</p>

<h2 id="methodology">Methodology</h2>

<p>I sought to perform detailed primer design analysis on all available
COVID-19 primer-probe sets to allow for a fair comparison of whether
such flaws are widespread, or particular to the CDC’s designs. I
collated a set of primer-probe sets as found in <a href="https://www.biorxiv.org/content/10.1101/2020.02.25.964775v1">a pre-print by Jung
et al.</a>
and <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6988269/">Corman et
al.</a>
into a FASTA file available
<a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/kit-primers/all-COVID-19-kit-primers.fasta">here</a>.
I applied the same analysis approach as described <a href="https://tomeraltman.net/2020/03/03/technical-problems-COVID-primers.html">in my original post
regarding COVID-19 design
flaws</a>
unless otherwise noted. One complication I encountered with the
expanded set of primers was the presence of ambiguous bases in some
oligos, sometimes more than one per oligo. This is an issue because Primer3
cannot analyze such sequences. I handled this by expanding each such
sequence into all implied combinations of unambiguous bases, and
using Primer3 to analyze each expanded sequence. In the results table, you will
note the ID sub-field of “subset”: “subset-0” means that the original
sequence was used as-is; “subset-N” for N=1,2,3,… denotes the
enumerated sequences expanded from an original sequence with ambiguous bases.</p>
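
<p>The expansion step can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function name <code>expand_ambiguous</code>, the example oligo, and the enumeration order are all assumptions; only the IUPAC code table is standard.</p>

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguous(oligo):
    """Return every unambiguous sequence implied by an oligo with IUPAC codes."""
    choices = [IUPAC_CODES[base] for base in oligo.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    # Hypothetical oligo with two ambiguous positions (R and Y): 2 * 2 = 4 sequences.
    # The numbering only mimics the "subset-N" labels; the real order may differ.
    for n, seq in enumerate(expand_ambiguous("ACRTYG"), start=1):
        print(f"subset-{n}", seq)
```

<p>The number of expanded sequences is the product of the number of options at each ambiguous position, which is why a single oligo can turn into 16 or 32 explicit sequences.</p>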

<h2 id="results">Results</h2>

<p>The <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/reports/all-public-primers-qc-report.csv">comparison table of all primer-probe
sets</a>,
covering a full set of tests for the forward primer, reverse primer, the
hybridization probe, and combinations thereof, is available on the
<a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/">COVID-19 Primer Data BitBucket
repository</a>.
In the CSV file, values surrounded by “**” indicate a value that is
above or below a quality control threshold (i.e., a design flaw).</p>
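
<p>For readers who want to pull the flagged values out of the CSV programmatically, something like the following sketch may help. The column names and sample rows here are hypothetical placeholders, not the actual layout of the report; the only thing taken from the report is the “**” marker convention.</p>

```python
import csv
import io


def flagged_cells(csv_text):
    """Yield (row_number, column, value) for every cell wrapped in '**'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            is_flagged = (
                value is not None
                and len(value) > 4
                and value.startswith("**")
                and value.endswith("**")
            )
            if is_flagged:
                # Strip the two-character marker from both ends.
                yield row_number, column, value[2:-2]


if __name__ == "__main__":
    # Hypothetical sample data; the real report has different columns.
    sample = (
        "id,probe_hairpin_tm,gc_clamp\n"
        "example-set-1,**70.00**,2\n"
        "example-set-2,41.00,**5**\n"
    )
    for hit in flagged_cells(sample):
        print(hit)
```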

<p>In the next sections I will describe design flaws detected, organized
by the origin of the primer-probe sets (in no particular order):</p>

<h3 id="cdc">CDC</h3>

<p>In my initial analysis of the CDC primers, I restricted myself to just
the primers targeting the SARS-CoV-2 genome. In this latest analysis,
I included the control primer-probe set targeting the human RNase P (RP) gene:</p>

<ul>
  <li>The forward primer has five out of five of the 3’ bases as G/C (GC clamp)</li>
</ul>
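
<p>The GC clamp check referred to throughout these results amounts to counting G/C bases in the last five positions at the 3’ end of a primer. A minimal sketch, assuming the “one or two G/C bases” acceptance rule from the original post; the function names and the second example sequence are hypothetical:</p>

```python
def gc_clamp_count(primer, window=5):
    """Count G/C bases in the last `window` bases at the 3' end of a primer."""
    tail = primer.upper()[-window:]
    return sum(base in "GC" for base in tail)


def has_acceptable_gc_clamp(primer):
    """Accept primers with exactly one or two G/C bases in the 3' window."""
    return gc_clamp_count(primer) in (1, 2)


if __name__ == "__main__":
    # CDC N1 forward primer (from the original post): 3' tail is GAAAT.
    print(gc_clamp_count("GACCCCAAAATCAGCGAAAT"))
    # Hypothetical primer whose 3' tail is all G/C, which would be flagged.
    print(gc_clamp_count("ATATATGCGCC"), has_acceptable_gc_clamp("ATATATGCGCC"))
```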

<h3 id="china">China</h3>

<ul>
  <li>China-N: A high-temperature hairpin loop in the reverse primer</li>
  <li>China-N: A high-temperature hairpin loop in the hybridization probe</li>
  <li>China-N: The hybridization probe’s Tm is not at least 6 degrees Celsius higher than the primers’ Tm</li>
  <li>ORF1ab: The hybridization probe has a high-temperature hairpin loop</li>
</ul>

<h3 id="eu-drosten">EU-Drosten</h3>

<ul>
  <li>EU-Drosten-N’s reverse primer’s 3’ end GC clamp has four G/C bases in the last five base positions</li>
  <li>EU-Drosten-N’s hybridization probe has a high-temperature hairpin loop</li>
  <li>EU-Drosten-N’s probe Tm is not at least 6 degrees Celsius higher than the primers’ Tm</li>
  <li>EU-Drosten-RdRP-P1 and EU-Drosten-RdRP-P2: forward primer has five out of five 3’ bases as G/C (GC clamp)</li>
  <li>EU-Drosten-RdRP-P1 probe: 20 out of 32 sequences have high-temperature hairpin loops</li>
  <li>EU-Drosten-RdRP-P1: 28 out of 32 explicit primer-probe sets have less than 6 degrees Celsius difference between probe Tm and primer Tm</li>
  <li>EU-Drosten-RdRP-P2 probe: has a very high hairpin loop Tm of 79.14 degrees Celsius</li>
  <li>EU-Drosten-RdRP-P2: explicit primer-probe sets have less than 6 degrees Celsius difference between probe Tm and primer Tm</li>
</ul>

<h3 id="hong-kong">Hong Kong</h3>

<ul>
  <li>HKU-ORF-1b forward primer: 8 out of 16 explicit sequences have high hairpin loop Tm</li>
  <li>HKU-N forward primer: high hairpin loop Tm</li>
  <li>HKU-ORF-1b reverse primer: four out of five 3’ bases are G/C (GC clamp)</li>
  <li>HKU-N probe: First 5’ base is G (interferes with fluorescence)</li>
  <li>HKU-ORF-1b probe: 8 out of 16 explicit sequences have high temperature hairpin loops</li>
  <li>HKU-N probe: high temperature hairpin loop</li>
  <li>HKU-ORF-1b &amp; HKU-N: all explicit primer-probe sets have less than 6 degrees Celsius difference between probe Tm and primer Tm</li>
</ul>

<h3 id="japan">Japan</h3>

<ul>
  <li>Forward primer has high-temperature hairpin loop</li>
</ul>

<h3 id="thailand">Thailand</h3>

<ul>
  <li>The probe Tm is not at least 6 degrees Celsius higher than the primers’ Tm</li>
</ul>

<h2 id="discussion">Discussion</h2>

<p>A thorough check for common design flaws among an international set of
primer-probe sets shows that many other countries’ sets have issues similar
to those of the CDC primer-probe sets. While there is no guarantee that a
theoretically-optimal primer-probe set will perform well <em>in vitro</em>
(and, conversely, there are many “flawed” primer probe sets in various
applications that have been observed to work well in practice), it is
inefficient to ignore these common pitfalls. This is because it only
takes seconds for tools like Primer3 to screen these primer-probe sets
for common flaws, while it takes significantly more time and resources
to perform similar tests at the wet bench. Especially bearing in mind
that these government labs are racing against time to develop
efficient and accurate primer-probe sets at scale for testing entire
populations against a rapidly-spreading pandemic, there should be no
tolerance for testing theoretically-flawed primer-probe designs, when
there are others free of such flaws available for experimental
validation. While not necessarily optimal (and definitely not
validated <em>in vitro</em>), I was able to generate such <a href="https://tomeraltman.net/2020/03/11/COVID-19-candidate-primers-update.html">primer-probe sets
targeting
SARS-CoV-2</a>
without these design flaws. Furthermore, all of the oligos in the
primer-probe sets that I generated are unambiguous; better not to
dilute one’s oligos needlessly with degenerate primer design. It is my
hope that going forward there can be standardization around the primer
design process, to allow these design flaws to be avoided
systematically by government labs worldwide.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Summary]]></summary></entry><entry><title type="html">Updated set of COVID-19 Candidate Primers</title><link href="https://tomeraltman.net/2020/03/11/COVID-19-candidate-primers-update.html" rel="alternate" type="text/html" title="Updated set of COVID-19 Candidate Primers" /><published>2020-03-11T00:00:00-07:00</published><updated>2020-03-11T00:00:00-07:00</updated><id>https://tomeraltman.net/2020/03/11/COVID-19-candidate-primers-update</id><content type="html" xml:base="https://tomeraltman.net/2020/03/11/COVID-19-candidate-primers-update.html"><![CDATA[<p>In the wake of <a href="https://tomeraltman.net/2020/03/03/technical-problems-COVID-primers.html">my previous blog
post</a>,
I have received numerous bits of advice on how to make my set of
COVID-19 candidate primers better. Today I am releasing my second
version of the candidate primers. Here is a summary of the updates:</p>

<ul>
  <li>
    <p>All data resources are being updated on a <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/">Git repository on
BitBucket</a>,
allowing easy sharing among bioinformaticists. Feel free to fork &amp;
send me pull requests to make this resource better.</p>
  </li>
  <li>
    <p><a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/downloads/new-COVID-primer-set-stats_v2.xlsx">Latest spreadsheet report (version 2)</a> with recommended candidate primers
highlighted in green (marginal primers highlighted in yellow).</p>
  </li>
  <li>
    <p>Added many additional columns to the report to help in candidate
primer probe set selection, including %GC of the oligos, the melting
temperature of the oligos, and amplicon %GC and length (relative to
the reference RefSeq COVID-19 complete genome). Column ‘U’ in
particular reports the difference between the probe melting
temperature, and the highest melting temperature of the forward and
reverse primers. You want this difference to be larger than 6
degrees Celsius, to allow for efficient probe hybridization. I sort
the report first on the F1-score (column ‘M’, descending) and then on the
probe-primers melting temperature difference (column ‘U’,
descending), to make it easy to find the best primer-probe sets.</p>
  </li>
  <li>
    <p>Documentation on <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/doc/column-descriptions.md">the meaning of the
columns</a>
in the spreadsheet report.</p>
  </li>
  <li>
    <p>Added the <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/metadata/p3-new-COVID-19-parameters.txt">Primer3 parameter file</a>, so that everyone can see the
choice of parameter settings used for generating these primers. Of
course there’s no one-size-fits-all, as different kits will be run
on different machines, and thus there will be a need to modify the
parameters to fit the given protocol. I welcome feedback on how I
can create additional parameter files for various protocols.</p>
  </li>
  <li>
    <p>A <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/src/master/kit-primers/all-COVID-19-kit-primers.fasta">FASTA file of COVID-19 primer-probe sets</a> that are currently
deployed in testing kits.</p>
  </li>
  <li>
    <p>Fixed the NCBI Blast database sequence identifier issue (see
original post)</p>
  </li>
  <li>
    <p>Improvements to the amplicon-calling code and other fixes to improve
performance</p>
  </li>
</ul>
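
<p>As a rough illustration of what column ‘U’ measures, here is a small sketch of the probe-primer melting temperature margin. The function names are hypothetical, and the example Tm values are the ones Primer3 reported for the CDC N1 set in the original post, used here only as sample inputs:</p>

```python
MIN_MARGIN_C = 6.0  # desired minimum probe-primer Tm difference, in degrees Celsius


def probe_primer_tm_margin(probe_tm, forward_tm, reverse_tm):
    """Probe Tm minus the higher of the two primer Tm values."""
    return probe_tm - max(forward_tm, reverse_tm)


def has_sufficient_margin(probe_tm, forward_tm, reverse_tm):
    """True when the probe melts more than MIN_MARGIN_C above both primers."""
    return probe_primer_tm_margin(probe_tm, forward_tm, reverse_tm) > MIN_MARGIN_C


if __name__ == "__main__":
    # CDC N1: probe 69.19, forward primer 58.32, reverse primer 62.50.
    margin = probe_primer_tm_margin(69.19, 58.32, 62.50)
    print(f"{margin:.2f}", has_sufficient_margin(69.19, 58.32, 62.50))
```

<p>Taking the maximum of the two primer temperatures makes the check conservative: the probe has to clear the hotter of the two primers, not just their average.</p>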

<p>I hope this is another step towards making this work more useful and
accessible. Please let me know if you have any technical questions or
bug reports (code bugs or documentation bugs) by creating
an issue in the <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/issues?status=new&amp;status=open">Issue
Tracker</a>. For
all else, you can find me on Twitter or email me (see bottom of page).</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In the wake of my previous blog post, I have received numerous bits of advice on how to make my set of COVID-19 candidate primers better. Today I am releasing my second version of the candidate primers. Here is a summary of the updates:]]></summary></entry><entry><title type="html">Technical Problems with Existing CDC COVID-19 Primers, and an Improved Set of Primers</title><link href="https://tomeraltman.net/2020/03/03/technical-problems-COVID-primers.html" rel="alternate" type="text/html" title="Technical Problems with Existing CDC COVID-19 Primers, and an Improved Set of Primers" /><published>2020-03-03T00:00:00-08:00</published><updated>2020-03-03T00:00:00-08:00</updated><id>https://tomeraltman.net/2020/03/03/technical-problems-COVID-primers</id><content type="html" xml:base="https://tomeraltman.net/2020/03/03/technical-problems-COVID-primers.html"><![CDATA[<h2 id="summary">Summary</h2>

<p>In this post I review technical problems with the CDC COVID-19 primers
and I describe how I generated a new set of primers and probes without
those problems.</p>

<p>Note that this was based on available outbreak whole-genome
sequence data obtained from the <a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&amp;PAGE_TYPE=BlastDocs&amp;DOC_TYPE=Download">NCBI Blast NT
database</a>,
downloaded from NCBI on 2020-03-02. I will try to re-run this pipeline
every few days to update the set of candidate primers in light of
newly-available sequence data. Please check these primers yourself
before using them as the basis of any diagnostic kit.</p>

<p>This blog post describes technical problems with some of the COVID-19
primer-probe sets that are being promoted by the CDC to diagnose cases
of COVID-19 infections. Several of these primers have dimerization and
hairpin loop issues, among others. Here, I describe a bioinformatic pipeline to design better
candidate primers that pass stringent design criteria. Using this
pipeline, I generated a new set of primers and probes without the
aforementioned technical issues of the CDC COVID-19 primer probe sets,
and tested their <em>in silico</em> precision and recall using the most
recent set of COVID-19 complete genomes from NCBI. The ten best primer-probe pairs have perfect
classification performance (recall, precision, and F1-score all
1.0). I provide these <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/downloads/new-COVID-primer-set-stats_v2.xlsx">candidate primers</a> for free, under the
<a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International (CC BY 4.0)
license</a>.</p>
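
<p>For reference, the precision, recall, and F1-score mentioned above follow the usual definitions. The sketch below is only illustrative: how true positives, false positives, and false negatives are counted per genome is an assumption here, not a description of the pipeline.</p>

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from raw classification counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Assumed scenario: a primer-probe set yields an amplicon for all 41 target
    # genomes and for no off-target sequence, giving perfect scores.
    print(precision_recall_f1(true_pos=41, false_pos=0, false_neg=0))
    # Assumed scenario: one target genome is missed, so recall drops below 1.
    print(precision_recall_f1(true_pos=40, false_pos=0, false_neg=1))
```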

<p>My hope is that this blog post will help start a conversation about the design
process for these critical diagnostic primers. Starting with
these primers, and with input from other scientists, we can develop a
set of candidate primers for COVID-19 that could be used by health
agencies and private diagnostic kit manufacturers alike to not only
generate more diagnostic kits, but <em>better</em> ones.</p>

<h2 id="primer-design-issues">Primer Design Issues</h2>

<p>As the COVID-19 pandemic spread from country to country over the past
month, I felt that I should try to help in some way. I had recently
been working on primer design as part of my consultancy, and I thought
that perhaps I could analyze the primers developed by the CDC and
others, in part to see how their primers stack up against the ones
that I had developed for a different project.</p>

<p>As others have noted, there are flaws in the <a href="https://www.cdc.gov/coronavirus/2019-ncov/lab/index.html">COVID-19 diagnostic
primers from the
CDC</a>. They
released four primer-probe sets: three that target the N gene, which
encodes the nucleocapsid phosphoprotein, and one that targets the
human RNase P gene (as part of laboratory controls).</p>

<p>Using <a href="https://primer3.org">Primer3</a>, a respected software program for
predicting performant primer probe sets using thermodynamic
calculations, I analyzed the three N gene primer probe sets. I found
the following problems with each set:</p>

<h3 id="n1">N1</h3>

<p>Here is primer probe set N1, shown in Boulder format (as used by
Primer3):</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEQUENCE_PRIMER=GACCCCAAAATCAGCGAAAT
SEQUENCE_INTERNAL_OLIGO=ACCCCGCATTACGTTTGGTGGACC
SEQUENCE_PRIMER_REVCOMP=TCTGGTTACTGCCAGTTGAATCTG
</code></pre></div></div>
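
<p>Boulder format is simply a series of <code>TAG=VALUE</code> lines, with each record terminated by a line containing only <code>=</code>. A minimal sketch of writing and reading such records is below; the helper names <code>to_boulder</code> and <code>parse_boulder</code> are hypothetical and are not part of Primer3 itself:</p>

```python
def to_boulder(tags):
    """Serialize a dict of Primer3 tags into one Boulder-format record."""
    lines = [f"{key}={value}" for key, value in tags.items()]
    lines.append("=")  # a lone "=" terminates the record
    return "\n".join(lines) + "\n"


def parse_boulder(text):
    """Parse Boulder-format text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line == "=":
            records.append(current)
            current = {}
        elif line:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line.partition("=")
            current[key] = value
    return records


if __name__ == "__main__":
    n1 = {
        "SEQUENCE_ID": "CDC-N1-COVID-19",
        "SEQUENCE_PRIMER": "GACCCCAAAATCAGCGAAAT",
        "SEQUENCE_INTERNAL_OLIGO": "ACCCCGCATTACGTTTGGTGGACC",
        "SEQUENCE_PRIMER_REVCOMP": "TCTGGTTACTGCCAGTTGAATCTG",
    }
    print(to_boulder(n1), end="")
```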

<p>Here is the formatted analysis summary from Primer3 for N1 (note that
in the terminology of Primer3, the forward primer is termed “Left”,
the reverse primer is termed “Right”, and the hybridization probe is
termed “Internal Oligo”):</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OLIGO            start  len      tm     gc%  any_th  3'_th hairpin seq
LEFT PRIMER      28286   20   58.32   45.00    0.00   0.00    0.00 GACCCCAAAATCAGCGAAAT
RIGHT PRIMER     28357   24   62.50   45.83    2.43   0.00   52.19 TCTGGTTACTGCCAGTTGAATCTG
INTERNAL OLIGO   28308   24   69.19   58.33   13.48   0.00   42.14 ACCCCGCATTACGTTTGGTGGACC
SEQUENCE SIZE: 29902
INCLUDED REGION SIZE: 29902

PRODUCT SIZE: 72, PAIR ANY_TH COMPL: 0.00, PAIR 3'_TH COMPL: 0.00
</code></pre></div></div>

<p>The reverse primer was rejected by Primer3 due to a predicted hairpin
loop. These loop formations can cause problems during PCR, leading to
lower amplification efficiency. The Primer3 output for the hairpin is
as follows:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEQUENCE_ID=CDC-N1-COVID-19
Reverse primer:
Tm: 52.2°C  dG: -1596 cal/mol  dH: -34200 cal/mol  dS: -105 cal/mol*K
        5' TCTGGTTA*
            ||||   |
3' GTCTAAGTTGACCGTC*
</code></pre></div></div>

<p>Primer3 also found further problems with the primer probe set,
though with melting temperatures below its default cutoff. Namely,
the reverse primer and the hybridization probe can each form
self-dimers, and the hybridization probe itself can form a hairpin
loop:</p>

<p>Reverse primer self-dimer:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tm: 2.4°C  dG: -5495 cal/mol  dH: -45600 cal/mol  dS: -129 cal/mol*K
 5' TCTGGTTACTGCCAGTTGAATCTG 3'
           |||| ||||
3' GTCTAAGTTGACCGTCATTGGTCT 5'
</code></pre></div></div>

<p>Hybridization probe self-dimer:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tm: 13.5°C  dG: -4045 cal/mol  dH: -87400 cal/mol  dS: -269 cal/mol*K
5' ACCCCGCATTACGTTTGGTGGACC 3'
      || |   ||||   | ||
3' CCAGGTGGTTTGCATTACGCCCCA 5'
</code></pre></div></div>

<p>Hybridization probe 3’ hairpin loop:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tm: 42.1°C  dG: -274 cal/mol  dH: -16800 cal/mol  dS: -53 cal/mol*K
       5' ACCCCGCA*
              ||  T
3' CCAGGTGGTTTGCAT*
</code></pre></div></div>

<h3 id="n2">N2</h3>

<p>Here is set N2:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEQUENCE_PRIMER=TTACAAACATTGGCCGCAAA
SEQUENCE_INTERNAL_OLIGO=ACAATTTGCCCCCAGCGCTTCAG
SEQUENCE_PRIMER_REVCOMP=GCGCGACATTCCGAAGAA
</code></pre></div></div>

<p>Just from inspection, you can see that there is an issue with a poly-X
run of five ‘C’s in the hybridization probe. Poly-X runs of five bases
or longer are known to cause problems with non-specific priming
(possibly leading to false positive readings), and are normally avoided.</p>
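
<p>This kind of poly-X check is easy to automate. A minimal sketch, with hypothetical function names, that finds the longest single-base run in an oligo and flags runs longer than four bases:</p>

```python
import re


def longest_homopolymer(oligo):
    """Return (base, length) of the longest run of one repeated base."""
    runs = re.finditer(r"(.)\1*", oligo.upper())
    best = max(runs, key=lambda match: len(match.group(0)), default=None)
    if best is None:
        return ("", 0)
    return (best.group(1), len(best.group(0)))


def has_poly_x_run(oligo, max_run=4):
    """True when the oligo contains a single-base run longer than `max_run`."""
    return longest_homopolymer(oligo)[1] > max_run


if __name__ == "__main__":
    n2_probe = "ACAATTTGCCCCCAGCGCTTCAG"  # CDC N2 hybridization probe
    print(longest_homopolymer(n2_probe), has_poly_x_run(n2_probe))
```

<p>When several runs tie for the longest, this sketch reports the first one it encounters, which is enough for a pass/fail screen.</p>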

<p>Here is the formatted analysis summary from Primer3 for N2:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OLIGO            start  len      tm     gc%  any_th  3'_th hairpin seq
LEFT PRIMER      29163   20   58.75   40.00    0.00   0.00   34.76 TTACAAACATTGGCCGCAAA
RIGHT PRIMER     29229   18   60.10   55.56   12.76   0.00   36.75 GCGCGACATTCCGAAGAA
INTERNAL OLIGO   29187   23   68.15   56.52    8.63   0.00   49.09 ACAATTTGCCCCCAGCGCTTCAG
SEQUENCE SIZE: 29902
INCLUDED REGION SIZE: 29902

PRODUCT SIZE: 67, PAIR ANY_TH COMPL: 0.00, PAIR 3'_TH COMPL: 0.00
</code></pre></div></div>

<p>Primer3 found self-dimer and hairpin loop issues with all of the
oligos in the N2 primer-probe set, though at lower temperatures.</p>

<h3 id="n3">N3</h3>

<p>Here is set N3:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEQUENCE_PRIMER=GGGAGCCTTGAATACACCAAAA
SEQUENCE_INTERNAL_OLIGO=ATCACATTGGCACCCGCAATCCTG
SEQUENCE_PRIMER_REVCOMP=TGTAGCACGATTGCAGCATTG
</code></pre></div></div>

<p>The forward primer has only one G/C base in its last five positions: a C at the 5’ end of that window, followed by a run of four A’s at the 3’ end. While runs of up to four identical bases are normally tolerated, it’s worrisome to place one at the 3’ end. The reverse primer also has a predicted hairpin loop:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEQUENCE_ID=CDC-N3-COVID-19
Reverse primer:
Tm: 53.8°C  dG: -1210 cal/mol  dH: -23500 cal/mol  dS: -72 cal/mol*K
   5' TGTAGCACG*
          |||  |
3' GTTACGACGTTA*
</code></pre></div></div>

<p>Here is the formatted analysis summary from Primer3 for N3:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OLIGO            start  len      tm     gc%  any_th  3'_th hairpin seq
LEFT PRIMER      28680   22   60.53   45.45    0.00   0.00    0.00 GGGAGCCTTGAATACACCAAAA
RIGHT PRIMER     28751   21   61.50   47.62    0.00   0.00   53.84 TGTAGCACGATTGCAGCATTG
INTERNAL OLIGO   28703   24   67.77   54.17    0.00   0.00   41.98 ATCACATTGGCACCCGCAATCCTG
SEQUENCE SIZE: 29902
INCLUDED REGION SIZE: 29902

PRODUCT SIZE: 72, PAIR ANY_TH COMPL: 0.00, PAIR 3'_TH COMPL: 0.00
</code></pre></div></div>

<h2 id="developing-new-candidate-primers">Developing New Candidate Primers</h2>

<h1 id="primer-design-criteria">Primer Design Criteria</h1>

<p>In order to design better-performing COVID-19 primers, I developed a
bioinformatic pipeline that generates primers using the following
criteria:</p>

<ul>
  <li>Target regions conserved perfectly among all COVID complete genomes
(41 downloaded on 2020-03-02; check <a href="https://www.ncbi.nlm.nih.gov/nuccore/?term=txid2697049%5BOrganism%3Anoexp%5D+%22complete+genome%22">NCBI Nucleotide</a> for the current list)</li>
  <li>Poly-X runs longer than four bases are not allowed</li>
  <li>Check that hybridization probes do not start with G (avoid quenching)</li>
  <li>Check for GC clamp on the forward and reverse primers (3’ end has
one or two G’s or C’s in last five base pairs)</li>
  <li>GC% range between 40 and 60, with an optimum at 50</li>
  <li>Primer and probe size range between 18 and 27 bases</li>
  <li>Amplicon size from 75 to 200 bases</li>
  <li>Disallow any hairpins, self-dimers, or oligo-interactions at ANY temperature</li>
  <li>Exclude regions that are identical to <a href="https://www.cdc.gov/coronavirus/general-information.html">the four known common cold-causing,
human-associated coronaviruses (229E, NL63, OC43, and HKU1)</a></li>
  <li>Primers designed to cover the viral RNA polymerase gene, as it tends
to be highly conserved within an RNA viral species, and different
from the RNA polymerase sequence of different RNA viral species
(as per Tom Slezak, former head of biodefense program at
 LLNL, private communication)</li>
</ul>
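
<p>Several of the sequence-level criteria above are simple enough to express directly. The sketch below is an illustration under stated assumptions, not the pipeline’s actual code: all function names are hypothetical, the thresholds are copied from the list, and the thermodynamic checks (hairpins, dimers), the conservation check, and the cross-reactivity check are deliberately left out.</p>

```python
def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def has_gc_clamp(primer, window=5):
    """True if the last `window` bases (3' end) contain one or two G/C bases."""
    tail = primer.upper()[-window:]
    return (tail.count("G") + tail.count("C")) in (1, 2)


def has_long_poly_x(seq, max_run=4):
    """True if any single base repeats more than max_run times in a row."""
    seq = seq.upper()
    return any(base * (max_run + 1) in seq for base in "ACGT")


def primer_problems(primer):
    """List the criteria a forward or reverse primer fails (empty means pass)."""
    problems = []
    if len(primer) not in range(18, 28):
        problems.append("length outside 18-27")
    gc = gc_percent(primer)
    if gc > 60.0 or 40.0 > gc:
        problems.append("GC% outside 40-60")
    if has_long_poly_x(primer):
        problems.append("poly-X run longer than four")
    if not has_gc_clamp(primer):
        problems.append("no GC clamp")
    return problems


def probe_problems(probe):
    """Same checks for a hybridization probe: no clamp needed, no leading G."""
    problems = [p for p in primer_problems(probe) if p != "no GC clamp"]
    if probe.upper().startswith("G"):
        problems.append("starts with G")
    return problems


# CDC N3 forward primer and CDC N2 probe, as listed earlier in this post.
print(primer_problems("GGGAGCCTTGAATACACCAAAA"))
print(probe_problems("ACAATTTGCCCCCAGCGCTTCAG"))
```

<p>Note that the N3 forward primer passes these simple filters even though its 3’ end is a poly-A run of four, which is why a rule-based screen like this is only a first pass before thermodynamic analysis.</p>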

<h1 id="the-new-candidate-primers-files">The New Candidate Primers Files</h1>

<p>The set of candidate primers (last generated on 2020-03-11) can be downloaded
<a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/downloads/new-COVID-primer-set-stats_v2.xlsx">here</a> in XLSX spreadsheet
format, and <a href="https://bitbucket.org/tomeraltman/covid-19-primer-data/downloads/new-COVID-primer-set-stats_v2.csv">here</a> as a
comma-separated value (“CSV”) file. These are released under the
<a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International (CC BY
4.0)</a> license. I recommend
focusing on the top ten primer-probe sets, as they have the best
performance (based on the F1-score).</p>

<p>Below find selected details about how the primers were designed and
validated using an ‘e-PCR’ approach.</p>

<h1 id="primer-performance-validation-in-silico">Primer Performance Validation <em>in silico</em></h1>

<p>Even if you design your primers using recommended settings with
trusted software like Primer3, there’s still the chance that you’ll
have off-target binding of the oligos to other stretches of DNA in
your sample, or even off-target amplicons if a forward and reverse
primer bind close enough to one another in the correct
orientation. For that reason, I analyzed the <em>in silico</em> performance
of the newly designed primers to determine their predicted recall and
precision. To do so, I performed a sequence homology search using NCBI
Blast+ (version 2.10.0) against the NT database (downloaded
2020-03-02). Complete genomes (whether prokaryotic or viral) where the
forward and reverse primers, and the hybridization probe, all had
gapless alignments with 90% or greater sequence identity, and in the
proper orientation, were counted as a hit. I used all complete viral
genomes in NT with NCBI Taxonomy Database identifier 2697049 (the
identifier for the COVID-19 sequences) as the “gold standard” for
assessing the performance of these primer probe sets as diagnostics
for COVID-19 presence (i.e., these are the sequences that the primers
should hit without fail).</p>
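
<p>The hit rule described above can be sketched as follows. This is a simplified illustration, not the actual pipeline: the <code>Alignment</code> record and its field names are hypothetical, the alignments are assumed to have already been parsed from the Blast output, coordinates are assumed to be normalized so that start is less than end, and “proper orientation” is interpreted here as forward primer on the plus strand, probe downstream of it, and reverse primer on the minus strand downstream of the probe. An amplicon on the opposite strand would need the mirrored test.</p>

```python
from collections import namedtuple

# Hypothetical record for one oligo-to-genome alignment.
Alignment = namedtuple("Alignment", "oligo genome pident gaps strand start end")


def hit_genomes(alignments, min_identity=90.0):
    """Return the genomes where forward primer, probe, and reverse primer all
    align gaplessly at or above min_identity, in the assumed orientation."""
    by_genome = {}
    for aln in alignments:
        # Keep only gapless alignments that meet the identity threshold.
        if aln.gaps == 0 and aln.pident >= min_identity:
            by_genome.setdefault(aln.genome, {}).setdefault(aln.oligo, []).append(aln)

    hits = set()
    for genome, oligos in by_genome.items():
        forwards = oligos.get("forward", [])
        probes = oligos.get("probe", [])
        reverses = oligos.get("reverse", [])
        # A genome is a hit if some forward/probe/reverse triple is ordered
        # forward, then probe, then reverse, with primers on opposite strands.
        if any(
            f.strand == "+" and r.strand == "-"
            and p.start > f.end and r.start > p.end
            for f in forwards for p in probes for r in reverses
        ):
            hits.add(genome)
    return hits


# Toy data: genomeA satisfies the rule, genomeB has a gapped probe alignment.
toy = [
    Alignment("forward", "genomeA", 100.0, 0, "+", 100, 119),
    Alignment("probe", "genomeA", 95.0, 0, "+", 125, 147),
    Alignment("reverse", "genomeA", 100.0, 0, "-", 150, 167),
    Alignment("forward", "genomeB", 100.0, 0, "+", 100, 119),
    Alignment("probe", "genomeB", 95.0, 1, "+", 125, 147),
    Alignment("reverse", "genomeB", 100.0, 0, "-", 150, 167),
]
print(hit_genomes(toy))
```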

<p>The spreadsheet shows that many of the newly designed primers have five false negatives, but
this is actually a bug in my current pipeline. It’s due to an obscure
detail about how NCBI represents identical genomes in its Blast
databases. Those genomes are all identified, just under a different
identifier. So the top ten primer-probe pairs actually have perfect
performance: precision, recall, and F1-score all 1.0.</p>
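
<p>To make the scoring concrete, here is a minimal sketch of how precision, recall, and F1-score follow from the set of hit genomes and the gold-standard set, and how missing identifiers depress recall. The <code>aliases</code> mapping (reported identifier to the identifiers of genomes with identical sequence) is a hypothetical stand-in for however identical entries are actually grouped; the genome names are toy values.</p>

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 from predicted and gold-standard ID sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted.intersection(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def expand_aliases(hit_ids, aliases):
    """Add every identifier that shares a sequence with a reported hit.

    `aliases` is a hypothetical dict: reported ID -> IDs of identical genomes.
    """
    expanded = set(hit_ids)
    for hit in hit_ids:
        expanded.update(aliases.get(hit, ()))
    return expanded


gold = {"g1", "g2", "g3"}
hits = {"g1", "g2"}          # g3 is identical to g2 but stored under g2's entry
aliases = {"g2": {"g3"}}

print(precision_recall_f1(hits, gold))                           # recall is 2/3
print(precision_recall_f1(expand_aliases(hits, aliases), gold))  # all 1.0
```

<p>In the toy example the uncorrected recall is 2/3 even though every gold-standard sequence was found, which is the same kind of apparent false negative described above.</p>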

<h2 id="next-steps">Next Steps</h2>

<p>Thank you for reading this far. My hope is that I can start up a
conversation among scientists about how to design and validate
candidate primer probes <em>in silico</em>, to allow for faster and more
efficient wet-bench validation of the candidate primers. I hope this
will lead to better, more accurate COVID-19 diagnostic kits being
designed and successfully deployed around the country.</p>

<p>Please feel free to follow up with me via email
(blog-at-me.tomeraltman.net) or Twitter
(<a href="https://www.twitter.com/tomeraltman">@tomeraltman</a>). I look forward
to constructive feedback about how to improve this analysis. I am
working on releasing the source code for the bioinformatic pipeline,
and writing up the methodology in more detail.  Thanks again for your
time, and thanks in advance for any help that you might provide.</p>

<p>TODOs:</p>
<ul>
  <li>Perform same analysis on WHO recommended COVID-19 primers</li>
  <li>Release software pipeline code</li>
  <li>Complete secondary structure code blocks for low-temperature entries</li>
  <li>Fix false negative reporting bug due to NCBI Blast databases
<strong>[FIXED: see update post]</strong></li>
  <li>Prepare technical manuscript detailing the methodology</li>
</ul>

<h2 id="acknowledgments">Acknowledgments</h2>

<p>My deepest gratitude to the following individuals for helping me by
reviewing this post:</p>

<ul>
  <li><a href="https://www.kpathsci.com/company">Tom Slezak</a></li>
  <li><a href="https://microbiomedigest.com/sample-page/in-the-news/">Dr. Elisabeth Bik</a></li>
  <li><a href="https://www.linkedin.com/in/michael-walker-90aa2452/">Dr. Michael Walker</a></li>
  <li><a href="https://www.linkedin.com/in/david-dill-06b1524">Dr. David L. Dill</a></li>
</ul>

<h2 id="updates">Updates</h2>

<ul>
  <li>
    <p><strong>[2020-03-06]</strong> Incorporated suggestions and fixes from Nathan Walsh
and Nora Callahan. Thanks!</p>
  </li>
  <li>
    <p><strong>[2020-03-11]</strong> Released latest version of COVID-19 candidate
  primers. Updated file links. See <a href="https://tomeraltman.net/2020/03/11/COVID-19-candidate-primers-update.html">this update post</a> for more
  information.</p>
  </li>
  <li>
    <p><strong>[2020-03-12]</strong> The formatted analysis summary for N3 was
  accidentally copied from N2. Fixed.</p>
  </li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Summary]]></summary></entry><entry><title type="html">Hello, World!</title><link href="https://tomeraltman.net/meta/2016/03/15/hello.html" rel="alternate" type="text/html" title="Hello, World!" /><published>2016-03-15T22:26:35-07:00</published><updated>2016-03-15T22:26:35-07:00</updated><id>https://tomeraltman.net/meta/2016/03/15/hello</id><content type="html" xml:base="https://tomeraltman.net/meta/2016/03/15/hello.html"><![CDATA[<h1 id="welcome">Welcome!</h1>

<p>This is my <em>personal</em> blog, where I hope to share ideas and
hacks that others might find helpful. You can find me around the web
on GitHub, BitBucket, and LinkedIn. Thanks for stopping by!</p>]]></content><author><name></name></author><category term="meta" /><summary type="html"><![CDATA[Welcome!]]></summary></entry></feed>