Biowulf High Performance Computing at the NIH
Guide to DNAWorks version 3.2

Input

Job Name

Enter a job name. Make sure the name is a single string only using characters [a-z], [A-Z], [0-9], and '_' (underscore). The name is useful for keeping track of old runs if you want to go back to them.
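
The naming rule can be checked before a job is submitted. The snippet below is only a sketch: the helper name is_valid_job_name is hypothetical and is not part of DNAWorks.

```python
import re

# Hypothetical helper, not part of DNAWorks.
# Allowed characters are [a-z], [A-Z], [0-9], and '_' (underscore).
JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_job_name(name: str) -> bool:
    """Return True if the name is one non-empty string of allowed characters."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None
```

A name such as lysozyme_run_2 passes, while a name containing a space or a hyphen does not.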

Email Address

An email address is a good idea if your job will take a long time or if you don't care to watch the slow progress of the output page. It is optional, and is not stored or sold to evil third parties...

Mutant Run

After spending so much time getting a synthetic gene put together, wouldn't it be simple to make 1-3 new oligos for each site-directed mutation and be sure the new oligos will not create problems in the PCR? Clicking on 'Enable mutant run' will eliminate all the parameters from the input page, as these will be taken from the original logfile.

Enter a job name, email address (optional), the mutated sequence(s) (make sure they are the same length(s) as the original sequence(s)), the original logfile and trial number (used for original gene synthesis). The parameters will be set to the same as that of the trial number from the original logfile. Once everything is entered, clicking "Design Oligos" will generate the replacement oligos, along with an evaluation of scores for the mutated sequence. The mutation is printed in lowercase font, and it is highlighted in the oligonucleotide assembly.

When creating mutants, look through the new logfile and make sure there are no radical changes in the scores and, most importantly, the Tm histogram. And always make sure that the mutation you designed is what you expected!

If a protein or nucleotide sequence has never been synthesized, don't worry about the Mutant Run mode.

Codon Frequency Table

In order to reverse translate any protein sequences into nucleotide sequences, a codon table is required. Further, in order to make sure that the highest frequency codons are used and/or the lowest frequency codons are avoided in translating protein sequences, you will need to include frequencies for these codons.

The Codon Frequency Table can be input in three distinct ways:

Choose a standard organism

Several commonly used organisms are already entered in a list box for simple access to the program. The codon frequencies for these organisms are based on the number of times each codon is found in protein coding regions of the respective organism's genome. In the case of E. coli, however, there are frequencies based on all genes (standard), and frequencies based on genes that are expressed at high levels during exponential growth (class II), as determined by factorial correspondence analysis (Medigue et al., 1991).

Enter codon frequencies manually

DNAWorks requires the GCG format of codon frequencies. The codon frequency table should be input as five columns as shown below. The data represent the residue in three letter code, the codon triplet (in DNA, not RNA), the number of codons in the dataset, frequency per thousand, and the fraction used. It is not necessary to align the column fields, as long as at least a single space separates the fields.

   Gly   GGG   40359.00   11.39   0.16
   Gly   GGA   34894.00    9.85   0.13
   Gly   GGT   89915.00   25.37   0.35
   Gly   GGC   94608.00   26.70   0.36
   Glu   GAG   66665.00   18.81   0.33
   Glu   GAA  137748.00   38.87   0.67
   ...

Make sure the three-letter code is correct as well, otherwise the program may not interpret the sequence correctly:

   Ala = alanine        Ile = isoleucine     Arg = arginine
   Cys = cysteine       Lys = lysine         Ser = serine
   Asp = aspartate      Leu = leucine        Thr = threonine
   Glu = glutamate      Met = methionine     Val = valine
   Phe = phenylalanine  Asn = asparagine     Trp = tryptophan 
   Gly = glycine        Pro = proline        End = stop 
   His = histidine      Gln = glutamine      Tyr = tyrosine

The program requires 64 entries in the table, one for every combination of A, G, C, and T in a codon triplet. Tables with fewer than 64 codons (or none at all) will cause the program to fail, and tables with more than 64 will be rejected. The order of codons in the table is not important.
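
A table can be checked against these rules before it is pasted in or uploaded. The following is a minimal sketch, not the parser DNAWorks itself uses; the function names parse_codon_table and missing_codons are hypothetical, and rows that do not have exactly five fields (blank lines, for example) are simply skipped.

```python
VALID_RESIDUES = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val", "End",
}


def parse_codon_table(text):
    """Parse rows of: residue, codon, count, per-thousand, fraction."""
    table = {}
    for line in text.splitlines():
        fields = line.split()  # any amount of whitespace separates the fields
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a five-column row
        residue, codon, count, per_thousand, fraction = fields
        codon = codon.upper()
        if residue not in VALID_RESIDUES:
            raise ValueError(f"unknown three-letter code: {residue}")
        if len(codon) != 3 or set(codon) - set("ACGT"):
            raise ValueError(f"codon must be a DNA triplet: {codon}")
        if codon in table:
            raise ValueError(f"duplicate codon: {codon}")
        table[codon] = (residue, float(count), float(per_thousand), float(fraction))
    return table


def missing_codons(table):
    """Return the triplets still missing from the full set of 64."""
    all_codons = {a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"}
    return sorted(all_codons - set(table))
```

A table is complete when missing_codons returns an empty list. Because duplicates are rejected while parsing, a complete table in this sketch has exactly 64 entries.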

Upload Codon Frequency Table File

A file containing codon frequencies can be uploaded rather than manually entered. All format rules for manually entered tables (above) also apply for uploaded files.

Parameters

Annealing Temperature

The annealing temperature parameter sets an ideal annealing temperature for a set of synthetic oligonucleotides. At this temperature, under normal PCR conditions (ionic strength ~100 mM, [Mg2+] = 1-4 mM), all of the oligonucleotides will anneal and assemble cooperatively. The uniformity of annealing temperatures prevents mispriming and/or lack of priming prior to the elongation step, and helps to assure a single uniform PCR product.

As explained under Oligo Length, the annealing temperature and oligonucleotide length are directly correlated. DNAWorks will, however, favor the annealing temperature above the oligo length. Thus, the range of annealing temperatures in a set of synthetic oligonucleotides will always remain much smaller than the range in oligonucleotide lengths, and while oligonucleotide lengths may exceed the input maximum oligo length parameter, the annealing temperatures will always be kept close to the input value.

A range of annealing temperatures can be sampled by inserting a second temperature in the 'to - °C' box.

The range of annealing temperatures accepted is between 58 and 70°C.

Oligo Length

The oligo length parameter provides a limit to the length in nucleotides any one of a set of synthetic oligonucleotides can attain. The synthesis of oligonucleotides is subject to errors, mainly deletions, but occasionally mismatches and insertions. The error rate of oligonucleotide synthesis is primarily dependent on length; longer oligonucleotides tend to have more errors (although operator methodology can play a strong role as well -- DNAWorks cannot address technical sloppiness!). To minimize the number of errors in synthetic genes, it is best to keep the oligonucleotide lengths to a minimum.

The oligo length value is directly correlated to the annealing temperature; higher annealing temperatures will result in longer oligonucleotides. Also, to maintain high affinity between oligonucleotides, the oligonucleotides must be long enough to provide decent overlap. Thus, although most program executions with reasonable parameters will result in a set of oligonucleotides whose lengths match or are below this value, the attainment of the desired length is not guaranteed.

A range of oligo lengths can be sampled by inserting a second length in the 'to - nt' box.

The range of oligo lengths accepted is between 30 and 999 nt.

Oligos are permitted to have gaps between the overlap regions.

Clicking the 'random' checkbox will allow the oligos to vary in length between the minimal necessary and the length chosen.

Codon Frequency Threshold

The level of protein expression depends to some degree on the availability of tRNAs to the growing polypeptide chain on the ribosome. Codons used infrequently often have low levels of their cognate tRNAs. This phenomenon of "codon bias" has been shown to be the case for Escherichia coli expression. Thus, by using only the most frequent codons, the availability of tRNAs is much less likely to limit protein expression levels.

Further, the high G+C content of the genomes of certain organisms creates problems in cloning genes from these organisms. By optimizing codon bias using equally mixed A+T/G+C codons, many of the problems involved in cloning can be avoided.

The codon frequency threshold parameter sets a cutoff for which codons to be used for reverse translation of protein sequences into DNA. For example, a value of 20 will allow only those codons whose frequencies equal or exceed 20% to be used in reverse translation and optimization.
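
The cutoff can be pictured as a simple filter. The sketch below assumes the threshold is compared against the "fraction used" column of the codon table, expressed as a percentage; that reading, and the function name codons_above_threshold, are assumptions made for illustration.

```python
def codons_above_threshold(fractions, threshold_percent):
    """Keep, for each residue, the codons whose usage meets the cutoff.

    fractions maps residue -> {codon: fraction used (0 to 1)}.
    """
    allowed = {}
    for residue, codons in fractions.items():
        allowed[residue] = sorted(
            codon
            for codon, fraction in codons.items()
            # Rounding guards against floating-point noise at the boundary.
            if round(fraction * 100, 6) >= threshold_percent
        )
    return allowed


# Fractions taken from the example codon table in this guide.
example = {
    "Gly": {"GGG": 0.16, "GGA": 0.13, "GGT": 0.35, "GGC": 0.36},
    "Glu": {"GAG": 0.33, "GAA": 0.67},
}
```

With a value of 20, glycine keeps GGC and GGT and glutamate keeps both of its codons. With a value of 50, glycine has no codon left in this sketch, which shows why a high threshold leaves little room for optimization.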

Random / Strict / Scored

By default, DNAWorks uses the highest frequency codons for the initial reverse translation of the protein sequence into a nucleotide sequence. This typically leads to a faster convergence. Checking the 'Random' box will cause the program to choose a random codon for each protein residue from those available.

By default, DNAWorks uses the two highest frequency codons for optimization. To override this default, checking the 'Strict' box will force the program to strictly use only those codons that are within the chosen Codon Frequency Threshold. Be careful, because setting a high Codon Frequency Threshold (>20%) and Strict will result in many protein residues with a single codon available, and thus very little room for optimization.

To accelerate convergence, DNAWorks does not continuously score codon frequency. This is allowed because only the highest frequency codons are usually used. However, for the particularly picky user, checking the 'Scored' box will force the program to continuously evaluate the codon frequency score. This will have the effect of increasing the overall frequency of codons (at the cost of other scores...).

Oligonucleotide / Monovalent Cation / Magnesium Concentration

The concentration of oligonucleotides, monovalent cations (Na+, K+), and magnesium in the PCR reaction can have profound effects on the annealing temperatures of the oligonucleotides. The user can enter the desired concentrations for the PCR reaction.

The effects of these components on the annealing temperature are based on the program HyTher (Nicolas Peyret, Pirro Saro and John SantaLucia, Jr.).

Values are in moles per liter, and can be entered in scientific notation for simplicity.

Oligonucleotides must be between 100 µM (1E-4 M) and 1 nM (1E-9 M), monovalent cations must be between 10 and 1000 mM, and magnesium must be between 0 and 200 mM.
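
Since the values are entered in moles per liter, it is easy to slip by a factor of a thousand. The sketch below converts the limits above to molar units and reports which inputs fall outside them; the names LIMITS and out_of_range are hypothetical.

```python
# Accepted ranges from this guide, converted to moles per liter.
LIMITS = {
    "oligo": (1e-9, 1e-4),           # 1 nM to 100 uM
    "monovalent": (10e-3, 1000e-3),  # 10 mM to 1000 mM
    "magnesium": (0.0, 200e-3),      # 0 mM to 200 mM
}


def out_of_range(oligo, monovalent, magnesium):
    """Return the names of the concentrations (in M) outside the accepted range."""
    values = {"oligo": oligo, "monovalent": monovalent, "magnesium": magnesium}
    problems = []
    for name, value in values.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            problems.append(name)
    return problems
```

For example, out_of_range(float("2E-7"), 0.05, 0.002) corresponds to 200 nM oligonucleotides, 50 mM monovalent cations, and 2 mM magnesium, all of which are within range.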

Number Of Solutions

DNAWorks uses a random number generator and simulated annealing to optimize multiple parameters simultaneously during execution. Thus, multiple runs will generate solutions of varying levels of success. While smaller genes (proteins of less than 100 amino acids) may not require more than one run to generate a satisfactory set of oligonucleotides, longer genes benefit from multiple runs. The number of solutions parameter will set the number of oligonucleotide sets generated during execution.

The maximum number of runs per job is 999.

Thermodynamically Balanced Inside-Out Mode Output

The method of gene synthesis employed by DNAWorks is termed "thermodynamically balanced", in that all the oligonucleotides should assemble and anneal at the same temperature. The amplification occurs everywhere at once, and ideally can generate the gene with just one round of PCR. However, there are sticky cases where the gene does not amplify, and constructing the gene in pieces is not successful.

A more controlled method of gene synthesis, termed "thermodynamically balanced inside-out", was developed for cases where problems occurred during PCR synthesis (Gao, et al., 2003). In an assembly set of oligonucleotides, the first half of the oligos are all synthesized in the sense orientation, and the other half are synthesized as reverse complements in the anti-sense orientation of the gene. The gene assembly and amplification is thus done in steps of 0.4-0.6 kb from the center pair of oligonucleotides outward.

Checking the 'TBIO' box will enable thermodynamically balanced inside-out output.
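
The orientation rule can be illustrated with a short sketch. It assumes the oligo boundaries are already known as (start, end) positions along the sense strand, and that with an odd number of oligos the extra one falls in the anti-sense half; both points, and the function name tbio_oligos, are assumptions rather than a description of the DNAWorks code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tbio_oligos(gene, segments):
    """Write the first half of the oligos as sense, the rest as anti-sense.

    segments is a list of (start, end) pairs, 0-based and half-open,
    ordered from the 5' end to the 3' end of the gene.
    """
    half = len(segments) // 2
    oligos = []
    for index, (start, end) in enumerate(segments):
        piece = gene[start:end].upper()
        oligos.append(piece if index < half else reverse_complement(piece))
    return oligos
```

Every oligo is still reported 5' -> 3'; only the strand it is taken from changes at the midpoint of the set.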

No gaps in assembly

By default, DNAWorks will try to keep all oligos the same size as the chosen length. If the size is beyond the sizes required for the chosen Tm, gaps are introduced between overlap regions. Clicking on the 'no gaps in assembly' checkbox will keep oligos as short as possible, with no gaps between the overlap regions. Restricting oligos to no gaps may slow down the optimization somewhat, and may result in higher scores due to a higher probability of misprimes.

Advanced Features

Restriction Site / Custom Site Screen

Restriction sites can be excluded from the protein coding region of the synthetic gene. Further, custom sequences can be excluded from the protein coding region as well. The sites can be represented in degenerate code to allow for multiple specificity restriction endonucleases:

   K = G or T         M = A or C
   R = A or G         Y = C or T
   W = A or T         S = C or G
   B = C or G or T    V = A or C or G
   D = A or G or T    H = A or C or T
        N = A or C or G or T

There are 180 restriction sites available in list box format, partitioned into non-degenerate and degenerate sequences. Multiple restriction sites can be entered. The restriction sites are limited to 5 nucleotides or longer, and all are available from New England Biolabs.

The custom site screen allows unlisted restriction sites and novel sequences to be excluded from the protein coding region. The format for each site is a name for the site followed by the sequence:

   SiteName1 ATGCAT
   SiteName2 CCANNBNNGGT
   ...
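
To see what a degenerate site will match, the code table above can be turned into a regular expression. This is a minimal sketch for checking a finished sequence by hand. It is not how DNAWorks screens sites internally, and scanning the reverse complement as well is an assumption made here because restriction sites are double-stranded. The function names are hypothetical.

```python
import re

# Degenerate code from the table above, one character class per letter.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "M": "AC", "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "B": "CGT", "V": "ACG", "D": "AGT", "H": "ACT", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTKMRYWSBVDHN", "TGCAMKYRWSVBHDN")


def site_to_regex(site):
    """Translate a degenerate site into a regular expression."""
    return "".join(f"[{IUPAC[base]}]" for base in site.upper())


def reverse_complement(site):
    """Reverse complement that also handles the degenerate letters."""
    return site.upper().translate(COMPLEMENT)[::-1]


def find_sites(sequence, sites):
    """Return (name, 0-based start, strand) for every match on either strand."""
    sequence = sequence.upper()
    hits = []
    for name, site in sites.items():
        patterns = {"+": site_to_regex(site)}
        rc = reverse_complement(site)
        if rc != site.upper():  # palindromic sites need only one scan
            patterns["-"] = site_to_regex(rc)
        for strand, pattern in patterns.items():
            # The lookahead reports overlapping occurrences as well.
            for match in re.finditer(f"(?={pattern})", sequence):
                hits.append((name, match.start(), strand))
    return hits
```

Called with {"SiteName2": "CCANNBNNGGT"} and the sequence TTCCAAACAAGGTTT, the function reports one hit on the sense strand starting at position 2 (counting from zero).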

Weights

DNAWorks optimizes a synthetic gene by evaluating the scores of a set of features: annealing temperature (T), codon frequency (C), repeat (R), misprime potential (M), GC- (G) and AT- (A) content, length (L), and pattern constraining (P). The default weights of each individual feature score are set to 1. By increasing the weight of an individual feature, the final output can be nudged toward favoring one feature over the others. For example, in the case where the potential synthetic genes for a set of sequences chronically suffer from a high number of repeats, increasing the weight of the repeat score (RWT) might decrease the final repeat score at the expense of the other feature scores.

Beware, as modulating the weights is not fully tested. Remember that this merely skews the results toward one feature or another, and may do more harm than good. In most cases keeping the weights balanced is the best approach.
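
The effect of a weight can be pictured as a weighted sum of the feature scores. The sketch below assumes each weight simply multiplies its feature score before the scores are added; the exact way DNAWorks combines them is not stated in this guide, and the function name weighted_total is hypothetical.

```python
# One-letter feature codes as listed above.
FEATURES = ("T", "C", "R", "M", "G", "A", "L", "P")


def weighted_total(scores, weights=None):
    """Sum the feature scores, each multiplied by its weight (default 1)."""
    weights = weights or {}
    return sum(weights.get(f, 1.0) * scores.get(f, 0.0) for f in FEATURES)
```

With feature scores of T = 2, R = 3, and M = 1, the balanced total is 6. Raising the repeat weight to 10 gives 33, so any remaining repeat dominates the total and the optimization is pushed to remove repeats first.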

Sequence(s)

DNAWorks can generate synthetic genes for both protein and nucleotide sequences in any mixture. Nucleotide sequences, however, can not be silently mutated, and thus the only options available for optimization are overlap positions. DNAWorks has no way of discriminating protein from nucleotide sequence, so a type must be chosen from the text box.

Sequences can be entered as text, or uploaded as files. Sequences can be entered as any of the following formats:

Raw           Plain         EMBL          Swiss-Prot    GenBank
PIR           ASN.1         FASTA         FASTA-old     NBRF
NBRF-old      IG/Stanford   IG-old        GCG           PHYLIP
PHYLIP-Int    PHYLIP-Seq    ClustalW

Sequences are inserted sequentially. By clicking on 'Add Sequence Field', more than one sequence can be added.

By checking the 'reverse sequence' box, the reverse complement of a sequence will be entered.

By checking the 'fix sequence in gap' box, the program will attempt to keep the sequence within a gap (the single stranded region between overlaps). Gap fixed sequences should be kept as small as the expected gaps, which depend on the melting temperature and length chosen. The lower the melting temperature and the longer the length, the larger the gaps will be, and the more room there is for gap-fixed sequences.

The typical reason for fixing a sequence in a gap is to allow the sequence to be swapped easily by a single oligo later on, as in saturation mutagenesis experiments, and to eliminate problems that would occur with random mutagenesis.

Output

Protein Sequence(s)

All protein sequences entered will appear in this section, along with residue numbering and the 'REVERSE' option if chosen. The sequences are presented in the order that they are incorporated into the synthetic gene, and the number of the sequence is given in the section header.

The protein sequence sections are not displayed in DNA only runs.

Codon Frequency Table

This table presents the input codons, along with the amino acid residue and codon frequency entered, in a standard 4x4 format. If a standard organism was chosen, the name of the organism appears in the header line. Otherwise it might say 'USER INPUT'.

The codon frequency table is not displayed in DNA only runs.

Active Codons

This table shows (in order from left to right) the status of codon sequences for all amino acid residues:

The active codons table is not displayed in DNA only runs.

Sequence Patterns

Any restriction sites or user-defined sequence patterns that are to be excluded (or at least attempted to be excluded) are displayed in the sequence pattern section.

Solution(s)

The next sections are the final fruits of the optimization. Pay close attention to the output in the following sections.

Parameters For Trial

These are the input parameters for the individual optimization trial.

DNA Sequence

This is the final, complete and 5' -> 3' DNA sequence of the synthetic gene. This is the section that you can cut and paste into other programs for further analysis and display. Simple enough.

Oligonucleotide Assembly

The output of each trial has a diagram of the oligos as they would join together to form the synthetic gene. Arrows display the direction of polymerization, and lines demarcate 10 nt intervals. The oligos are alternately displayed in upper and lower case. The translated protein sequence appears below the assembled sequence. The oligos are numbered from 5' to 3'. This simplifies finding oligos to generate fragments of the synthetic gene.

Here is a sample output with a repeat present:

 The oligonucleotide assembly is:
 ----------------------------------------------------------------
     1       10        20        30        40        50        60
     |        |         |         |         |         |         |

     1 --->                          3 --->
   1 ATGGCGCATCATCACCACCATCATGCC     cgttggcccggaacgccgcctgctggcc
      ACCGCGTAGTAGTGGTGGTAGTACGGGCACGGCAACCGGGCCTTGCGGCG     ccgg
                                                 <---  2
      M  A  H  H  H  H  H  H  A  R  A  V  G  P  E  R  R  L  L  A

     |        |         |         |         |         |         |

                           5 --->
  61 gtgtatacgggcggtaccattgGTATGCGCTCTGAGTTAGGCGTTCTGGTGCCAGGCACC
                       ***********                                < repeat?
     cacatatgcccgccatggtaaccatacgcgagactcaatccgcaagaccacg TCCGTGG
                                                  <---  4
      V  Y  T  G  G  T  I  G  M  R  S  E  L  G  V  L  V  P  G  T  

Internal repeats, potential misprimes, GC- and AT-rich regions, and mutated sequences are highlighted within the sequence for user inspection.

Scores

The final scores for the individual trials are displayed in this section. The 'best' is zero, but don't be lured into the fantasies of computational biology. There are no absolute values that can be correlated with the success of gene synthesis. A rule of thumb is 'lower is better', but judge for yourself. See below, Hints and Suggestions.

Frequency Report

This table gives a summary of the number of times a given codon was used in the protein sequence sections. It is similar to the Codon Frequency Table (see above), except that a fourth value is displayed.

The frequency report is not displayed in DNA only runs.

Histograms

There are four histograms that are displayed for each individual trial.

The frequency histogram shows the number of codons in the protein sequences that fall within a range of frequencies. It then summarizes the total number of codons used.

The Tm histogram shows the number of overlaps that fall within a range of annealing temperatures. It then summarizes the total range of annealing temperatures for all overlaps. This value should be as low as possible.

The overlap length histogram shows the number of oligos that have overlaps within a length range. It gives the length of the shortest overlap in nucleotides. Make sure this value is greater than 12, otherwise you might get non-specific annealing.

The length histogram shows the number of oligos that fall within an overall length range. It is better to have shorter oligonucleotides, but sometimes longer is just more practical for higher annealing temperatures and more specificity.

In the case of sequence segments that are gapfixed, a fifth histogram showing gap lengths is generated. This is mostly to see the feasibility of fixing sequence segments in gap regions. If the gaps are too small, it is highly unlikely that the segment will fit in the gaps.

Sequence Patterns

This section shows all sites that were meant to be excluded, but got through the optimization anyhow. The table shows:

Oligonucleotide Sequences

Finally, what you wanted in the first place! This is a list of the sequences (5' -> 3') of the oligonucleotides necessary to generate the synthetic gene. Neat and simple, each preceded by its number (from the oligonucleotide assembly) and followed by its length.

If you like the trial, then just cut and paste into any order form and you're all set.

Final Summary

If multiple trials were run, then this shows a summary of all the trials and a simple way to choose the 'best' solution. The table shows:

Errors

Input Errors

Errors... what errors?

Program History

Program Description

DNAWorks is a computer program that automates the design of oligonucleotides for gene synthesis by PCR-based gene assembly. The program requires simple input information: an amino acid sequence of the target protein or a DNA sequence, and a desired annealing temperature. Additionally, codons can be optimized for an organism of choice, sequences (such as restriction sites) can be excluded from the protein coding region, and flanking sequences (for subsequent cloning or integration) can be added to the protein coding region. The program outputs a set of oligonucleotide sequences with highly homogeneous annealing temperatures, minimal size, and low tendencies for hairpin formation and mispriming by both short and long range repeats. With the help of DNAWorks and a two-step PCR method, synthetic genes of up to 3000 basepairs can be successfully constructed.

The original description of DNAWorks and the method of PCR-based gene synthesis can be found in our publication (Hoover & Lubkowski, 2002).

How does the program work?

At the beginning of the program, an initial gene is constructed by reverse translating the protein sequence and joining the 5' and 3' flanks to the reverse translation. Codons are chosen randomly from the set of codons that are above the codon frequency threshold value. If a threshold value of 0 is given, all codons are included in the set, regardless of frequency. This generates a single DNA sequence. In the case of an input DNA sequence, no codons are required, and the sequence is taken as is.

Once a DNA sequence is obtained, the gene is broken into overlaps and oligos. This process of overlap generation is described in our paper, with the following exception. In the original version of DNAWorks, all the overlaps were contiguous so that the oligos are as small as possible. In the current version of DNAWorks, gaps between overlaps are allowed, giving larger oligos but also reducing the cost of oligo synthesis (fewer oligos needed) and simplifying subsequent site-directed mutations (if desired). Because the first and last overlaps can shift (giving a 5' or 3' overhang of 0 - n, with n being the length of the first or last overlap in nucleotides, respectively), there are multiple sets of overlaps, and hence multiple sets of oligos possible. All possible sets of oligos are evaluated and the best set is chosen, as determined by scoring.

An alternative mode of oligo design, termed "thermodynamically balanced inside-out", was developed for cases where problems occurred during PCR synthesis (Gao, et al., 2003). In an assembly set of oligonucleotides, the first half of the oligos are all synthesized in the sense orientation, and the other half are synthesized as reverse complements in the anti-sense orientation of the gene. This was found to improve control and reliability of gene synthesis by stepwise PCR. This mode can be toggled by clicking on the labeled box in the parameter section.

A gene is scored based on a set of features that are critical in the gene synthesis procedure. The sequence features evaluated in determination of a sequence score are melting temperature (overlap alone), hairpin formation potential (an oligo sequence vs. itself), misprime potential (an overlap sequence vs. the entire sequence), length (of the oligo sequences and the overlap sequences), repeat (the entire sequence vs. itself), GC content (the entire sequence vs. itself), AT content (the entire sequence vs. itself), codon frequency (codon alone) and the presence of restriction sites or sequence patterns (the entire sequence vs. itself). These features are discovered, evaluated, and scored based on absolute values (melting temperature, codon frequency, length), sequence identity (repeat, GC and AT content, restriction sites and sequence patterns) and empirical formulas (hairpin, misprime). The scores are applied to the sequence, such that the regions that have unwanted features (potential misprime site, GC rich, Tm outside the desired range, etc.) have the highest local score. The total score is the sum of all local scores. The melting temperature is calculated using the equations of SantaLucia and Hicks. The names of the restriction sites to choose from are entirely from New England Biolabs.

In the case of an input DNA sequence, mutations are not possible, and only the best set of oligos resulting from the overlap generation step are output. Sequences which code for proteins are degenerate, however, and an almost infinite combination of codons is possible for a single sequence. Thus the optimization of a coding sequence is run using a simulated annealing (Metropolis) protocol, which can find a global optimum without needing to evaluate all possible local optima, and in a greatly accelerated fashion.

The gene is silently mutated (codon swap) at a single position. The position is chosen based partly on its local score (low frequency, GC rich region, repeat, etc.), and partly at random. Then the total score is calculated for the gene. This is a mutation round. If the mutation lowers the score, or if the "temperature" (in simulated annealing) is high enough, the program keeps the mutation. Otherwise it reverts to the original codon. Every mutation round another single silent mutation is generated and evaluated. When enough mutation rounds (arbitrarily set to 6000) have gone by without a drop in the total score, the program exits, and the final set of oligonucleotides is printed.

Typically a gene will optimize very quickly (within the first 500 mutation rounds), but much smaller drops will continue for a while. Short simple sequences will drop to zero, and will exit before the 6000 rounds are up. Longer, more complicated sequences will drop in score more gradually, and tend to drag on for much longer before the arbitrary number of rounds finish. The value of 6000 for the cutoff was chosen because otherwise the program would keep churning with smaller and more insignificant drops in score for a very long time, long past the hour cutoff time.
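
The mutation rounds described above follow the general pattern of simulated annealing, which can be sketched as follows. This is a generic illustration and not the DNAWorks source: the Metropolis acceptance formula, the cooling schedule, all parameter names, and the toy score (which counts G or C in the third codon position) are assumptions. The cutoff is read here as 6000 rounds without a drop in the best score.

```python
import math
import random


def anneal(initial, score, mutate, max_stalled_rounds=6000,
           start_temperature=1.0, cooling=0.999, seed=0):
    """Minimize score(state) by repeated single mutations."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = start_temperature
    stalled = 0
    # Stop at a perfect score of zero, or after too many rounds without a drop.
    while best_score > 0 and stalled < max_stalled_rounds:
        candidate = mutate(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Keep improvements; keep a worse state only while the temperature allows it.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
        if current_score < best_score:
            best, best_score = current, current_score
            stalled = 0
        else:
            stalled += 1
        temperature = max(temperature * cooling, 1e-9)
    return best, best_score


# Toy problem: silent mutations only, using two residues from the example table.
PEPTIDE = "GEGE"
SYNONYMS = {"G": ["GGG", "GGA", "GGT", "GGC"], "E": ["GAG", "GAA"]}


def toy_score(codons):
    """Count codons ending in G or C (a stand-in for a real feature score)."""
    return sum(1 for codon in codons if codon[2] in "GC")


def toy_mutate(codons, rng):
    """Swap one randomly chosen codon for a synonymous codon."""
    position = rng.randrange(len(codons))
    changed = list(codons)
    changed[position] = rng.choice(SYNONYMS[PEPTIDE[position]])
    return tuple(changed)
```

Running anneal(("GGG", "GAG", "GGC", "GAG"), toy_score, toy_mutate) on this toy problem exits early once the score drops to zero, the same behavior described above for short, simple sequences.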

Once the final set of oligos is completed, the program outputs the results. If multiple solutions were requested (Number of Solutions), then the protein sequence is reverse translated as before, generating a new set of oligos, and the process is repeated for each solution. The results are printed to a plain text file that can be emailed to the user, or accessed via the web.

What's new in version 3.2?

Realistic handling of degenerate sequences

Degenerate sequences are allowed in DNAWorks, as some experiments call for targeted, randomized nucleotides. In previous versions, DNAWorks would simply create a completely new sequence for every nucleotide variation. This caused massive slowdowns in calculations, and didn't really give a realistic expectation of results.

This has been fixed so that degenerate sequences are scored as an average, rather than independently. Then the oligo sequences are printed with the degenerate sequence intact. Also, degenerate sequences are now restricted to gap regions only. This is necessary, because it is impossible to control PCR when degenerate sequences appear in overlap regions.

Because DNAWorks restrains sequences based on scores, rather than constrains, there is a possibility that a degenerate sequence will appear in overlap regions. If this occurs, a warning message will appear in the output, and the sequence will be flagged in the assembly block. Keep an eye out for this.

What's new in version 3.1?

Fix sequences in gap regions

If, after generating a synthetic gene, multiple mutations of a small segment (e.g., 1-10 nt) are desired, it would be nice to simply create a library of oligos, and swap out a single oligo from the set to create the mutation. This can be done if the mutation lies in a gap region of the assembled gene. By breaking the small sequence segment into its own element, the segment can be fixed in a gap. The program will do its best to design the oligos such that the chosen segment lies within a gap. This can be done by clicking on the 'fix sequence in gap' checkbox.

Checking the fixed gap property will enable GapFix scoring. Any sequence that is designated as gapfixed will be penalized if it appears outside of any of the gaps or the 5'/3' overhangs. The weight of the GapFix score can be modulated with the FWT value (see Weights in Input/Advanced Features).

Keep in mind that gaps are usually not very long -- the lower the Tm and the longer the oligo length chosen, the longer the gaps will get. Thus only very small segments should be fixed into the gap regions.

By default, all oligos are designed to be the same size as the chosen length. However, this can interfere with gap fixed segments. To increase your chances of a successful run, you may want to try randomizing the oligo length (see below).

Random oligo lengths

By default, an attempt is made to force all oligos to be the same size as the chosen length. On occasion this can lead to a higher probability of misprimes. Also, this can limit successful optimization when sequences are gapfixed (see above), since gap position and size will be limited. In this case, enabling the length directive random causes oligos to be designed with random length (between 20 nt and the length chosen).

No gaps allowed in assembly

By default, DNAWorks will try to keep all oligos the same size as the chosen length. If the size is beyond the sizes required for the chosen Tm, gaps are introduced between overlap regions. The directive nogaps will keep oligos as short as possible, with no gaps between the overlap regions.

Restricting oligos to no gaps may slow down the optimization somewhat, and may result in higher scores due to a higher probability of misprimes.

Dynamic histograms

Histograms printed in the logfile are now dynamic. That is, their ranges will vary depending on the output of the run. This should allow for better understanding of the success of a solution at a glance.

Explicit sequence segment display

Previously, the logfile only displayed the individual protein sequences at the start of a solution run, with nucleotide sequences only appearing in the translated product. Now all sequence segments are displayed, along with their types and properties.

What's new in version 3.0?

Individual sequence segments/elements

Previous versions of DNAWorks were limited to a single protein chain, optionally surrounded by flanking nucleotide sequences. In version 3.0, sequence elements of either nucleotide or protein can be assembled in any order or quantity. Further, the sequences can be reversed. The web interface allows one sequence segment by default, but as many as 99 sequence elements can be added by clicking on the "Add Sequence Field" button at the bottom of the sequence textbox.

Weights

A synthetic gene design is optimized by continuously scoring properties of the gene (probability of mispriming, repeats, Tm, length, etc.) and randomly modifying the elements of the gene that can be randomized (overlap position, codon usage, etc.). The scoring algorithms are somewhat arbitrary, and results are not always expected.

For the very expert user, the individual weights of the scoring algorithms can be modified. Thus, a single score can be turned off by setting its weight to zero, or the importance of a set of scores (GC, AT, repeat, for example) could be increased relative to the other scores by a factor of 10. WARNING: this is still an experimental feature, and should be left alone by all but the most expert users.
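To make the effect of weights concrete, here is a minimal sketch that assumes the overall score is a weighted sum of the individual scores. That form, the function name, and the score names and values are all hypothetical; the guide does not state how DNAWorks combines its scores internally.

```python
def total_score(scores, weights):
    """Combine per-feature scores into one number.

    scores:  dict of feature name -> raw score
    weights: dict of feature name -> weight; missing features default to 1.0
    A weighted sum is assumed here purely for illustration.
    """
    return sum(weights.get(name, 1.0) * value for name, value in scores.items())


# Made-up raw scores for one candidate design.
scores = {"misprime": 2.0, "repeat": 1.0, "gc": 0.5, "at": 0.5, "tm": 3.0, "length": 1.0}

default = total_score(scores, {})                       # all weights 1.0
no_tm = total_score(scores, {"tm": 0.0})                # Tm score turned off
gc_heavy = total_score(scores, {"gc": 10.0, "at": 10.0, "repeat": 10.0})
```

Setting a weight to zero removes that term entirely, while a weight of 10 makes the optimizer far more sensitive to changes in that score than to the others.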

What's new in version 2.4?

Mutant sequence evaluation/entry

After spending so much time getting a synthetic gene put together, wouldn't it be simple to make 1-3 new oligos for each site-directed mutation and be sure the new oligos will not create problems in the PCR? Well, now you can! Clicking on "mutant sequence" will display the entry form for doing just that. Enter a job name, the mutated sequence (make sure it is the same length as the original sequence), the original logfile and trial number (used for original gene synthesis). The parameters will be set to the same as that of the trial number from the original logfile. Once everything is entered, clicking "Design oligos" will generate the replacement oligos, along with an evaluation of scores for the mutated sequence. The mutation is printed in lowercase font, and it is highlighted in the oligonucleotide assembly.

When creating mutants, look through the new logfile and make sure there are no radical changes in the scores and, most importantly, the Tm histogram. And always make sure that the mutation you designed is what you expected!

What's new in version 2.3?

GC content scoring

Stretches of high GC content can create deleterious secondary structures, inhibiting the PCR. The latest version monitors for stretches of GC-rich regions of 8 nucleotides or longer and attempts to eliminate them through codon variability.
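A simple way to picture this check is a scan for runs of G and C. The sketch below treats a "GC-rich stretch" as an uninterrupted run of G/C of at least 8 nt; that definition and the function name are assumptions, since the exact criterion used by DNAWorks is not given here.

```python
import re


def gc_stretches(sequence, min_length=8):
    """Return (start, end, run) for each uninterrupted G/C run >= min_length.

    start is 0-based and end is exclusive, as in Python slicing.
    """
    pattern = re.compile(r"[GC]{%d,}" % min_length, re.IGNORECASE)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


# Example: one 10-nt G/C run starting at position 2.
hits = gc_stretches("ATGGCGCGCCGGATAT")
```

A design tool could then try synonymous codon swaps inside each reported run and rescan until the list is empty or no further swaps are possible.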

Length scoring

While the melting temperatures of the overlaps and the lengths of the oligonucleotides are directly correlated, the lengths of the oligonucleotides can be modulated to some degree by codon variability. The program will attempt to keep the oligonucleotides from becoming longer than the desired length.

The length score will dominate the solution score in cases where a high Tm and a low oligo length are combined. Thus it would be desirable to enable automatic ranging to find the right balance of length and Tm for a particular amino acid sequence (see below).

What's new in version 2.2?

Codon score disabling

The codon frequency threshold value will restrict codon usage to those codons that have frequencies equal to or greater than the percent threshold value. However, the top two codons will always be used in order to allow some mutational variability. Thus, there is typically no need to maintain a score for codon usage, since the codons are automatically restricted to the highest frequencies.

Disabling the codon score allows for much faster convergence, as well as focusing the optimization on mispriming and repeat minimization. For those users who would still prefer to enable the codon scoring, setting the codon frequency threshold to 100 will turn codon scoring back on.
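The threshold rule can be sketched as follows. The function name is hypothetical, the frequencies are illustrative numbers rather than values from a real codon table, and the assumption is that frequencies are expressed as percent usage within one amino acid.

```python
def allowed_codons(frequencies, threshold_percent):
    """Return the codons permitted for one amino acid, most frequent first.

    frequencies: dict of codon -> percent usage (assumed units)
    Codons at or above the threshold are kept, and the two most frequent
    codons are always kept to leave some room for variation.
    """
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    keep = set(ranked[:2])
    keep.update(c for c in ranked if frequencies[c] >= threshold_percent)
    return [c for c in ranked if c in keep]


# Illustrative (not real) leucine codon frequencies, in percent.
leu = {"CTG": 50, "TTA": 14, "TTG": 13, "CTT": 11, "CTC": 8, "CTA": 4}

strict = allowed_codons(leu, 20)   # only the top two survive
loose = allowed_codons(leu, 10)    # everything at or above 10 percent
```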

What's new in version 2.1?

Mispriming analysis

Mispriming occurs when an oligo binds to an unexpected region of the DNA stably enough to allow the polymerase to initiate and extend a new strand. This is likely the biggest reason for gene synthesis failure, as it will lead to alternate and disruptive side products (the long smear, rather than the expected bands on a gel).

To counter this, DNAWorks compares the overlap sequences to the rest of the DNA, and attempts to screen out any sequence that has at least 55% identity with the overlap sequence and that has five consecutive nucleotide matches at the 3' end of the oligonucleotide. The number of potential misprimes is displayed in the real time output ("Mis = #"; the number of repeats is also displayed, "Rep = #"). Any oligo:DNA potential mispriming sites that cannot be screened out are displayed in the final output.
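The two criteria above (at least 55% identity, plus five consecutive matches at the 3' end) can be sketched as a simple window comparison. This is only an illustration: the function names are hypothetical, the comparison is ungapped and on a single strand, and the scan does not exclude the oligo's intended binding site, all of which may differ from what DNAWorks actually does.

```python
def is_potential_misprime(overlap, window, min_identity=0.55, three_prime_matches=5):
    """Compare an overlap with an equal-length window of the template.

    Both strings are written 5'->3' in the same sense, so the last
    characters correspond to the 3' end of the oligo.
    """
    if len(overlap) != len(window):
        raise ValueError("overlap and window must have the same length")
    overlap, window = overlap.upper(), window.upper()
    matches = sum(a == b for a, b in zip(overlap, window))
    identity = matches / len(overlap)
    tail_matches = overlap[-three_prime_matches:] == window[-three_prime_matches:]
    return identity >= min_identity and tail_matches


def find_misprimes(overlap, template):
    """Return 0-based start positions of windows flagged as potential misprimes."""
    n = len(overlap)
    return [i for i in range(len(template) - n + 1)
            if is_potential_misprime(overlap, template[i:i + n])]


# 7 of 10 positions match and the last five match: flagged.
flagged = is_potential_misprime("ACGTACGTAC", "TTTTACGTAC")
# 9 of 10 positions match but the 3' end differs: not flagged.
not_flagged = is_potential_misprime("ACGTACGTAC", "ACGTACGTAA")
```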

Mispriming can be minimized by keeping the melting temperatures of the oligos as high as possible, and by keeping the oligos as long as possible. Unfortunately, doing so will also increase the possibility of introducing errors from oligonucleotide synthesis. Gene synthesis is very much a balancing act.

What's new in version 2?

Gapped oligos

The original version of DNAWorks was restricted to oligos being immediately adjacent to each other. This was primarily due to my belief that the oligos should be as short as possible. However, increases in the efficiency of oligonucleotide synthesis and user demands warranted the ability to gap oligos. In version 2, oligos can be as long as the user wants, but no shorter than 20 nt.

More informative output

The output of each trial now has a diagram of the oligos as they would join together to form the synthetic gene. Arrows display the direction of polymerization, and lines demarcate 10 nt intervals. As in the old version, oligos are alternately displayed in upper and lower case. The translated protein sequence appears below the assembled sequence. The oligos are now numbered from 5' to 3'. This simplifies finding oligos to generate fragments of the synthetic gene.

Here is a sample output with a repeat present:

 The oligonucleotide assembly is:
 ----------------------------------------------------------------
     1       10        20        30        40        50        60
     |        |         |         |         |         |         |

     1 --->                          3 --->
   1 ATGGCGCATCATCACCACCATCATGCC     cgttggcccggaacgccgcctgctggcc
      ACCGCGTAGTAGTGGTGGTAGTACGGGCACGGCAACCGGGCCTTGCGGCG     ccgg
                                                 <---  2
      M  A  H  H  H  H  H  H  A  R  A  V  G  P  E  R  R  L  L  A

     |        |         |         |         |         |         |

                           5 --->
  61 gtgtatacgggcggtaccattgGTATGCGCTCTGAGTTAGGCGTTCTGGTGCCAGGCACC
                       ***********                                < repeat?
     cacatatgcccgccatggtaaccatacgcgagactcaatccgcaagaccacg TCCGTGG
                                                  <---  4
      V  Y  T  G  G  T  I  G  M  R  S  E  L  G  V  L  V  P  G  T


Here is another sample output (DNA only) with a hairpin present in oligos 4 and 5:

 The oligonucleotide assembly is:
 ----------------------------------------------------------------
     1       10        20        30        40        50        60
     |        |         |         |         |         |         |
 
     1 --->             3 --->                                   
   1 GGGGCTACAGTAGATCGCGtagcgatagctctaaaagtttttggccgttgtgagctggcg
     CCCCGATGTCATCTAGCGCATCGCTATCGAGATTTTCAAAAACCGgcaacactcgaccgc
                                           <---  2               
                                                                 
 
     |        |         |         |         |         |         |
 
       5 --->                                        7 --->      
  61 g CGCCATGAAACGTCATGGTTTAGACAATTACCGCGGTTATAGCC  ggcaactgggtt
         ******    ******                                         < hairpin?
     cggcggtactttgcagtaccaaatcTGTTAATGGCGCCAATATCGGACCCGTTGACCCAA
                       <---  4                                 <-

Internal repeats and hairpins are now highlighted within the sequence for user inspection. Generally the hairpins that are formed are thermodynamically weak, but repeats that cannot be eliminated can lead to mispriming and synthesis failures. The repeats shown in the output are only those that occur at 3' ends of the oligonucleotides, and so are likely the most dangerous. Repeats that occur due to amino acid motif repetition are very difficult to deal with, and should be monitored closely.

Automatic ranging

In the old version, a user had to submit many runs to test various lengths and annealing temperatures. In version 2, entering the value "50-55" in the oligo length textbox allows automatic ranging of length from 50 to 55 nt. The user can then look to the final summary to decide which length and annealing temperature gave the best results.
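For readers scripting their own inputs, a range entry such as "50-55" can be expanded into the individual lengths it covers. The helper below is a hypothetical sketch; it assumes the range is inclusive and stepped by 1 nt, which the guide does not state explicitly.

```python
def parse_length_range(text):
    """Expand '50-55' into [50, 51, ..., 55]; a single value '50' gives [50]."""
    parts = text.strip().split("-")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError("expected a single length or a 'low-high' range")
    if low > high:
        raise ValueError("range start must not exceed range end")
    return list(range(low, high + 1))


lengths_to_try = parse_length_range("50-55")
```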

Faster optimization

Optimization of the synthetic sequence is more efficient and faster in version 2. This is because scoring is now done on codons, rather than overlaps. Thus the program does not waste time "guessing" which codon is most troublesome to the sequence score. This also makes multiple solutions generally unnecessary, as identical parameters now result in very similar solutions.

PCR condition parameters

The algorithm for determining annealing temperatures is expanded to include factors for oligonucleotide, monovalent cation, and magnesium concentrations. These factors can change the annealing temperatures of oligonucleotides dramatically. The user can thus anticipate the effects of the final PCR conditions instead of guessing.

The equations and values for determining annealing temperatures are from SantaLucia & Hicks, 2004.
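To show why concentrations matter, here is a minimal sketch of the standard two-state melting temperature calculation for a non-self-complementary duplex, with the commonly quoted sodium correction to the entropy from the nearest-neighbor literature. The function names are hypothetical, the enthalpy and entropy values in the example are placeholders rather than computed nearest-neighbor sums, and magnesium is left out entirely. Whether DNAWorks applies these corrections in exactly this form is an assumption.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)


def salt_corrected_entropy(dS_1M, duplex_length, monovalent_molar):
    """Adjust a 1 M NaCl entropy (cal/(K*mol)) to another monovalent salt level.

    Assumes oligos without terminal phosphates, so half the number of
    phosphates in the duplex is duplex_length - 1.
    """
    return dS_1M + 0.368 * (duplex_length - 1) * math.log(monovalent_molar)


def melting_temperature(dH_kcal, dS_cal, total_strand_molar):
    """Two-state Tm in degrees C for two different strands at equal concentration."""
    kelvin = (dH_kcal * 1000.0) / (dS_cal + R * math.log(total_strand_molar / 4.0))
    return kelvin - 273.15


# Placeholder thermodynamic values for a 20-bp overlap (not real data).
dH, dS_1M = -150.0, -400.0
dS_50mM = salt_corrected_entropy(dS_1M, 20, 0.050)

tm_1M = melting_temperature(dH, dS_1M, 0.4e-6)
tm_50mM = melting_temperature(dH, dS_50mM, 0.4e-6)
tm_dilute = melting_temperature(dH, dS_50mM, 0.04e-6)
```

In this sketch, lowering the salt or the oligo concentration lowers the calculated temperature, which is the kind of shift the PCR condition parameters are meant to anticipate.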

Thermodynamically balanced inside-out mode output

The method of gene synthesis employed by DNAWorks is termed "thermodynamically balanced", in that all the oligonucleotides should assemble and anneal at the same temperature. The amplification occurs everywhere at once, and ideally can generate the gene with just one round of PCR. However, there are sticky cases where the gene does not amplify, and constructing the gene in pieces is not successful.

A more controlled method of gene synthesis, termed "thermodynamically balanced inside-out", was developed for cases where problems occurred during PCR synthesis (Gao, et al., 2003). In an assembly set of oligonucleotides, the first half of the oligos are all synthesized in the sense orientation, and the other half are synthesized as reverse complements in the anti-sense orientation of the gene. The gene assembly and amplification is thus done in steps of 0.4-0.6 kb from the center pair of oligonucleotides outward.

The new version of DNAWorks allows for the conventional mode output or inside-out mode output. This simplifies the synthesis of oligonucleotides for gene synthesis.
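The difference between the two output modes can be sketched as a change in which strand each oligo is written on. In the hypothetical helper below, the input is a list of oligo-sized pieces of the gene, all written 5'->3' on the sense strand; the first half is returned unchanged and the second half is reverse complemented. How an odd number of oligos is split is an assumption of this example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def inside_out_oligos(sense_segments):
    """Write the first half as sense oligos and the second half as anti-sense.

    With an odd number of segments, the extra one goes to the anti-sense
    half (an arbitrary choice made for this sketch).
    """
    half = len(sense_segments) // 2
    first = list(sense_segments[:half])
    second = [reverse_complement(s) for s in sense_segments[half:]]
    return first + second


oligos = inside_out_oligos(["ATGGC", "GCCAA", "AACGT", "GTTAA"])
```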

Hints and Suggestions

Some Notes on PCR-Based Gene Synthesis

The method of generating synthetic genes proceeds through four steps:

  1. Designing and synthesizing oligonucleotides
  2. Gene assembly by PCR
  3. Gene amplification by PCR
  4. Subcloning the PCR product

Designing and Synthesizing Oligonucleotides

Use DNAWorks to generate oligonucleotide sequences. Make very sure that everything is what it should be; i.e., no unforeseen restriction sites within the gene, sequences are correct, flanking sequences are proper, etc. Then order the oligos you need.

Gene Assembly

Dissolve primers in appropriate amounts of water to 0.3 mM. Mix 20 ul of each into a single tube. Dilute the oligos to a final concentration of 5 uM each. Mix the oligos into the PCR mixture (buffer, Mg2+, polymerase, etc.) to a final concentration of 0.2 uM. Run standard PCR with the annealing temperature entered into DNAWorks. The elongation time will depend on the length of the gene and the polymerase used. For Pfu Ultra (Stratagene), 30 seconds/1 kb is sufficient. Generally 15 cycles should give enough template for the next step.

This will probably result in a smear on agarose electrophoresis. Not to worry, an additional step will pull out the proper fragment.
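The dilution steps in the assembly protocol above can be checked with C1 x V1 = C2 x V2. The sketch below uses the concentrations from the text (0.3 mM stocks, 20 ul of each, 5 uM pool, 0.2 uM in the PCR); the function names, the 50 ul reaction volume, and the reading of 0.2 uM as a per-oligo concentration are assumptions for illustration.

```python
def volume_to_dilute(stock_conc, stock_volume, target_conc):
    """Total final volume needed so that C1 * V1 = C2 * V2 (consistent units)."""
    return stock_conc * stock_volume / target_conc


def assembly_mix(n_oligos, stock_uM=300.0, aliquot_ul=20.0,
                 pool_target_uM=5.0, pcr_volume_ul=50.0, pcr_target_uM=0.2):
    """Work out the pooling and dilution volumes for the assembly PCR."""
    pooled_ul = n_oligos * aliquot_ul
    # Each oligo is diluted by the other aliquots as soon as they are pooled.
    conc_each_in_pool_uM = stock_uM * aliquot_ul / pooled_ul
    final_pool_ul = volume_to_dilute(stock_uM, aliquot_ul, pool_target_uM)
    if final_pool_ul < pooled_ul:
        raise ValueError("too many oligos: the pool is already below the target")
    return {
        "pooled_ul": pooled_ul,
        "conc_each_in_pool_uM": conc_each_in_pool_uM,
        "water_to_add_ul": final_pool_ul - pooled_ul,
        "pool_per_pcr_ul": pcr_target_uM * pcr_volume_ul / pool_target_uM,
    }


plan = assembly_mix(20)
```

For 20 oligos this gives a 400 ul pool at 15 uM each, 800 ul of water to reach 5 uM each, and 2 ul of the diluted pool per 50 ul reaction.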

For details on the method of thermodynamically balanced inside-out gene synthesis, please see (Gao, et al., 2003).

Gene Amplification

Dilute outer primers (those primers that match the 5' ends of the PCR product) to 5 uM each. Run standard PCR with the outer oligos at 0.1 to 0.4 uM final concentration, using 1 ul of the initial PCR (the assembly PCR) as template. Again, 15 cycles should be enough to see product on a gel. After PCR, there should be a single band by agarose electrophoresis.

Once PCR is done, purify the PCR fragment:

  1. Add agarose gel running buffer and run PCR reaction out on 1% agarose gel
  2. After staining gel with EtBr, remove correct sized band and place into sterile Eppendorf tube
  3. Purify gene from gel

Subcloning

Cloning of the gene can be done by ligation, topoisomerase (TOPO system), or transposases (Gateway system). See the specific protocols for these methods. If ligation will be used, it may be useful to scale up the Gene Amplification step. I would recommend four tubes of 100 ul each (400 ul total PCR product) to be very safe.

Crucial Parameters

It may help to run "touchdown" and "hot start" PCR to minimize mispriming during gene assembly and amplification. This should help in producing a single band.

Optimizing the PCR conditions may also help in minimizing (or eliminating) errors introduced during PCR. We have seen an error rate of 2 errors per 1 kb sequenced. Generally, one should not need to sequence more than 3 clones to find one with the correct sequence. In brief tests, keeping oligonucleotide lengths below 45 nucleotides, optimizing PCR conditions, and using Pfu Ultra completely eliminated PCR-introduced errors.

Low annealing temperatures (55-58°C) are not as critical as the length of the overlap. Make sure that none of the overlaps drop below 12 nucleotides. While the annealing temperature for that overlap may be above 58°C (such as a high G+C sequence), a short overlap may allow mispriming.

The Promise and Peril of Synthetic Genes

Why Gene Synthesis?

In the post-genomic era, thousands of unknown proteins have become available for study. While in theory the structures and functions of many of these proteins may be determined by comparative analysis (Bork et al., 1998), in most cases, overexpression and purification of target proteins will be necessary (Baxter & Fetrow, 2001; Gerlt & Babbitt, 2000). Although the use of naturally occurring genes might appear to be the quickest approach, many such genes will prove to be suboptimal for cloning and overexpression in heterologous systems like Escherichia coli or yeast. The potential problems include high G+C content, codon bias and complex intron/exon structures. An approach to overcoming the complications in cloning is gene synthesis. In this approach, the protein coding sequence can be directly optimized for the expression system of choice. Variants of this strategy include oligonucleotide ligation (Heyneker et al., 1976), the FokI method (Mandecki & Bolling, 1988) and self-priming PCR (Dillon & Rosen, 1990). A particularly appealing method, due to its inherent simplicity, is assembly PCR (Stemmer et al., 1995). This involves generating overlapping oligonucleotides which, when assembled, form the template for the gene of interest. The oligonucleotides are then repetitively extended by PCR, to assemble the full-length gene in a single step.

While this method is simple in principle, in practice numerous complications can lead to errors in the synthesis. To reduce the possibility of errors during oligonucleotide synthesis, the oligonucleotides should be rather short, yet they must still be long enough to provide stable priming overlaps. Any deleterious secondary structures in the oligonucleotides and gene also need to be avoided. Further, the presence of internal repeats within the sequence can cause mispriming, and any overlooked sequences (such as restriction sites or integration-specific sequences) can cause downstream difficulties with subcloning. Therefore, for large proteins with coding sequences of >300 nt, the process of designing these oligonucleotides is tedious and confusing. In the case of a single gene, the problem can be attacked by manual design, but for projects where high throughput is required (e.g., structural genomics) an automated strategy for synthetic gene design is needed.

In practice, the cost of creating synthetic genes is economically competitive with cloning a gene from a cDNA library. At approximately 35-50 cents per base, a gene encoding a 200 amino acid protein and 25 nucleotide flanking regions would cost about $400 and take 3-5 days of working time to synthesize. Added to the benefits are predesigned codon optimization and elimination/addition of restriction sites and promoter regions. These factors tip the scale in favor of synthetic genes.