Friday, April 9, 2010

The Genome - Going from Letters to Words, Part 1

Let's start with a quick review of our earlier discussions on DNA/RNA.  We said that DNA is our master blueprint, and RNA is like a copy of those instructions that goes to the builders.  We briefly distinguished DNA from RNA by how their sugars are different:  DNA uses deoxyribose and RNA uses ribose.  We also said that nucleotides are used to make both molecules, and we discussed how in both cases, the nucleotides are arranged in linear strands, like beads on a string.  We compared the DNA structure to two lines of line dancers: each dancer holds hands with the dancers to the left and right, and stands opposite a partner in the other line.  In DNA there are connections between the two strands, which cause the winding and characteristic double helix form.



This is a picture like the last one we saw.  We can see that the A (adenine) always pairs with T (thymine) and the G (guanine) with C (cytosine).  It's like our line dancers are forced to always choose certain partners to dance with.  In DNA, this forced interaction comes from the bonding between facing pairs: these are the light dashes between the nucleotides.  They are hydrogen bonds between the electronegative (electron-withdrawing) nitrogens and oxygens on one nucleotide and the hydrogens on its facing partner.  This is a much weaker kind of bonding than what happens between the side-by-side neighbors (about 1/10th the strength), but these bonds are strong enough to stabilize the structure.  Anytime a structure is stabilized, whether in how it's bent or shaped or in its position relative to another object, it costs energy to make it otherwise.  It's like taking the pacifier away from the baby -- there's a price to pay.  In general, Nature spends energy reluctantly: lower energy means something is more likely to occur.

So now we have our dancers in two lines facing each other, like a bunch of middle-schoolers in 6th period gym.  For the sake of this example, the kids are named after each nucleotide (so we've got a room of kids, but just 4 names among them.)  Their teacher, Mr. Chargaff, puts them in Group 1 and Group 2, and tells each group to stand in line.  Then he tells them they have to pick partners from the opposite group, and the same partners each time.   What does he tell the substitute teacher to worry about?  What can she gloss over?

The Influence of Primary Structure
Chargaff's rules tell us that all we need to worry about is lining up one line of kids.  Once we have that figured out, the position of the other children is automatically enforced.  So if the Group 1 kids are lined up as:

A, T, C, C, G, A; then their partners will line themselves up as
T, A, G, G, C, T, and vice versa.

If the Group 1 kids line up as C, T, C, C, A, A;
then their partners line up as  G, A, G, G, T,  T.

So we only need to get one group of kids organized, and the other kids do the work for us.  The genetic code works in a similar fashion: once we know the lineup of one strand, the second strand is precisely defined.  RNA exploits this effect -- it's generally single-stranded, which is all that's needed to make an accurate copy of the DNA master blueprint.
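The partner rule is easy to write down as a few lines of Python.  This is just a sketch of the gym-class version: the names PARTNER and partner_strand are made up for illustration, and the function pairs the kids off position by position, exactly as in the lineups above.

```python
# Each kid's required partner: A with T, G with C (and vice versa).
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand):
    """Return the lineup of the opposite group, position by position."""
    return "".join(PARTNER[kid] for kid in strand)


group_1 = "ATCCGA"
group_2 = partner_strand(group_1)
print(group_2)                  # TAGGCT
print(partner_strand(group_2))  # ATCCGA, the "vice versa" part
print(partner_strand("CTCCAA")) # GAGGTT
```

Notice that running the rule twice gets us back to where we started, which is why knowing either line is enough.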

So we have a line up of nucleotides.  Where does that get us?

How Robots are like Hamburgers
We have to jump aside a bit and talk about what a genetic code does.  We said it made something, but what exactly is "something"?  What is the code for?  For that matter, what's the code?



We'll answer the second question first.  We said that the genetic code is a blueprint that tells the builders what to make.  Imagine that the builders here are making robots, and each robot is pre-programmed to do a specific job.  Some go to another assembly plant, some keep house, some are conscripted into the military, others stay on site to help, and the rest fill countless other jobs in the world.  But the key feature here is that there are different types of robots, each with specific jobs, and the builders make them based on the blueprint they're given.  In real life, the robots are proteins, and the blueprints are based on the genetic code.

Protein is a tricky word, because we use it to describe our food intake.  This isn't inaccurate, but it's a little misleading.  Proteins are active molecules, and they do some kind of work.  Sometimes this work is to help or attack other molecules, like when your body's immune system fights off a cold.  Among other things, proteins are used to put together bigger structures, or to facilitate nutrient intake and waste output from them.  (This is a critical feature of living cells, especially muscle cells, so that's why we associate protein with things like hamburgers and steak.)

OK.  Now we know that proteins are important, and that they're made from the genetic blueprint.  But what makes up the proteins?  At the basic level, they're spelled out by the nucleotides we mentioned earlier.  But the nucleotides aren't the building material themselves:  they're more like the letters on an order form that tells the builders which bricks to use.  In this example, the bricks are called amino acids, and these are the building blocks of proteins.  What you need to know now is that there are 20 amino acids.  Later, we'll explain how the 20 are different, and how that influences their uses.  But for now, we've got 20 different amino acids, specified somehow by the 4 nucleotides we mentioned before.

Linear Arrangements, Alphabets and Words
Back to our line of nucleotides.  We'll skip the step where the master DNA information has been copied to RNA, and so we'll begin with our copied blueprint in hand.  So we want to build some proteins, and we've got a blueprint with a string of nucleotide letters like this:  AUGAAAUGGUUCAAGGUCUAA.  This linear sequence is called the primary sequence, and it's a simple ordering of one nucleotide after another.  We can imagine the partner sequence, and each nucleotide joined to its partner is called a base pair (bp).  The number of base pairs corresponds somewhat with the complexity of the organism, so the simplest bacteria have about 600,000 bp, while you and I have roughly 3 billion.  We shouldn't get smug, because mice have about the same number.

Now we want to use that primary sequence to make some proteins, but there's one question we need to ask first:

We know we have a lot of proteins, and we know we have 20 amino acids to use as building blocks.  But we only have 4 nucleotides to make our building blocks.  How do we get enough variation in the nucleotides to get the 20 amino acids we need?  

It wouldn't work to completely randomize our primary sequence.  We already know this because we have similarities among populations, from as small as siblings, to as large as domains.  So we know that some biological structures are shared (easy version here), and that gives us a hint that there are some predictable ways that our nucleotides arrange themselves to give information.

So we need a consistent, meaningful way to encode information.  And we need to start by getting the 20 amino acids from the 4 nucleotides.  We can't just use one nucleotide per amino acid, because then A, T, G and C could each stand for only one amino acid, so we'd only get 4 amino acids out of 20.  Using two nucleotides per amino acid also doesn't work:


AA   AT     AG    AC
TA    TT     TG     TC
GA   GT     GG    GC
CA   CT     CG     CC

This just gives us 16 distinct options, and we need 20.  If you're math-inclined, you may have noticed that we're looking at powers of 4:  4^1 = 4, 4^2 = 16, and so we'll need to use 3 nucleotides to encode each amino acid, because 4^3 = 64 distinct options, and that's enough room for our 20 amino acids.
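If you'd rather let a computer do the counting, here is a small Python sketch.  The helper names (words, split_into_words) are made up for this post, and the sample sequence is our example blueprint written in DNA letters so that it matches the table above.

```python
from itertools import product

NUCLEOTIDES = "ATGC"  # same order as the rows and columns of the table


def words(length):
    """Return every possible nucleotide 'word' with the given number of letters."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=length)]


def split_into_words(sequence, size=3):
    """Cut a sequence into consecutive, non-overlapping words of `size` letters.

    Leftover letters at the end that don't fill a whole word are dropped.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


for length in (1, 2, 3):
    print(length, "letters:", len(words(length)), "options")  # 4, 16, 64

print(words(2)[:4])  # ['AA', 'AT', 'AG', 'AC'], the first row of the table
print(split_into_words("ATGAAATGGTTCAAGGTCTAA"))
# ['ATG', 'AAA', 'TGG', 'TTC', 'AAG', 'GTC', 'TAA']
```

Our 21-letter example breaks cleanly into 7 three-letter words.  What those words mean is a story for part 2.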

Those extra 44 options don't go to waste, and that'll be explained in part 2 of our discussion of the genome.

One quick note.  We already mentioned one difference between DNA and RNA (the sugars),  but there's another important difference too.  The Thymine used in DNA is replaced by Uracil in RNA.  There's a Geekery section that has a clue as to why that might be the case.  Can you think of why this substitution is useful?  We'll give the answer in the Geekery section of the next post.
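As far as the letters go, the thymine-to-uracil swap is a simple find-and-replace.  Here's a minimal Python sketch; the function name dna_to_rna_letters is made up for illustration, and it only changes the spelling (it says nothing about how the cell actually makes the copy).

```python
def dna_to_rna_letters(dna):
    """Rewrite a DNA-letter sequence in RNA letters by swapping T for U."""
    return dna.upper().replace("T", "U")


print(dna_to_rna_letters("ATGAAATGGTTCAAGGTCTAA"))  # AUGAAAUGGUUCAAGGUCUAA
```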


Geekery:  The missing hydroxyl on DNA's sugar, deoxyribose, makes DNA more stable than RNA.  There are hydroxyls (one oxygen bound to one hydrogen) on the sugar molecule, but DNA's sugar is missing one of them.  In general, oxygen pulls electrons away from its associated carbon, making the carbon more susceptible to nucleophilic attack.  Nucleophilic means "loving a site that's electron-poor", and attack means "one molecule approaches the target, to make a bond."  Since the phosphate backbone also has hydroxyls near the sugar, leaving the hydroxyl at carbon #2 would allow the nucleophilic attack, breaking the phosphodiester bonds needed to keep the backbone intact.  So there's a hydrogen there instead.  This is one of a few ways that our genetic information is kept safe -- by storing it in a stable structure.

This also helps explain why RNA is single-stranded in comparison.  The extra hydroxyl puts RNA in a different helical form than DNA, one that is less likely to be double-stranded, or even helical through the length of the molecule.  The extra hydroxyl is also used to break RNA's phosphate backbone, because the cell uses the genome in piecewise fashion.

Also, in real life, only about 2% of our genome is actually used to make proteins, even though they are critically important.  Later, we'll talk about why that might be the case.


images:
http://www.flickr.com/photos/liquidator/ / CC BY-NC-SA 2.0
http://followchemistry.files.wordpress.com

