UKC Forums - Gene Bank help

This topic has been archived, and won't accept reply postings.

crossdressingrodney 23 Sep 2019

Hi there,

I'm looking for some help to understand how to access and interpret certain genetic data from GenBank, and I know there are a few biologists on here.

For context, I'm a maths teacher (who doesn't know much biology) with a student who wants to do a simple project looking at the relative frequency of codons in a gene - there is some (possibly crackpot) theory that the relative frequencies are related to the golden ratio.

However, biology is proving a little too messy for us. We have picked a gene and managed to download a massive sequence of A, C, T, G's but we're struggling to extract the sequence of codons that would actually be used to create amino acids.

I think we're struggling to identify the "reading frame"; or possibly we're reading the dictionary wrong and what I think are the codons for start and stop are actually not.

Anyway, we have written a little Python code that searches along the string and snips out any substring that starts with a START codon, ends with a STOP codon and has a length that is a multiple of three. Unfortunately the gene we chose (LCT), which is massive, only seems to code for about 3 small proteins, so something seems to be wrong somewhere.
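
A minimal sketch of the kind of scan described above, assuming the sequence is a plain string of A, C, G and T read on the forward strand. The names find_orfs and min_codons are illustrative and not taken from the original script. It walks codon by codon from each ATG and stops at the first in-frame stop codon. On a genomic download, introns interrupt the frame, which may be why a scan like this returns only short fragments.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end) spans of ATG...stop reading frames.

    Spans are 0-based and half-open, so seq[start:end] includes the
    stop codon. min_codons counts the codons before the stop,
    including the ATG itself. Only the forward strand is scanned.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Step in threes so we stay in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break  # the first in-frame stop ends this frame
    return orfs
```

Raising min_codons (say to 100) filters out the many short chance matches that any long sequence will contain.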

Any advice gratefully received.

Oli

cb294 23 Sep 2019

In reply to crossdressingrodney:

Which gene/accession number did you download?

I assume you are trying to look at human lactase, which has several versions / isoforms.

Pick one, guess it does not matter which, and make sure you have the mRNA / cDNA sequence rather than the genomic one which will be split by introns.

E.g. going to pubmed.org, selecting nucleotide as search option, and entering lactase human, will get you as third hit

https://www.ncbi.nlm.nih.gov/nuccore/NM_000404.4?report=genbank

Scroll down, and before the actual sequence you will find the annotation.

What you need is CDS, which tells you that the translation start is at 62 and the TGA stop codon at 2093.

Hope that helps,

kathrync 23 Sep 2019

In reply to crossdressingrodney:

If you are talking about codons, I assume you are primarily interested in the coding sequence, i.e., the bit that is actually translated into a protein. When you download a sequence for a gene, you will probably have several options which represent the gene in different contexts or different levels of processing:

The genomic sequence. This will include any introns as well as known untranslated regions (UTRs) immediately up- or downstream of the gene.

The mRNA. This is the product of transcribing the gene. This should not have introns, but will still have up- and downstream untranslated regions (these often have regulatory functions).

The CDS, or coding sequence. This only includes the parts of the sequence that are translated, so it should start at the start codon, stop at the stop codon, and not contain any introns. Unless there was an error, or an incomplete sequence was uploaded to GenBank, this will be in the correct frame. This is what you want.

It doesn't really matter which of these you download. The CDS is easiest if you can get it because you don't need to do anything to it. If you end up downloading something else, you should also be able to get the annotation, which will tell you where the start codon, exon boundaries and stop codon are.

In cb294's example above, it looks like his link sends you to an mRNA sequence, so you don't have to worry about the introns, and the coordinates for the start and stop codon are on the same page, as he has highlighted. You can use these in your Python script to slice away the untranslated regions at the start and end of the mRNA sequence to obtain the CDS. You can build checks into your script to make sure this makes sense: the start codon is always ATG; the stop codon is TAG, TAA or TGA; there should be only one in-frame stop codon, at the end (although ATG can occur internally in the CDS); and the number of bases should be exactly divisible by 3.
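
The slicing and checks described above could be sketched roughly as follows. This assumes the mRNA is held as a plain Python string and that the coordinates are 1-based and inclusive, as in the start..end range on a GenBank CDS feature line. The function names extract_cds and check_cds are illustrative. Note that if a quoted position refers to the first base of the stop codon, the CDS end is two bases further on.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def extract_cds(mrna, start, end):
    """Slice out the CDS using 1-based, inclusive coordinates."""
    return mrna[start - 1:end].upper()

def check_cds(cds):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not start with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop codon before the end")
    return problems
```

For example, with a toy sequence "GGATGAAATGGTGACC", extract_cds(seq, 3, 14) gives "ATGAAATGGTGA", and check_cds on that result returns an empty list.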

On the page that cb294 linked to, there is also a link directly to the CDS sequence. The link is just below the CDS annotation that cb294 highlighted, and looks like this: db_xref="CCDS:CCDS43061.1". The page it takes you to has the CDS sequence for this isoform, which should require no further modification for your purposes: https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS...

Bob Kemp 23 Sep 2019

In reply to kathrync:

I love the way you can find experts on virtually anything with UKC!

hang_about 23 Sep 2019

In reply to crossdressingrodney:

The python library biostrings helps with this. I used it to hack together some code for demultiplexing high throughput sequence and it's a nice library.

cb294 23 Sep 2019

In reply to kathrync:

Good point, should have thought of downloading CDS directly. Lazy wins!

OP crossdressingrodney 25 Sep 2019

In reply to both:

Wow, two immediate expert responses. UKC is magic sometimes.

The CDS sequence of nucleotides is exactly what I need, and it's dead clear on that link how to read it. Thank you, both. Should any mysterious mathematical patterns arise, I will be back for further advice!

Oli

kathrync 25 Sep 2019

In reply to crossdressingrodney:

> Wow, two immediate expert responses. UKC is magic sometimes.

> The CDS sequence of nucleotides is exactly what I need, and it's dead clear on that link how to read it. Thank you, both. Should any mysterious mathematical patterns arise, I will be back for further advice!

> Oli

No problem!

I would be interested to see what your student finds anyway, just out of curiosity.

Post edited at 10:38

cb294 25 Sep 2019

In reply to crossdressingrodney:

You're welcome! Works both ways, I remember a great post by you on the Riemann zeta function and analytical extension!

hang_about 25 Sep 2019

In reply to crossdressingrodney:

> For context, I'm a maths teacher (who doesn't know much biology) with a student who wants to do a simple project looking at the relative frequency of codons in a gene - there is some (possibly crackpot) theory that the relative frequencies are related to the golden ratio.

Whilst the golden ratio stuff may be crackpot, the fact that relative frequencies of codons will differ is not. See https://en.wikipedia.org/wiki/Codon_usage_bias

If you're engineering an organism to make a transgene it's common to shift the codon usage to be optimal for the new host (assuming you're synthesising the transgene from scratch).
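
For the project itself, the relative codon frequencies can be tallied from a CDS in a few lines. This is only a sketch, assuming the CDS is an in-frame string whose length is a multiple of 3; the function name codon_frequencies is illustrative.

```python
from collections import Counter

def codon_frequencies(cds):
    """Map each codon in an in-frame CDS to its relative frequency."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    # Non-overlapping triplets, read from the first base.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}
```

Running the same function on the CDS of the same gene from different organisms would be one way to see how the usage pattern shifts between hosts.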

cb294 26 Sep 2019

In reply to hang_about:

There are currently attempts to edit out certain redundant codons from all the sites where they appear in the genome of an organism (E. coli at first), in addition to deleting the matching tRNA genes. This is supposed to make these organisms resistant to infection by viruses. However, virus numbers are so huge, and their replication so sloppy, that I very much doubt this would work for long.

kathrync 26 Sep 2019

In reply to hang_about:

Yes, codon usage bias is interesting because it is very well documented but the factors contributing to biases are poorly understood. In the context of this student's project, if they do find any relationship between relative frequencies and the golden ratio in whatever organism they are looking at first, an extension might be to see if their findings hold in other organisms with differential codon usage patterns.

kathrync 26 Sep 2019

In reply to cb294:

Yes, this is an interesting idea, but it seems to me that as viruses rely on hijacking host machinery for protein synthesis anyway, they would find a way to adapt pretty quickly...

cb294 26 Sep 2019

In reply to kathrync:

The idea is that phages, having replicated in bacteria elsewhere, would use genes still containing these codons, which would not be translatable by the edited host bacteria (e.g. in bio-manufacturing plants), and virus replication would be stopped right away.

However, any pool of viruses will contain mutations in every single base, and there are other mechanisms like template switching that allow virus production from several co-infecting copies, so I assume that statistically this approach will fail at some point, probably quite rapidly.

This is actually useful. When I did my diploma thesis in virology ages ago there were no BACs yet, and YACs were a pain, so we simply chopped our normally continuous 140 kbp herpesvirus genomes into 20 overlapping plasmids (which could easily be edited individually), and transfected the mix. Worked a treat every time.

kathrync 26 Sep 2019

In reply to cb294:

Yes, I could already see the statistical problems relating to high viral mutation rates and co-infection. I admit I defaulted to thinking about medical applications where this seems like a distinctly impractical approach. Even agreeing that failure is likely, industrial applications make a lot more sense!

Ha, that's cool! I worked with a herpesvirus (MDV) during my PhD, but we already had BACs by then so I never tried this.
