In this assignment you will do comparative genomics on a set of Pseudomonas aeruginosa isolates. It builds closely on what we did in module 7.
Your data is a subset of the data described in this paper. Have a brief look at the Introduction of this paper to understand why this species is of interest to comparative genomic studies.
Write a short report in RMarkdown, using the same section headers as in this document. As usual, include essential code, figures and some short text discussing the results and the questions below.
Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.
Several isolates of Pseudomonas aeruginosa have been sequenced using Illumina HiSeq paired-end sequencing. Each student has their own subset of 20 genomes, specified in a personal genome table. In the folder /mnt/courses/BIN310/assignment3/genome_tables/ you find several tab-separated text files; use only the file with your username in its name. It is probably a good idea to copy this file to your own assignment3 folder and store it under the name genomes_table.txt, to make it similar to what we did in module 7.
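The copy step above could be sketched in the shell as follows. This is only a minimal sketch: the function name copy_genome_table and the destination folder $HOME/assignment3 are assumptions made for illustration, and it assumes that exactly one file name in the source folder contains your username.

```shell
# Copy the genome table whose file name contains the given username.
# Usage: copy_genome_table <username> <source_dir> <dest_dir>
# (function name and argument order are made up for this sketch)
copy_genome_table() {
    user=$1
    src_dir=$2
    dest_dir=$3
    mkdir -p "$dest_dir"
    # Expand the glob into the positional parameters so the matches can be counted
    set -- "$src_dir"/*"$user"*
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "Expected exactly one table matching '$user' in $src_dir" >&2
        return 1
    fi
    cp "$1" "$dest_dir/genomes_table.txt"
}

# Example call (the destination folder is an assumption):
# copy_genome_table "$USER" /mnt/courses/BIN310/assignment3/genome_tables "$HOME/assignment3"
# List the column names of the copied table, one per line:
# head -n 1 "$HOME/assignment3/genomes_table.txt" | tr '\t' '\n'
```

Listing the header line is a quick way to see the exact column names before reading the table into R.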
Note that this table has a column source_type (among others) indicating that the genomes come from two different sources. In the trees you make below, color the branches according to this and see if the genomes from the two sources separate or not.
In the genome table you find 7 columns specifying the MLST alleles for each genome. These were obtained in the same way as we did for a single genome in assignment 2. First, make a tree based only on these data.
Compute a 20x20 distance matrix based on this, where the distance between two genomes (rows) is simply the number of differing alleles, i.e. it is always an integer from 0 to 7. From these distances, compute a neighbor joining tree, and plot this. Add colors according to the source_type information for each genome.
Hint: In module 7 we did not do exactly this, but we did something similar when we constructed a pan-matrix tree. Instead of the pan-matrix, you now have the MLST sequence types.
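The allele distance described above can be written compactly. The notation is an assumption introduced here for illustration: a_{ik} denotes the allele number of genome i at MLST locus k, and the bracket equals 1 when the two alleles differ and 0 otherwise.

```latex
d(i,j) \;=\; \sum_{k=1}^{7} \mathbf{1}\!\left[\, a_{ik} \neq a_{jk} \,\right],
\qquad 0 \le d(i,j) \le 7 .
```

As a made-up example (these allele numbers are not from any genome table): the profiles (17, 5, 12, 3, 14, 4, 7) and (17, 5, 11, 3, 14, 4, 2) differ at loci 3 and 7, so their distance is 2. The distance is symmetric and zero on the diagonal, so the 20x20 matrix is fully determined by its lower triangle.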
Next, we want to make a tree based on a core-gene alignment. In the folder /mnt/courses/BIN310/assignment3/contigs/ you find the assembled contigs for the genomes. In the genome table, the column contigs_file gives you the names of the files you should use.
First, run prokka to annotate each of the 20 assembled genomes. Store only the output GFF-file, and name each file <genome_id>.gff, where <genome_id> is the text in the genome_id column for each genome. Put these GFF-files into a separate folder.
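The annotation loop described above might look roughly like this in the shell. It is a sketch under stated assumptions, not a prescribed solution: the function names list_genomes and annotate_genomes, the PROKKA override variable and the choice of 4 CPUs are made up here, and how prokka is actually started on the course system (module, container, batch job) is not covered. The options --outdir, --prefix, --force and --cpus are standard prokka options.

```shell
# Print "genome_id<TAB>contigs_file" for every data row of a tab-separated
# table, locating both columns by their header names.
list_genomes() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { print $col["genome_id"] "\t" $col["contigs_file"] }
    ' "$1"
}

# Annotate every genome in the table and keep only <genome_id>.gff.
# Usage: annotate_genomes <table> <contigs_dir> <gff_dir>
annotate_genomes() {
    table=$1
    contigs_dir=$2
    gff_dir=$3
    tmp_dir=$(mktemp -d)          # scratch folder for the full prokka output
    mkdir -p "$gff_dir"
    list_genomes "$table" | while IFS="$(printf '\t')" read -r genome_id contigs_file; do
        # --force lets prokka write into the already existing scratch folder;
        # stdin is redirected so prokka cannot consume the rows the loop reads
        ${PROKKA:-prokka} --outdir "$tmp_dir" --prefix "$genome_id" --force --cpus 4 \
            "$contigs_dir/$contigs_file" < /dev/null || return 1
        mv "$tmp_dir/$genome_id.gff" "$gff_dir/$genome_id.gff" || return 1
    done
    status=$?
    rm -rf "$tmp_dir"             # discard everything except the GFF files
    return "$status"
}

# Example call (the output folder name "gff" is an assumption):
# annotate_genomes genomes_table.txt /mnt/courses/BIN310/assignment3/contigs gff
```

Looking the columns up by header name rather than by position keeps the loop working even if the column order differs between tables.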
Next, use roary with the GFF-files from above as input to compute the pan genome. Make certain you get a core-gene alignment from roary.
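A minimal sketch of the roary step described above, assuming the GFF-files are in one folder. The function name run_roary, the ROARY override variable, the output folder name and the thread count are assumptions for illustration. The options used are standard roary options: -e requests a core-gene alignment, --mafft builds it with MAFFT instead of the slower default aligner, -p sets the number of threads and -f the output folder.

```shell
# Run roary on all GFF files in a folder and request a core-gene alignment.
# Usage: run_roary <gff_dir> <out_dir>
run_roary() {
    gff_dir=$1
    out_dir=$2
    # -e --mafft : produce a core-gene alignment, aligned with MAFFT
    # -p 8       : number of threads (arbitrary choice in this sketch)
    # -f         : folder where roary writes its results
    ${ROARY:-roary} -e --mafft -p 8 -f "$out_dir" "$gff_dir"/*.gff
}

# Example call (folder names are assumptions):
# run_roary gff roary_out
# The alignment is then expected in roary_out/core_gene_alignment.aln
```

Without -e, roary computes the pan genome but writes no alignment, so that flag is the one to double-check.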
Finally, read the core-gene alignment into R, compute evolutionary distances and a neighbor joining tree for the genomes. Plot and color as above.
The genome table also specifies the raw read files (R1 and R2) for each genome. These are found in /mnt/courses/BIN310/assignment3/fastq/, and again you only use the data for your genomes.
Compute mash sketches, and then mash distances between all pairs of genomes directly from the reads.
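The sketching and distance steps above could be outlined as follows. This is a sketch under several assumptions: the read-file columns are taken to be named R1 and R2, the function name mash_from_reads, the MASH override variable and the output file names are made up, and -m 2 (ignore k-mers seen only once) is one reasonable choice for reads rather than a required setting. Both read files of a genome are concatenated and passed on standard input ("-") so that each genome gets a single sketch, labelled with its genome_id through -I.

```shell
# Sketch each genome from its paired reads, then compute all pairwise distances.
# Usage: mash_from_reads <table> <fastq_dir> <out_dir>
mash_from_reads() {
    table=$1
    fastq_dir=$2
    out_dir=$3
    mkdir -p "$out_dir/sketches"
    # Pick genome_id, R1 and R2 by header name (R1/R2 column names are assumed)
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { print $col["genome_id"] "\t" $col["R1"] "\t" $col["R2"] }
    ' "$table" |
    while IFS="$(printf '\t')" read -r genome_id r1 r2; do
        # -r : input is reads, -m 2 : drop k-mers seen only once (likely errors)
        # -I : ID stored in the sketch, -o : output prefix (.msh is appended)
        cat "$fastq_dir/$r1" "$fastq_dir/$r2" |
            ${MASH:-mash} sketch -r -m 2 -I "$genome_id" \
                -o "$out_dir/sketches/$genome_id" - || return 1
    done || return 1
    # Combine the per-genome sketches into one file, then compare all against all
    ${MASH:-mash} paste "$out_dir/all_genomes" "$out_dir"/sketches/*.msh || return 1
    ${MASH:-mash} dist "$out_dir/all_genomes.msh" "$out_dir/all_genomes.msh" \
        > "$out_dir/mash_distances.txt"
}

# Example call (the output folder name is an assumption):
# mash_from_reads genomes_table.txt /mnt/courses/BIN310/assignment3/fastq mash_out
```

The resulting text file has one line per pair of genomes, with the two IDs in the first two columns and the distance in the third, which can be reshaped into a square matrix after reading it into R.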
Read the distances into R, and again compute a neighbor joining tree based on them. Plot and color as before.
Briefly discuss the results you got.
Is source_type important for how the genomes are related?
In order to pass this assignment, the RMarkdown report must contain: