In this assignment you will assemble a genome from Illumina data, evaluate the assembly, and search the assembled contigs for some specific marker genes. This is mostly based on what we have done in modules 4 and 5.
Write a short report in RMarkdown, using the same section headers as in this document below. You may use figures produced by `quast` in your report; learn how to include figure files in RMarkdown. You must also write some short text discussing the results and the questions below.
Essential R and shell code (scripts) you used must be included in the RMarkdown document. You can insert chunks for both R and bash in RMarkdown. To just show the code, without running it, set `eval=FALSE`, like this:
```{bash, eval=FALSE}
<put your code here>
```
The same works for chunks of R-code as well, in case you want to show code without running it. Note that any code that produces plots must be run in order to produce the plots!
Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.
Several isolates of Pseudomonas aeruginosa have been sequenced, using Illumina HiSeq paired-end sequencing. Each student has their own isolate data set. You will find your data in
$COURSES/BIN310/assignment2/fastq/<username>_R1.fq.gz
$COURSES/BIN310/assignment2/fastq/<username>_R2.fq.gz
where you replace `<username>` by your orion user name. Please don’t copy these files, just read them from where they are.
In your report, first include the answers to these questions:
You can base your answers on reading the R1 file only, since the read lengths are usually very similar in the R2 file.
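As a minimal sketch of how the R1 file could be summarised without copying it, the function below streams the gzipped file and reports the number of reads and the read-length range. The function name `fastq_read_stats` is hypothetical, and the sketch assumes the standard 4-lines-per-record FASTQ layout.

```bash
# Summarise a gzipped FASTQ file: read count, shortest, longest and mean read length.
# Assumes 4 lines per record, where line 2 of each record is the sequence.
fastq_read_stats() {
  gzip -cd "$1" | awk '
    NR % 4 == 2 {
      n++
      len = length($0)
      sum += len
      if (n == 1 || len < min) min = len
      if (len > max) max = len
    }
    END {
      if (n > 0) printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
    }'
}
```

It could be called as `fastq_read_stats $COURSES/BIN310/assignment2/fastq/<username>_R1.fq.gz`, which reads the file from where it is.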
Compute an assembly of your data. Make certain your shell scripts output to the `$SCRATCH` disk, and copy only the fasta-files with contigs back to your working directory. Remember to delete from `$SCRATCH` the stuff you don’t need, once you are done.
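The pattern of writing to scratch, copying back only the contigs, and cleaning up could look roughly like the sketch below. This text does not name an assembler, so the use of SPAdes (`spades.py` with its `--isolate` mode) is an assumption here, as are the function name `assemble_on_scratch` and the `THREADS` variable. On orion you may need to launch the software differently (for example through a container or a SLURM script).

```bash
# Assemble paired-end reads in a scratch folder, keep only the contigs, then clean up.
# ASSUMPTION: SPAdes is the assembler and spades.py is on the PATH.
# Arguments: R1 file, R2 file, scratch output folder, destination fasta file.
assemble_on_scratch() {
  r1=$1
  r2=$2
  scratch_dir=$3
  dest=$4
  mkdir -p "$scratch_dir" || return 1
  # SPAdes writes all its output, including contigs.fasta, to the -o folder.
  spades.py --isolate -1 "$r1" -2 "$r2" -o "$scratch_dir" \
    --threads "${THREADS:-4}" || return 1
  # Copy only the contigs back to the working directory.
  cp "$scratch_dir/contigs.fasta" "$dest" || return 1
  # Delete everything else from scratch once the contigs are safe.
  rm -r -- "$scratch_dir"
}
```

A call could then look like `assemble_on_scratch <R1 file> <R2 file> $SCRATCH/assembly contigs.fasta`, where the folder and file names are placeholders.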
In order to assess the quality of the assembly, compare it to 3 different reference genomes. You find these in `$COURSES/BIN310/assignment2/` as well. They all have the species name `Pseudomonas_aeruginosa` in their names, and the file extension `.fna`. The strain `PAO1` is the NCBI Reference Genome for this species. The strain `DSM 50071` is a typical lab-strain you can purchase. The strain `Carb01 63` is included because it has a larger genome than the other two.
Use the software `quast`, with these genomes as references, one by one. Output to different folders.
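One way to run the evaluation once per reference, each to its own folder, is sketched below. It assumes the executable is called `quast.py` and is on the PATH (on orion it may have to be started another way), and the function name `run_quast_per_reference` and the `quast_<reference name>` folder naming are just illustrative choices.

```bash
# Run quast once per reference genome, writing each result to a separate folder.
# ASSUMPTION: quast.py is on the PATH.
# Arguments: contigs fasta file, folder with reference genomes, root folder for output.
run_quast_per_reference() {
  contigs=$1
  ref_dir=$2
  out_root=$3
  for ref in "$ref_dir"/*Pseudomonas_aeruginosa*.fna; do
    # Skip the unexpanded pattern if no reference file matches.
    [ -e "$ref" ] || continue
    name=$(basename "$ref" .fna)
    quast.py -r "$ref" -o "$out_root/quast_$name" "$contigs" || return 1
  done
}
```

For example, `run_quast_per_reference contigs.fasta $COURSES/BIN310/assignment2 $SCRATCH/quast` would produce one output folder per `.fna` file with the species name in its file name.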
Include some relevant results/figures in your report, and state something about the assembly quality based on your results. NB! `quast` outputs its results in a folder, and in this you find a subfolder named `basic_stats` in which you may find some useful figures (`.pdf` files).
Once you have the assembled contigs, make a BLAST search in the contigs for the 7 MLST markers for this species. These 7 housekeeping genes (acsA, aroE, guaA, mutL, nuoD, ppsA, trpE) should be present in all Pseudomonas aeruginosa, but we want to know exactly which variants are in your genome.
The fasta file `$COURSES/BIN310/assignment2/MLST_alleles.fasta` contains many variants of each of the 7 genes. Search with these as queries against the contigs, and use the `blastn` version of BLAST (not `tblastn`). In order to see if a marker variant is found with 100% identity over its full length, the output must contain the percent identity, the length of each marker sequence, and the length of the alignment. Add the following to your command line to ensure this:
-outfmt '6 qseqid sseqid qlen length pident bitscore'
This means your BLAST output will have 6 columns:

- `qseqid`: identifier for query sequence (the marker variant)
- `sseqid`: identifier for contig (not really important this time)
- `qlen`: length of query sequence
- `length`: length of alignment
- `pident`: percent identity
- `bitscore`: the bit-score of the alignment

Many of the variants for each marker gene will match your contigs more or less well. Find the best match for each of the marker genes, i.e. the one with the largest bit score. Verify whether this match is perfect, i.e. whether it has 100% identity and an alignment length equal to the length of the marker sequence.
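Picking the best hit per marker from the 6-column BLAST table can be sketched with `awk`, as below. The sketch assumes the query identifiers look like `<gene>_<number>` (for example `acsA_1`); check the header lines in `MLST_alleles.fasta` and adjust the `split` if they differ. The function name `best_mlst_hits` and the file names in the comments are hypothetical.

```bash
# The table itself could be produced with something like (file names are placeholders):
#   blastn -query $COURSES/BIN310/assignment2/MLST_alleles.fasta -subject contigs.fasta \
#     -outfmt '6 qseqid sseqid qlen length pident bitscore' -out mlst_hits.txt

# For each marker gene, keep the hit with the largest bit score and flag whether
# it is perfect (100% identity and alignment length equal to query length).
# ASSUMPTION: qseqid has the form <gene>_<number>.
# On ties, the first hit in the file is kept.
best_mlst_hits() {
  awk -F'\t' '
    {
      split($1, parts, "_")
      gene = parts[1]
      if (!(gene in best) || $6 + 0 > best[gene] + 0) {
        best[gene] = $6
        line[gene] = $0
      }
    }
    END {
      for (gene in line) {
        split(line[gene], f, "\t")
        status = "imperfect"
        if (f[5] + 0 == 100 && f[3] + 0 == f[4] + 0) status = "perfect"
        print gene "\t" line[gene] "\t" status
      }
    }' "$1" | sort
}
```

Running `best_mlst_hits mlst_hits.txt` prints one line per marker gene: the gene name, the 6 BLAST columns of its best hit, and `perfect` or `imperfect`.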
In your report, list the sequence type of your genome, i.e. the text in the header line of each of the 7 markers that gave the perfect hits. Also, give short answers to the following questions:
In order to pass this assignment, the RMarkdown report must contain: