This first assignment is based on the prerequisites of this course, more specifically basic R coding and basic statistics. It is an example of how we from time to time need to write short scripts to collect data that are not immediately available. You will extract some pieces of information (numerical data) from some log-files, and then make some plots and predictions from these data.
Make an RMarkdown document in RStudio (on orion), and write a short report where you add the code needed and answer the questions below, including the plots. Then Knit the document to an HTML-file, download this HTML-file to your local computer, and hand it in on Canvas.
In bioinformatics we sometimes need to run some really heavy jobs on the computer, or on the High Performance Computing cluster. In such cases it is important to have some indication of how much memory and time is needed in order to complete these jobs.
In this assignment you will predict the memory usage and time usage for assemblies based on the file size of the data we want to assemble. We will talk more about assemblies later in BIN310; for now, just think of this as a computer job that requires a lot of memory and time, depending on the size of the input data.
Here you must collect information about memory and time usage from some log-files. These log-files are results from data sets we have already assembled. Our problem is that the values we seek are ‘hidden’ inside these files, and we need some coding to collect them. You then arrange the data in a table, make some plots, and fit a simple linear regression model to predict both memory and time usage for some other jobs we have not yet started.
The file /mnt/courses/BIN310/assignment1/file_sizes.txt is a tab-separated file listing the size (bytes) of the fastq-files for 32 samples we have already assembled. Read this into R. This should be a table with 2 columns and 32 rows. Note that each sample is identified by a number. The sizes are in bytes; convert them to gigabytes.
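As a rough sketch, the reading could look as below. The column names sample_id and file_size are an assumption made here, so inspect the file first and adjust accordingly.

```r
library(tidyverse)

# Read the tab-separated file; assuming it has a header line with the
# columns sample_id and file_size (verify by inspecting the file first)
file_size_tbl <- read_delim("/mnt/courses/BIN310/assignment1/file_sizes.txt",
                            delim = "\t") %>%
  mutate(file_size = file_size / 1e9)   # convert bytes to gigabytes
```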
The file_size value for each sample is the sum of the sizes of the two compressed fastq-files (the R1 and R2 files). Once we have a data set, we can immediately compute this by adding the two file sizes. From this we want to predict time and memory usage. To reveal how much memory and time it takes to assemble a data set of a given size, we need to collect the memory and time usage for these data sets.
In the folder /mnt/courses/BIN310/assignment1/ you also find 32 log-files (extension .log). These are text files with the output from the assembly of the 32 samples, one file for each sample. These files contain a lot of text, but we are only interested in 3 particular pieces of information in each file.
Use the list.files() function to collect all the file names. You then need to loop over all the log-files. For each log-file, read it in line by line using readLines(). This will give you a vector of texts, one text for each line.
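One possible skeleton for this loop is sketched below. The object names (log_dir, log_files, log_lines and the three result vectors) are choices made here, not prescribed by the assignment.

```r
log_dir <- "/mnt/courses/BIN310/assignment1/"
log_files <- list.files(log_dir, pattern = "\\.log$", full.names = TRUE)

# Empty vectors to be filled, one element per log-file
sample_id <- memory <- time <- numeric(length(log_files))

for (i in seq_along(log_files)) {
  log_lines <- readLines(log_files[i])   # one text element per line in the file
  # ...extract sample_id[i], memory[i] and time[i] from log_lines here...
}
```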
To find the number identifying the sample (sample_id in the file size table), find the line that contains the text "The file_stem is". There should be only one line in each file with this particular text. You can use str_which() to find which line this is. Then, word number 4 on this line should be a text that looks something like "AQUAeD_20230201_metagenomeAQUAeD96_37", where the numbers vary between files. Collect this word by using the function word() in R on the line identified. Notice that this text is itself a set of 4 words, separated by "_", and the last of these is the number you must collect. Again, use word() to collect this. You may safely assume it is always word number 4 if you split the text at each "_". Remember to convert this from text to a number once you have collected it.
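A sketch of this extraction for a single log-file could look as below, where log_lines and i come from the loop skeleton above.

```r
library(stringr)   # part of the tidyverse, loaded explicitly for clarity

idx <- str_which(log_lines, "The file_stem is")       # the line with this text
file_stem <- word(log_lines[idx], 4)                  # e.g. "AQUAeD_20230201_metagenomeAQUAeD96_37"
sample_id[i] <- as.numeric(word(file_stem, 4, sep = "_"))  # last of the 4 "_"-separated words
```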
Next, use the same ideas to find the memory usage. Look for the line with the text "Memory statistics". Then, once you know which line this is (e.g. line number 100), add 3 to this, and you have the line with the memory information (e.g. line number 103). The line with the memory value may typically look like "8784604.batch 512.0 265.9". It is the last word of these (here "265.9") you should collect. This is the memory usage given in gigabytes. Note there are several blanks between the words on this line! Use the regular expression " +" as the sep argument in word() to tell word() to consider consecutive blanks as a single separator.
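Continuing the sketch inside the same loop, the memory value could be picked out like this, assuming the log-files all follow the layout described above.

```r
idx <- str_which(log_lines, "Memory statistics")
mem_line <- log_lines[idx + 3]                            # the line 3 below
memory[i] <- as.numeric(word(mem_line, -1, sep = " +"))   # last word = memory in gigabytes
```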
Finally, the time is collected in a similar way. Find the line with the text "Elapsed wallclock time". It may look something like "Elapsed wallclock time: 1.6 days", and you collect the last two words here. We want the time expressed in hours, and therefore you also need the unit, here days, which varies between the files. Thus, if the unit is days (like here), the time is 1.6 multiplied by 24. If it is minutes, you need to divide by 60, and so on.
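A hedged sketch of this step is shown below; the set of possible units (days, hours, minutes) is an assumption about what the log-files may contain, so adjust if you encounter others.

```r
idx <- str_which(log_lines, "Elapsed wallclock time")
value <- as.numeric(word(log_lines[idx], -2, sep = " +"))  # e.g. 1.6
unit  <- word(log_lines[idx], -1, sep = " +")              # e.g. "days"
time[i] <- case_when(unit == "days"    ~ value * 24,
                     unit == "hours"   ~ value,
                     unit == "minutes" ~ value / 60)       # convert to hours
```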
Collect these pieces of information from each file. Remember to convert the texts to numbers (as.numeric()). Finally, join (use full_join()) this with the file sizes table from above, and store the result as a new table. Thus, you should have a table with 4 columns (sample_id, file_size, memory and time) and 32 rows.
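One possible way to combine the collected vectors with the file size table, assuming the object and column names used in the sketches above:

```r
log_tbl <- tibble(sample_id = sample_id,
                  memory    = memory,
                  time      = time)
full_tbl <- full_join(file_size_tbl, log_tbl, by = "sample_id")
```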
Plot memory versus file_size as a scatter-plot (points), and then time versus file_size in the same way. Add proper axis labels. Try to make both appear as two panels in the same figure, but two separate figures are also OK.
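Below is one way to get the two scatter-plots as panels in the same figure, using pivot_longer() and facet_wrap(); two separate ggplot() calls are of course also fine. The column names follow the sketches above.

```r
full_tbl %>%
  pivot_longer(cols = c(memory, time), names_to = "quantity", values_to = "value") %>%
  ggplot(aes(x = file_size, y = value)) +
  geom_point() +
  facet_wrap(~ quantity, scales = "free_y") +
  labs(x = "File size (gigabytes)", y = "Memory (gigabytes) / time (hours)")
```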
Fit a simple linear regression model using memory as the response variable and file_size as the explanatory variable; use the lm() function. The largest data set we have not yet assembled has a file_size of 57 gigabytes (57 billion bytes). What is the predicted memory usage for this data set, based on your fitted model?
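A minimal sketch of the fitting and prediction, assuming the table and column names from above:

```r
mem_mod <- lm(memory ~ file_size, data = full_tbl)
predict(mem_mod, newdata = data.frame(file_size = 57))   # predicted memory at 57 GB
```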
Fit also a similar model, but use time as the response instead of memory. Then also predict the time usage for the data set with a file size of 57 gigabytes.
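The same idea for the time model:

```r
time_mod <- lm(time ~ file_size, data = full_tbl)
predict(time_mod, newdata = data.frame(file_size = 57))  # predicted time at 57 GB
```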
Based on these results, discuss how much memory you would reserve for the assembly of the largest data set. Hint: Consider the uncertainty in the prediction; reserving too little memory will make the job crash.
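As a pointer for this discussion, predict() can also return a prediction interval around the point prediction; a minimal example, assuming the model object mem_mod from the sketch above:

```r
# Prediction interval for the memory usage of a 57 GB data set
predict(mem_mod, newdata = data.frame(file_size = 57), interval = "prediction")
```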
As a guide, I will add rubrics at the end of each project. In order to pass, the RMarkdown report must contain:

- Code for reading the file_sizes.txt file into R.