Introduction

This first assignment is based on the prerequisites of this course, more specifically basic R coding and basic statistics. It is an example of how we, from time to time, need to write short scripts to collect data that are not immediately available. You will extract some pieces of information (numerical data) from some log-files, and then make some plots and predictions from these data.

Make an RMarkdown document in RStudio (on orion), and write a short report where you add the code needed and answer the questions below, including the plots. Then Knit the document to an HTML-file, download this HTML-file to your local computer, and hand it in on Canvas.



The task - memory and time usage

In bioinformatics we sometimes need to run some really heavy jobs on the computer, or on a High Performance Computing cluster. In such cases it is important to have some indication of

  • how much memory the job needs
  • how much time the job will take

…in order to complete these jobs.

In this assignment you will predict the memory usage and time usage for assemblies based on the file size of the data we want to assemble. We will talk more about assemblies later in BIN310; for now, just think of this as a computer job that requires a lot of memory and time, depending on the size of the input data.

Here you must collect information about memory and time usage from some log-files. These log-files are the results from data sets we have already assembled. Our problem is that the values we seek are ‘hidden’ inside these files, and we need some coding to collect them. You then arrange the data in a table, make some plots, and fit a simple linear regression model to predict both the memory and time usage of some other jobs we have not yet started.

File sizes

The file /mnt/courses/BIN310/assignment1/file_sizes.txt is a tab-separated file listing the size (bytes) of the fastq-files for 32 samples we have already assembled. Read this into R. This should give you a table with 2 columns and 32 rows. Note that each sample is identified by a number. The sizes are in bytes; convert them to gigabytes.
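A minimal sketch of this step, assuming the file has a header row and that the columns are named sample_id and file_size (adjust the names to whatever you actually find in the file):

```r
library(tidyverse)

# Read the tab-separated file and convert bytes to gigabytes (1 GB = 1e9 bytes)
file_size_tbl <- read_delim("/mnt/courses/BIN310/assignment1/file_sizes.txt",
                            delim = "\t") %>%
  mutate(file_size = file_size / 1e9)
```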

The file_size value for each sample is the sum of the sizes of the two compressed fastq-files (the R1 and R2 files). Once we have a data set, we can immediately compute this by adding the two file sizes. From this we want to predict time and memory usage. To learn how much memory and time it takes to assemble a data set of a given size, we need to collect the memory and time usage for these already assembled data sets.

Extract information from a log-file

In the folder /mnt/courses/BIN310/assignment1/ you also find 32 log-files (extension .log). These are text files with the output from the assembly of the 32 samples, one file for each sample. These files contain a lot of text, but we are only interested in 3 particular pieces of information in each file.

  • The number of the sample. We need this in order to know which sample each log-file belongs to.
  • The number describing the memory usage.
  • The number describing the time usage, as well as the time-unit used.

Use list.files() to collect all the file names.
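For example, something like this should work (the pattern argument is optional, but restricts the listing to the .log files):

```r
# Full paths to all the .log files in the assignment folder
log_files <- list.files(path = "/mnt/courses/BIN310/assignment1",
                        pattern = "\\.log$", full.names = TRUE)
```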

You then need to loop over all log-files. For each log-file, read it in line by line using readLines(). This will give you a vector of texts, one text for each line.
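A sketch of the loop structure; the extraction steps described below go inside the loop body, and result_tbl will collect one row per file:

```r
result_tbl <- NULL                       # will collect one row per log-file
for (i in seq_along(log_files)) {
  lines <- readLines(log_files[i])       # a vector of texts, one per line
  # ...extract sample_id, memory and time from lines, as described below...
}
```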

To find the number identifying the sample (sample_id in the file size table), find the line that contains the text "The file_stem is". There should be only one line in each file with this particular text. You can use str_which() to find which line this is. Then, word number 4 on this line should be a text that looks something like "AQUAeD_20230201_metagenomeAQUAeD96_37", where the numbers vary between files. Collect this word by using the function word() in R on the line you identified. Notice that this text is again a set of 4 words, separated by "_", and the last of these is the number you must collect. Again, use word() to collect this. You may safely assume it is always word number 4 if you split the text at each "_". Remember to convert this from text to a number once you have collected it.
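Continuing inside the loop, where lines holds the contents of one log-file, this could look something like the sketch below (str_which() and word() come from the stringr package, which is loaded with the tidyverse):

```r
idx <- str_which(lines, "The file_stem is")            # the single line with this text
stem_txt <- word(lines[idx], 4)                        # e.g. "AQUAeD_20230201_metagenomeAQUAeD96_37"
sample_id <- as.numeric(word(stem_txt, 4, sep = "_"))  # last piece after splitting at "_"
```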

Next, use the same ideas to find the memory usage. Look for the line with the text "Memory statistics". Then, once you know which line this is (e.g. line number 100), add 3 to this, and you have the line with the memory information (e.g. line number 103). The line with the memory value may look something like "8784604.batch 512.0 265.9". It is the last of these words (here "265.9") you should collect. This is the memory usage given in gigabytes. Note that there are several blanks between the words in this line! Use the regular expression sep=" +" in word() to tell word() to treat consecutive blanks as a single separator.
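A sketch of this step, still assuming lines holds the contents of one log-file:

```r
idx <- str_which(lines, "Memory statistics")
mem_line <- lines[idx + 3]                            # e.g. "8784604.batch   512.0   265.9"
memory <- as.numeric(word(mem_line, -1, sep = " +"))  # last word = memory usage in gigabytes
```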

Finally, the time is collected in a similar way. Find the line with the text "Elapsed wallclock time". It may look something like "Elapsed wallclock time: 1.6 days", and you collect the last two words here. We want the time expressed in hours, and therefore you also need the unit, here days. This varies between the files. Thus, if the unit is days (like here), the time is 1.6 multiplied by 24. If it is minutes, you need to divide by 60, and so on.
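A possible sketch, assuming the unit is always written as "days", "hours" or "minutes" (inspect the files and adjust if other spellings occur); case_when() comes from dplyr, loaded with the tidyverse:

```r
idx <- str_which(lines, "Elapsed wallclock time")
time_value <- as.numeric(word(lines[idx], -2, sep = " +"))  # e.g. 1.6
time_unit  <- word(lines[idx], -1, sep = " +")              # e.g. "days"
time_hours <- case_when(
  time_unit == "days"    ~ time_value * 24,
  time_unit == "hours"   ~ time_value,
  time_unit == "minutes" ~ time_value / 60,
  TRUE                   ~ NA_real_                         # flag anything unexpected
)
```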

Collect these pieces of information from each file. Remember to convert the texts to numbers (as.numeric()). Finally, join (use full_join()) this with the file size table from above, and store the result as a new table. Thus, you should have a table with 4 columns (sample_id, file_size, memory and time) and 32 rows.
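One way to finish the loop body and then do the join, assuming sample_id, memory and time_hours were computed as sketched above and result_tbl was initialised to NULL before the loop:

```r
# Inside the loop: append one row with the values extracted from the current log-file
result_tbl <- bind_rows(result_tbl,
                        tibble(sample_id = sample_id,
                               memory    = memory,
                               time      = time_hours))

# After the loop: join with the file size table on the shared sample_id column
full_tbl <- full_join(file_size_tbl, result_tbl, by = "sample_id")
```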



Plot

Plot memory versus file_size as a scatter-plot (points), and then time versus file_size in the same way. Add proper axis labels. Try to make both appear as two panels in the same figure, but two separate figures are also OK.
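One way to get two panels in a single figure is to reshape the table to long format and use facets in ggplot2 (this sketch assumes the joined table is called full_tbl, as above):

```r
full_tbl %>%
  pivot_longer(c(memory, time), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = file_size, y = value)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "File size (gigabytes)",
       y = "Memory usage (gigabytes) / time usage (hours)")
```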



Predict memory and time usage

Fit a simple linear regression model using memory as the response variable and file_size as the explanatory variable; use the lm() function. The largest data set we have not yet assembled has a file_size of 57 gigabytes (57 billion bytes). What is the predicted memory usage for this data set, based on your fitted model?
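A sketch of the memory model and its prediction, using the full_tbl from above:

```r
fit_memory <- lm(memory ~ file_size, data = full_tbl)
predict(fit_memory, newdata = data.frame(file_size = 57))  # predicted memory usage (gigabytes)
```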

Fit a similar model as well, but use time as the response instead of memory. Then also predict the time usage for the data set with a file size of 57 gigabytes.
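And similarly for time:

```r
fit_time <- lm(time ~ file_size, data = full_tbl)
predict(fit_time, newdata = data.frame(file_size = 57))    # predicted time usage (hours)
```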

Based on these results, discuss how much memory you would reserve for the assembly of the largest data set. Hint: Consider the uncertainty in the prediction; reserving too little memory will make the job crash.
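One way to quantify this uncertainty is a prediction interval; its upper limit gives an indication of how much extra memory it may be wise to reserve (sketch based on fit_memory from above):

```r
# Fit, lower and upper 95% prediction limits for a new data set of 57 gigabytes
predict(fit_memory, newdata = data.frame(file_size = 57),
        interval = "prediction", level = 0.95)
```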



Rubrics

As a guide, I will add rubrics at the end of each project. In order to pass this assignment, the RMarkdown report must contain: