Assigned | Wednesday Sept 23 |
Program Due | 11:59pm, Tuesday Oct 6 |
Points | 40 points - Input Tests
10 points - valgrind output (definitely lost, probably lost) 10 points - makefile, design, style 60 points total |
Updates |
You love The C Programing Language text so much that you own 10 copies of it, but your evil nemesis, who knows how dear it is to you, has taken a pair of scissors to every one of your copies and randomly cut each into thousands of pieces. Amidst your tears, you attempt to reassemble the fragments in order to recreate a copy of the original text. Little did you know that solving this problem would prep you for a career as a computational biologist.
For this assignment, you will write a program that reads a file of text fragments and attempts to reconstruct the original document from the fragments. The fragments were created by duplicating the original document many times over and dividing each copy into pieces. The fragments overlap one another and your program will search for and align the overlapping string fragments to try to reassemble the fragments into their original order.
This task is based on a technique used for genome shotgun sequencing in the Human Genome Project. Current sequencing technology can only sequence chunks of 500 or so base pairs, yet the human genome is billions of pairs long. Thus, the DNA to be sequenced is cloned several times and the clones are randomly cut into sequenceable-sized chunks. The reassembly step aligns those chunks to reconstruct the original strand. Nova had a fascinating show on Cracking the Code of Life. Their online animation of a genome sequencer explains the process of gene sequencing. The program you will write simulates a simplified version of the assembly step.
The program operates in a series of rounds. In each round, it searches the collection to find the pair of fragments with the maximal overlap match and merges those two fragments. This match/merge operation decreases the total number of fragments by one. The program repeats the match/merge operation until only one fragment remains.
Matching a pair of fragments means finding a position to align the two for the maximal overlapping match. In each round, you identify the pair of fragments in the collection with longest such overlap and merge them.
Consider a collection with these fragments (extra spaces were inserted between letters for clarity) :
s1: a l l i s w e l l s2: e l l t h a t e n s3: h a t e n d s4: t e n d s w e l l
On the first round, the longest overlap found is a 6-character overlap between s2 and s3 when aligned as below:
e l l t h a t e n h a t e n dThe fragments s2 and s3 would be removed and replaced with their merged result s5:
s5: e l l t h a t e n dThe new, merged fragment (s5) is a candidate for future rounds, the two fragments it was composed from (s2 and s3) are no longer considered. On the next round, the longest overlap is 5 characters between s5 and s4 aligned as below:
e l l t h a t e n d t e n d s w e l lThe fragments s5 and s4 would be removed and replaced with their merged result s6:
s6: e l l t h a t e n d s w e l lThe last round merges s1 and s6 in their maximal overlap alignment of 3 characters:
a l l i s w e l l e l l t h a t e n d s w e l lThe one remaining fragment is the final result:
a l l i s w e l l t h a t e n d s w e l l
Note another kind of match possibility is when one fragment is contained within another (i.e is a substring of another). Consider:
s1: s w e l l t h a s2: e l l
In this case, s2 is completely contained within s1. When these two fragments are merged, the result is simply s1.
Your program can break ties arbitrarily. If there are several pairs with the same maximal overlap, choose whichever is easiest for you. If there are two equally maximal alignments for a pair (e.g. abxy and xyab merges to either abxyab or xyabxy), you can choose either. Similarly if two fragments have no overlap, their merge is simply the concatentation of the two strings in either order. Because of differences in how ties are broken, your final result might differ somewhat from our sample program. This is no cause for alarm.
Note that all characters in the overlap must match exactly in sequence. Finding an alignment between two DNA strands is complicated by the fact that the data is very "noisy" and fragments may have gaps and mutations that need to be considered. In our simplified process, you can assume the data is "clean".
Optimal reassembly is an example of the shortest common superstring problem, a classic problem in theoretical computer science. Given a set of strings S, find the shortest string that contains all the strings in S as substrings. The solution for shortest superstring is NP-hard, which means it is believed that no polynomial (i.e. "efficient") solution exists. Our approach is called a greedy strategy because it finds a local maxima (the longest overlap) and assumes making this locally optimal choice will eventually lead to the globally optimal result. It might, but it might not. Our approach will find a common superstring, but it is not guaranteed to be the shortest. You might find it interesting to consider what changes would be required for your program to find an optimal result and what effect those changes would have on the running time.
#fragment 1##fragment 2#...#fragment N#Each fragment is wrapped in a starting and ending # marker. A fragment can contain any character other than # (including spaces, newlines, etc.) This format is designed for reading fragment by fragment. Hint: consider using fscanf with a character exclusion set ( %[^#] ). And while I'm hinting, remember why you want to specify a limit whenever reading into a string.
valgrind --tool=memcheck --leak-check=full program-to-execute argsNote that your program will run much more slowly under valgrind because of all of the extra checking. Your goal is to reduce "definitely lost" and "probably lost" to zero. Hint: it's good programming practice (and helpful for debugging) if memory is allocated and freed in the same function (although in some complex applications this may not be possible).