CMSC 313 Project 2

String Fragments

Assigned Wednesday Sept 23
Program Due 11:59pm, Tuesday Oct 6
Points 40 points - Input Tests
10 points - valgrind output (definitely lost, probably lost)
10 points - makefile, design, style
60 points total
Updates  

Ideas used in this assignment came from Rich Pattis at UCIrvine and Owen Astrachan at Duke. Original writeup provided by Julie Zelenski at Stanford.
Objectives
Completing this assignment should further your proficiency in:
  1. editing, compiling, testing, using makefiles, and debugging C programs under Unix
  2. writing code to manipulate C-strings
  3. using the C facilities for dynamic memory management (malloc & free)
Project Policy
This project may be a team project. You may work together with one other student from either section on this project. If you wish to work alone, that's fine too. If you do work with a partner, you must include both student's names and UMBC userids in the file header comment of all files that you submit. Only one member of the team should submit the files for grading. Both students on the team will receive the same grade for the project. Please be sure to revisit the course project collaboration policy to determine if/how teams may help each other.
Project Description

The problem

You love The C Programing Language text so much that you own 10 copies of it, but your evil nemesis, who knows how dear it is to you, has taken a pair of scissors to every one of your copies and randomly cut each into thousands of pieces. Amidst your tears, you attempt to reassemble the fragments in order to recreate a copy of the original text. Little did you know that solving this problem would prep you for a career as a computational biologist.

Your assignment

For this assignment, you will write a program that reads a file of text fragments and attempts to reconstruct the original document from the fragments. The fragments were created by duplicating the original document many times over and dividing each copy into pieces. The fragments overlap one another and your program will search for and align the overlapping string fragments to try to reassemble the fragments into their original order.

This task is based on a technique used for genome shotgun sequencing in the Human Genome Project. Current sequencing technology can only sequence chunks of 500 or so base pairs, yet the human genome is billions of pairs long. Thus, the DNA to be sequenced is cloned several times and the clones are randomly cut into sequenceable-sized chunks. The reassembly step aligns those chunks to reconstruct the original strand. Nova had a fascinating show on Cracking the Code of Life. Their online animation of a genome sequencer explains the process of gene sequencing. The program you will write simulates a simplified version of the assembly step.

Greedy match and merge

The program operates in a series of rounds. In each round, it searches the collection to find the pair of fragments with the maximal overlap match and merges those two fragments. This match/merge operation decreases the total number of fragments by one. The program repeats the match/merge operation until only one fragment remains.

Matching a pair of fragments means finding a position to align the two for the maximal overlapping match. In each round, you identify the pair of fragments in the collection with longest such overlap and merge them.

Consider a collection with these fragments (extra spaces were inserted between letters for clarity) :

s1:   a l l   i s   w e l l
s2:   e l l   t h a t   e n
s3:   h a t   e n d
s4:   t   e n d s   w e l l

On the first round, the longest overlap found is a 6-character overlap between s2 and s3 when aligned as below:

     e l l   t h a t   e n
               h a t   e n d 
The fragments s2 and s3 would be removed and replaced with their merged result s5:
s5:  e l l   t h a t   e n d 
The new, merged fragment (s5) is a candidate for future rounds, the two fragments it was composed from (s2 and s3) are no longer considered. On the next round, the longest overlap is 5 characters between s5 and s4 aligned as below:
     e l l   t h a t   e n d
                   t   e n d s   w e l l
The fragments s5 and s4 would be removed and replaced with their merged result s6:
s6:  e l l   t h a t   e n d s   w e l l
The last round merges s1 and s6 in their maximal overlap alignment of 3 characters:
      a l l   i s   w e l l
                      e l l   t h a t   e n d s   w e l l
The one remaining fragment is the final result:
      a l l   i s   w e l l   t h a t   e n d s   w e l l

Note another kind of match possibility is when one fragment is contained within another (i.e is a substring of another). Consider:

s1:    s   w e l l   t  h  a
s2:    e l l

In this case, s2 is completely contained within s1. When these two fragments are merged, the result is simply s1.

Your program can break ties arbitrarily. If there are several pairs with the same maximal overlap, choose whichever is easiest for you. If there are two equally maximal alignments for a pair (e.g. abxy and xyab merges to either abxyab or xyabxy), you can choose either. Similarly if two fragments have no overlap, their merge is simply the concatentation of the two strings in either order. Because of differences in how ties are broken, your final result might differ somewhat from our sample program. This is no cause for alarm.

Note that all characters in the overlap must match exactly in sequence. Finding an alignment between two DNA strands is complicated by the fact that the data is very "noisy" and fragments may have gaps and mutations that need to be considered. In our simplified process, you can assume the data is "clean".

A little computer science theory

Optimal reassembly is an example of the shortest common superstring problem, a classic problem in theoretical computer science. Given a set of strings S, find the shortest string that contains all the strings in S as substrings. The solution for shortest superstring is NP-hard, which means it is believed that no polynomial (i.e. "efficient") solution exists. Our approach is called a greedy strategy because it finds a local maxima (the longest overlap) and assumes making this locally optimal choice will eventually lead to the globally optimal result. It might, but it might not. Our approach will find a common superstring, but it is not guaranteed to be the shortest. You might find it interesting to consider what changes would be required for your program to find an optimal result and what effect those changes would have on the running time.

Implementation details
Getting started
Copy the file Project2.tar from Mr.Frey's public directory (/afs/umbc.edu/users/f/r/frey/pub/313/proj2) to your working directory for Project2. A .tar file is a Unix "Tape ARchive" file, originally used to compact files before they were written to magnetic tape for off-line storage (archiving). Now days, tar is used to package files together for copying, emailing, etc. You could of course, use a compression program like gzip to make the .tar file smaller. Unpack the file using the command tar xvf Project2.tar. This command will deposit several files in your directory. Project2.tar contains a skeleton makefile, a working executable named Project2, and a few sample fragment data files. You will want to create additional testing data files during development. Feel free to share them with your friends and/or post them as an attachment to a message on Blackboard.
Submit
When you're done and ready to hand in your finished submission, don't forget to submit your work! Submit all .c files, .h files, and your makefile. Use the UMBC submit command. The name of this project is Proj2. Remember that only one member of your team should submit the files.