CMSC 313 Project 2

String Fragments

Assigned	Wednesday Sept 23
Program Due	11:59pm, Tuesday Oct 6
Points	40 points - Input Tests 10 points - valgrind output (definitely lost, probably lost) 10 points - makefile, design, style 60 points total
Updates

Ideas used in this assignment came from Rich Pattis at UCIrvine and Owen Astrachan at Duke. Original writeup provided by Julie Zelenski at Stanford.

Objectives

Completing this assignment should further your proficiency in:

editing, compiling, testing, using makefiles, and debugging C programs under Unix
writing code to manipulate C-strings
using the C facilities for dynamic memory management (malloc & free)

Project Policy

This project may be a team project. You may work together with one other student from either section on this project. If you wish to work alone, that's fine too. If you do work with a partner, you must include both student's names and UMBC userids in the file header comment of all files that you submit. Only one member of the team should submit the files for grading. Both students on the team will receive the same grade for the project. Please be sure to revisit the course project collaboration policy to determine if/how teams may help each other.

Project Description

The problem

You love The C Programing Language text so much that you own 10 copies of it, but your evil nemesis, who knows how dear it is to you, has taken a pair of scissors to every one of your copies and randomly cut each into thousands of pieces. Amidst your tears, you attempt to reassemble the fragments in order to recreate a copy of the original text. Little did you know that solving this problem would prep you for a career as a computational biologist.

Your assignment

For this assignment, you will write a program that reads a file of text fragments and attempts to reconstruct the original document from the fragments. The fragments were created by duplicating the original document many times over and dividing each copy into pieces. The fragments overlap one another and your program will search for and align the overlapping string fragments to try to reassemble the fragments into their original order.

This task is based on a technique used for genome shotgun sequencing in the Human Genome Project. Current sequencing technology can only sequence chunks of 500 or so base pairs, yet the human genome is billions of pairs long. Thus, the DNA to be sequenced is cloned several times and the clones are randomly cut into sequenceable-sized chunks. The reassembly step aligns those chunks to reconstruct the original strand. Nova had a fascinating show on Cracking the Code of Life. Their online animation of a genome sequencer explains the process of gene sequencing. The program you will write simulates a simplified version of the assembly step.

Greedy match and merge

The program operates in a series of rounds. In each round, it searches the collection to find the pair of fragments with the maximal overlap match and merges those two fragments. This match/merge operation decreases the total number of fragments by one. The program repeats the match/merge operation until only one fragment remains.

Matching a pair of fragments means finding a position to align the two for the maximal overlapping match. In each round, you identify the pair of fragments in the collection with longest such overlap and merge them.

Consider a collection with these fragments (extra spaces were inserted between letters for clarity) :

s1:   a l l   i s   w e l l
s2:   e l l   t h a t   e n
s3:   h a t   e n d
s4:   t   e n d s   w e l l

On the first round, the longest overlap found is a 6-character overlap between s2 and s3 when aligned as below:

     e l l   t h a t   e n
               h a t   e n d

The fragments s2 and s3 would be removed and replaced with their merged result s5:

s5:  e l l   t h a t   e n d

The new, merged fragment (s5) is a candidate for future rounds, the two fragments it was composed from (s2 and s3) are no longer considered. On the next round, the longest overlap is 5 characters between s5 and s4 aligned as below:

     e l l   t h a t   e n d
                   t   e n d s   w e l l

The fragments s5 and s4 would be removed and replaced with their merged result s6:

s6:  e l l   t h a t   e n d s   w e l l

The last round merges s1 and s6 in their maximal overlap alignment of 3 characters:

      a l l   i s   w e l l
                      e l l   t h a t   e n d s   w e l l

The one remaining fragment is the final result:

      a l l   i s   w e l l   t h a t   e n d s   w e l l

Note another kind of match possibility is when one fragment is contained within another (i.e is a substring of another). Consider:

s1:    s   w e l l   t  h  a
s2:    e l l

In this case, s2 is completely contained within s1. When these two fragments are merged, the result is simply s1.

Your program can break ties arbitrarily. If there are several pairs with the same maximal overlap, choose whichever is easiest for you. If there are two equally maximal alignments for a pair (e.g. abxy and xyab merges to either abxyab or xyabxy), you can choose either. Similarly if two fragments have no overlap, their merge is simply the concatentation of the two strings in either order. Because of differences in how ties are broken, your final result might differ somewhat from our sample program. This is no cause for alarm.

Note that all characters in the overlap must match exactly in sequence. Finding an alignment between two DNA strands is complicated by the fact that the data is very "noisy" and fragments may have gaps and mutations that need to be considered. In our simplified process, you can assume the data is "clean".

A little computer science theory

Optimal reassembly is an example of the shortest common superstring problem, a classic problem in theoretical computer science. Given a set of strings S, find the shortest string that contains all the strings in S as substrings. The solution for shortest superstring is NP-hard, which means it is believed that no polynomial (i.e. "efficient") solution exists. Our approach is called a greedy strategy because it finds a local maxima (the longest overlap) and assumes making this locally optimal choice will eventually lead to the globally optimal result. It might, but it might not. Our approach will find a common superstring, but it is not guaranteed to be the shortest. You might find it interesting to consider what changes would be required for your program to find an optimal result and what effect those changes would have on the running time.

Implementation details

The makefile You must submit a makefile that allows the graders to enter the command "make Project2" at the Unix prompt and results in the creation of an executable file named Project2. A skeleton makefile is provided for you.
Program operation. The program takes one required argument, the filename containing the fragments. If no filename is given or the file cannot be opened, the program exits with an appropriate error message. Otherwise, it will reassemble the fragments using the greedy maximal match/merge and print the final merged result.
Fragment file format. The input files contain fragments in this format:
```
#fragment 1##fragment 2#...#fragment N#
```
Each fragment is wrapped in a starting and ending # marker. A fragment can contain any character other than # (including spaces, newlines, etc.) This format is designed for reading fragment by fragment. Hint: consider using fscanf with a character exclusion set ( %[^#] ). And while I'm hinting, remember why you want to specify a limit whenever reading into a string.
Data set size. To simplify things, you can assume that the fragments in the data files are at most 1000 characters and a file will contain at most 10,000 fragments. These are large numbers and it is would be extravagant to preallocate storage to read in 10,000 x 1000 characters when most files will be much smaller. It is acceptable to declare an array of 10,000 char pointers and allocate storage for the strings as needed.
String-handling. Fragments should be represented using C-strings. You may manipulate the string data using character array access, char * pointers, the standard string library functions (strncmp, strcat, sprintf, etc.), or a combination of strategies. By the time you're done, you should have a good understanding of how to handle C-strings!
Memory. Your program should properly deallocate heap memory when no longer in use. Running your program under the valgrind tool can help you find memory leaks (memory that was allocated and abandoned) as well as give you information about misuse of memory. You can learn about valgrind from the links on the CMSC 313 resource page. The quick start page from the valgrind user's manual seems particularly helpful. Running your program under valgrind is as simple as:
```
	valgrind --tool=memcheck --leak-check=full program-to-execute args
```
Note that your program will run much more slowly under valgrind because of all of the extra checking. Your goal is to reduce "definitely lost" and "probably lost" to zero. Hint: it's good programming practice (and helpful for debugging) if memory is allocated and freed in the same function (although in some complex applications this may not be possible).
Efficiency. The focus of this program is not efficiency. It is totally fine (expected, in fact), that each round will re-examine all N^2 pairings to find the maximal overlap. For files with a large number of long fragments, this can get painful. If you're looking for an extra challenge, you are welcome to implement a more streamlined approach that avoids the crazy amount of recalculation. If you do implement a more streamlined approach, submit a README.txt file describing your approach and requesting extra credit from the graders.
Design/style/readability. By now, you should well-versed in good coding practice and we expect your solutions to show thoughtful design and attention to readability. You know the drill: decompose into managable functions, comment where appropriate, factor common code to avoid duplication, no global variables, strive for readability in all ways, and so on.

Getting started

Copy the file Project2.tar from Mr.Frey's public directory (/afs/umbc.edu/users/f/r/frey/pub/313/proj2) to your working directory for Project2. A .tar file is a Unix "Tape ARchive" file, originally used to compact files before they were written to magnetic tape for off-line storage (archiving). Now days, tar is used to package files together for copying, emailing, etc. You could of course, use a compression program like gzip to make the .tar file smaller. Unpack the file using the command tar xvf Project2.tar. This command will deposit several files in your directory. Project2.tar contains a skeleton makefile, a working executable named Project2, and a few sample fragment data files. You will want to create additional testing data files during development. Feel free to share them with your friends and/or post them as an attachment to a message on Blackboard.

Submit

When you're done and ready to hand in your finished submission, don't forget to submit your work! Submit all .c files, .h files, and your makefile. Use the UMBC submit command. The name of this project is Proj2. Remember that only one member of your team should submit the files.