CMSC 421 (Spring 2003) : Project

Programming Project

CMSC 421, Spring 2003

Assigned: 29 Jan 2003

Goals

The aim of this project is rather simple -- to make you work with a real operating system (in the present instance, Linux) , and to understand how to modify and add functionality to it. The project will have three phases, with the final phase due only towards the very end of the semester.

Phase 1 , Due Feb 14, 2003

This project is intended to get your feet wet with building a linux kernel and installing it. We have given out a linux distribution on CD, and also provided a web page with instructions on how to obtain, compile and install the kernel. The basic idea is to download and build your own kernel. Since most current releases of Linux use the 2.4 kernel series, we'll ask you to use the 2.2 series so that it is clearly different. To make the TAs life easier, we ask that all of you use the 2.2 kernel stable at the time of writing this document, viz the 2.2.23.

What to hand in

Absolutely nothing! No credits are assigned for this phase -- this is simply catch up time for those in the class not yet familiar with installing, partitioning, dual booting etc. You may want to use the local Linux Users Group as a resource and perhaps join their installfest. You are free (for this phase only) to seek whatever help you need from whomever you please.

Phase 2, Due Mar 16, 2003

We assume that you have installed the 2.2.23 kernel. This document will describe a new function that we want you to add to the kernel as a system call. The exercise is fairly straightforward, and you'll add in no more than 50 lines of codes/headers etc -- probably less. The idea is to make sure you understand the mechanics of modifying the kernel.

We assume that you are already familiar with makefiles and debugging from classes such as CMSC 341. If not, this will be a considerably more difficult project because you will have to learn to use these tools as well.

We ask you to implement a two new system calls, setmykversion and getmykversion. The former should set the value of a kernel variable called kv421 to an integer that you pass, the later should return the current value of kv421 in an integer variable that you pass to the function as an argument. The function itself is fairly straightforward, but there are several minor differences between writing a kernel function and a regular user space function. We provide some hints below. In addition, you'll need to write a regular program in user space that will prompt the user to give a value, set the qv421 kernel variable to it, read it back, and write it on screen. In effect, a simple driver program to show that your system calls work.

Helpful Hints

By default, code for linux exists in/usr/src/linux. If you have multiple versions of the kernel, the code may exist in /usr/src/linux-version instead.
Recall that a system call is a software trap or interrupt. This means that when adding a new system call, you will need to update the system call table (i.e. the interrupt vector) with a new entry, and generate a stub that will be used by user programs. Look for files arch/i386/kernel/entry.S and inlude/asm/unistd.h, and look for the macro _syscall2 .
In the critical section of the code you write (e.g. when you access the kernel variable(s)), you should consider avoiding race conditions, perhaps by disabling and enabling the interrupts.
A kernel function is just like any other function. The header declaration has a minor difference -- the keyword asmlinkage preceeds the declaration, e.g. asmlinkage void sys_foo(void).
You will need to dereference a user space pointer in the kernel when trying to pass the value back in the argument. Look at the functions copy_to_user() and copy_from_user() to help with this.

Mechanics, and what to hand in

You will need to hand in all of your code and documentation using the submit programs available on the GL cluster. In particular, hand in the tar/gzipped version of a patch to your modified kernel, your driver program, and your readme.

In addition, your design documentation is due one week before the final due date, by 11:59 PM on 7 Mar 2003 . We are enforcing this deadline to ensure that people don't leave the project until the last minute. You are, of course, welcome to visit either the faculty or TA office hours for help; however, one of the first things we'll ask for is your design documentation (unless you're asking for help with that...). You may make changes to your documentation before the full phase 2 handin; however, the design portion of your grade will depend heavily on the design document you handin on Mar 7^th.

Your design documentation, typically 1-2 pages for a project of this size, should include the basic design of your software (what modules will you write, where will you make changes to the kernel etc.), a timeline, as well as details on the testing that you plan to do to ensure that your code works.

The assignment name to use with submit for the documentation is p1doc, and for the project code is p1code.

Rules for Collaboration

This is a pretty straightforward project with a minimal amount of code to be written. We therefore DO NOT expect that there will be occasion for you to discuss solution strategies with others. If you do not do this project yourself, then there is a good chance that you will not learn how to modify the kernel, and will certainly fail in the next phase. Please recall that academic dishonesty will be sternly dealt with.

Final Phase, Due May 12, 2003

Now comes the real stuff! In this phase, you will implement a rudimentary intrusion detection system in the kernel. Attacks on computers are a very significant problem today, and likely to grow worse in the future. Intrusion Detection Systems (IDSs) seek to detect attacks and possibly take remedial action.

There are several different ways of doing this. One interesting approach keeps track of system calls. In particular, the idea is that when someone "rootkits" a computer or subverts an existing process using say a buffer overflow, this can be detected by looking at the system calls the process makes and comparing it to the calls that a "healthy" process would make. A healthy process will typically exhibit a small set of behaviors. Each of these behaviors can be characterized by a "usual" sequence of system calls. A naïve way to do this is to see if a system call was made by a process that it does not make.

More interestingly, we could build a "reference log" of normal sequences for the process. Given a stream of system calls made by the process, a sequence is a group of system calls in the stream as seen from a fixed sezed window. Consider a sample stream of system calls: open(O), read(R), mmap(M), write(W), close(C).

O R M M W R C

This stream as seen from window sized 5 gives following sequences.

O R M M W
R M M W R
M M W R C

If we could construct a reference log for the particular environment, process behavior can be monitored for aberrations. In particular, a sequence that is not in the reference log indicates an aberration. Experimental evidence suggests that short sequences of system calls (of length six or more) provide a signature for normal behavior and that signature has a high probability of being perturbed during intrusions.

In this project, we will implement a simpler scheme. You will instrument the interrupt dispatcher so that each time a system call is made, you will log the pid of the calling process and the actual call made. Write a separate user space process that will monitor this log and compare it with the reference logs of each process. (Construct a reference log by observing the process in healthy state. There are various ways to store the logs e.g. flat file, database etc.). The user space process will look at the last 'k' entries in the log. If a particular system call is made in this window, you will set the bit corresponding to that call to '1'. You must decide value of 'k' that you will use.

Remember that a healthy process will not exhibit exactly the same behavior as in your reference log i.e. the exact system calls made will vary slightly. So in order to avoid false alarms, we must decide if the current sequence is aberrant enough to be classified as intrusion. Thus we need to define "distance" between two sequences. If the distance between the observed sequence and sequence in the log crosses some predefined threshold, then we can flag an intrusion. One simple way to define distance between two sequences is Hamming Distance. The Hamming distance between two ordered sequences is the number of places where the sequences are not identical. e.g. distance between 1010 and 0101 is 4 while distance between 1011 and 1101 is 2. For our purpose, consider a reference log where open, read, write, mmap and close (1, 1, 1, 1, 1) and observed log with open, read and close (1, 1, 0, 0, 1). The hamming distance between these two logs is 2.

For extra credit (20 %), build the system described in Intrusion Detection using Sequences of System Calls . You may implement other systems described in literature after verifying with Dr. Joshi

Mechanics, and what to hand in

In addition, your design documentation is due by 11:59 PM on April 6, 2003. Here are the guidelines for the design document. You are, of course, welcome to visit either the faculty or TA office hours for help. As usual, one of the first things we'll ask for is your design documentation. You may make changes to your documentation before the full phase 3 handin; however, the design portion of your grade will depend heavily on the design document you handin on April 2^nd.

Your design documentation should typically be 5-6 pages for a project of this size. It should include the basic design of your software (what modules will you write, where will you make changes to the kernel, what data structures you will use, which format will be used for storing the logs etc.), a timeline, as well as details on the testing that you plan to do to ensure that your code works.

The assignment name to use with submit for the documentation is p2doc, and for the project code is p2code.

Rules for Collaboration

You are allowed to discuss ideas your classmates. No code exchange is allowed. If you use any external sources (like internet) you must cite them. Again, academic dishonesty will be sternly dealt with.

Grading the Project

The intent of the grading for the project is not to differentiate among those students who do a careful design and implementation of the assignments. Rather, the grading helps us identify those students who (i) don't do the assignments or (ii) don't think carefully about the design, and therefore end up with a messy and over-complicated solution. Remember that you can't pass this course without at least making a serious attempt at each of the assignments. Further, the grading is skewed so that you will get substantial credit, even if your implementation doesn't completely work, provided your design is logical and easy to understand. This means that you should first strive to come up with a clean design of your project on paper. Second, don't try to add fancy features because some other group is!

The grading for the project will be as follows: 40% design, 60% implementation. We have structured the grading in this way to encourage you to think through your solution before you start coding. If all you do is to work out a detailed design for what you would do to address the assignment (and if the design would work!), but you write no code, you will still get almost half of the credit for the assignment. The implementation portion of the grade considers whether you implemented your design, ran reasonable test cases, and provided documentation that the TA could understand. Part of being a good computer scientist is coming up with simple designs and easy to understand code; a solution which works isn't necessarily the best that you can do. Thus, part of the design and implementation grade will be based on whether your solution is elegant, simple, and easy to understand.

Helpful Hints

You may decide to log the calls from all the processes and then let the user program to filter out the system calls made by other processes or you can only log the system calls made by some specific process. Remember that you are not allowed to "hard code" the pid. Do not hard code the name of the log file either.
There are a few ways to log the system calls. There is a simple mechanism called ptrace which allows tracing the system calls. You are NOT allowed to use it. This project must be in the kernel. We list couple of ways to do so. If you want to do something else, you must verify with Dr. Joshi before proceeding.
You can implement a loadable kernel module that replaces the function pointer for interrupt 0x80 in the Interrupt Descriptor Table with your own function pointer. (For linux on i386 architecture, system calls are invoked by a software instruction which generates interrupt 0x80.) The normal system call handler function is defined in the file entry.S with the label "ENTRY(system_call)". The replacement interrupt handler will do the logging and then call the original function. To perform the logging, the module must be implemented as a character device that allocates a buffer (use kmalloc and kfree), and use ioctl() to send the stream to a user space program. The user program does the logging. With this approach, do NOT monitor the system calls made by all the processes. (Why ?)
Another way is to modify the system call handler itself. Whenever a system call is made, system_call function defined in entry.S is called. The eax register has the system call number and it is used to index into the system call table and call the function that implements the system call. Modify system_call() function to call your logging function. Remember not to use fopen(), fread(), fwrite() functions (or open() and friends) to do the logging. Use open_namei function to open the file. This function returns a file descriptor. You have to figure out the exact details about "writing" the log file.
When plugging in your code inside assembly code, don't forget to save the registers before you make any changes and to restore them after you are done.
Linux Cross-Referencing project is a very good place to browse the kernel sources.

More Helpful Hints

Do not use the mechanisms used to copy user space buffers to kernel buffers when copying data from kernel space buffer to another kernel space buffer.
On x86 platforms Linux uses FS register to keep the data segment register with which access user segment register. If a system call is made and the kernel needs to copy data from user space then FS will have the value USER_DS. In order to use the function to copy/return data from/to kernel space, FS must have value KERNEL_DS.