UMBC CMSC 313 -- Assembly Language Segment Previous | Next


The following article was on the Internet and I thought it was worthy for your time to read. It is offered without modification, except to reformat it as a web page instead of just text.

Writing A Useful Program With NASM

by Jonathan Leto

Version 1.0 - Sun Dec 17 17:46:38 EST 2000

Intro

Much fun can be had with assembly programming, it gives you a much deeper understanding about the inner workings of your processor and kernel. This article is geared towards the beginning assembly programmer who can't seem to justify why he is doing something as masochistic as writing an entire program in assembly language. If you don't already know one or more other programming languages, you really have no business reading this. Many constructs will also be explained in terms of C. You should also be familiar with the command line options of NASM, no sense going over them again here.

Getting Started

So you want to write a program that actually DOES something. "Hello, world" isn't cutting it anymore. First, an overview of the various parts of an assembly program: (For terse documentation, the NASM manual is the place to go.)

The .data section

This section is for defining constants, such as filenames or buffer sizes, this data does not change at runtime. The NASM documentation has a good description of how to use the db, dd, etc. instructions that are used in this section.

The .bss section

This section is where you declare your variables. They look something like this:
filename:	resb	255 	; REServe 255 Bytes
number:		resb	1	; REServe 1 Byte
bignum:		resw	1	; REServe 1 Word (1 Word = 2 Bytes )
longnum:	resd	1	; REServe 1 Double Word 
pi:		resq	1	; REServe 1 double precision float
morepi:		rest	1	; REServe 1 extended precision float

The .text section

This is where the actual assembly code is written. The term "self modifying code" means a program which modifies this section while being executed.

In The Beginning ...

The next thing you probably noticed while looking at the source to various assembly programs, there always seems to be "global _start" or something similar at the beginning of the .text section. This is the assembly program's way of telling the kernel where the program execution begins. It is exactly, to my knowledge, like the main function in C, other than that it is ot a function, just a starting point.

The Stack and Stuff

Also like in C, the kernel sets up the environment with all of the environment variables, and sets up **argv and argc. Just in case you forgot, **argv is an array of strings that are all of the arguments given to the program, and argc is the count of how many there are. These are all put on the stack. If you have taken Computer Science 101, or read any type of introductory computer science book, you should know what a stack is. It is a way of storing data so that the last thing you put in is the first that comes out. This is fine and dandy, but most people don't seem to grasp how this has anything to do with their computer. "The stack" as it is ominously referred too, is just your RAM. That's it. It is your RAM organized in such a way, so that when you "push" something onto "The stack", all you are doing is saving something in RAM. And when you "pop" something off of "The stack", you are retrieving the last thing you put in, which is on the top.

Ok, now let's look at some code that you are likely to see.

        section	.text		; declaring our .text segment
global	_start	; telling where program execution should start

_start: 		; this is where code starts getting exec'ed
	pop	ebx	; get first thing off of stack and put into ebx
	dec	ebx	; decrement the value of ebx by one
	pop	ebp	; get next 2 things off stack and put into ebx
	pop	ebp
What does this code do? It simply puts the first actual argument into the ebx register. Let's say we ran the program on the command line as so: $ ./program 42 A When where are on the _start line, the stack looked something like this:
-----------
| 3       |	The number of arguments, including argv[0], 
|	  |	which is the program name
-----------
|"program"|	argv[0] 
-----------
| "42"    |	argv[1] NOTE: This is the character "4" and "2",
|	  |     not the number 42
-----------
| "A"     |	argv[2]
-----------
So, the first instruction, "pop ebx", took the 3, and put it into ebx. Then we decrement it by one, because the program name isn't really an argument.

Depending on if you need to later use the argument count later on, you will see other arguments put into either the same register or a different one.

Now, "pop ebp" puts the program name into ebp, and then the next "pop ebp" overwrites it, and puts "42" into ebp. The last value of ebp is not preserved, and since you have popped it off of the stack, it is gone forever.

Doing more interesting things

Moving on, how exactly do you interact with the rest of the system? You know how to manipulate the stack, but how to you get the current time, or make a directory, or fork a process, or any other wonderful thing a Unix box can do? I am pleased to introduce you to the "system call". A system call is the translator that lets user-land programs (which is what you are writing), talk to the kernel, who is in kernel-land, of course. Each syscall has a unique number, so that you can put it into the eax register, and tell the kernel "Yo, wake up and do this", and it hopefully will. If the syscall takes arguments, which most do, these go into ebx, ecx, edx, esi, edi, ebp, in that order.

Some example code always helps:

	mov	eax,1		; the exit syscall number
	mov	ebx,0		; have an exit code of 0
	int	80h		    ; interrupt 80h, the thing that pokes the kernel
				    ; and says, "do this"
The preceding code is equivalent to having a "return 0" at the end of your main function. Ok, ok, still not very useful, but we are getting there.

A more useful example:

	pop	ebx		; argc
	pop	ebx		; argv[0]
	pop	ebx		; the first real arg, a filename 


	mov	eax,5		; the syscall number for open()
				; we already have the filename in ebx

	mov	ecx,0		; O_RDONLY, defined in fcntl.h

	int	80h		; call the kernel

				; now we have a file descriptor in eax

	test	eax,eax		; lets make sure it is valid
	jns	file_function	; if the file descriptor does not have the
				; sign flag ( which means it is less than 0 )
				; jump to file_function

	mov	ebx,eax		; there was an error, save the errno in ebx
	mov	eax,1		; put the exit syscall number in eax
	int	80h		; bail out

Now we are starting to get somewhere. You should be starting to realize that there is no black magic or voodoo in assembly programming, just a very strict set of rules. If you know how the rules work, you can do just about everything. Though I haven't tried it, I have seen network coding in assembly, console graphics ( intros! ), and yes, even X windows code in assembly.

So where do find out all of the semantics for all of the various system calls? Well first, the numbers are listed in asm/unistd.h in Linux, and sys/syscall.h in the *BSD's. To find out information about each one, such as what arguments they take and what values they return, look no further that your man pages! I will hold your hand in finding out about the next syscall we are going to use, read() .

"man read" didn't give you exactly what you wanted did it? That is because program manuals and shell manuals are shown before the programming manuals are. If you are using bash, you probably are looking at the BASH_BUILTINS(1) man page. To get to what you really want, try "man 2 read". Now you should be looking at sections like SYNOPSIS, DESCRIPTION, DESCRIPTION, ERRORS and a few others. These are the most important. Take a look at synopsis, it should look like:

ssize_t read(int fd, void *buf, size_t count); NOTE: ssize_t and size_t are just integers .

The first argument is the file descriptor, followed by the buffer, and then how many bytes to read in, which should be however long the buffer is. For the best performance, use 8192, which is 8k, as your count. Make your buffer a multiple of this, 8192 is fine. Now you know what to put in your registers. Reading the RETURN VALUE section, you should see how read() returns the number of bytes it read, 0 for EOF, and -1 for errors.

file_function:
	mov	ebx,eax		; sys_open returned file descriptor into eax
	mov	eax,3		; sys_read  
				; ebx is already setup
	mov	ecx,buf		; we are putting the ADDRESS of buf in ecx
	mov	edx,bufsize	; we are putting the ADDRESS of bufsize in edx	

	int 	80h		; call the kernel

	test	eax,eax		; see what got returned
	jz	nextfile	; got an EOF, go to read the next file
	js	error		; got an error, bail out

				; if we are here, then we actually read some bytes

Now we have a chunk of the file read ( up to 8192 bytes ), and sitting in what you would call an array in C. What can you do now? Well, the first thing that comes to mind is print it out. Wait a sec, there is no man page for printf in section 2. What's the deal? Well, printf is a library function, implemented by good ol' libc. You are going to have to dig a little deeper, and use write(). So now you looking at the man page. write() writes to a file descriptor. What the hell good does that do me? I want to print it out! Well, remember, everything in Unix is a file, so all you have to do is write to STDOUT. From /usr/include/unistd.h, it is defined as 1 . So the next chunk of code looks like:
	mov	edx,eax		; save the count of bytes for the write syscall
	mov	eax,4		; system call for write
	mov	ebx,1		; STDOUT file descriptor
				; ecx is already set up
	int	80h		; call kernel

	; for the program to properly exit instead of segfaulting right here
	; ( it doesn't seem to like to fall off the end of a program ), call 
	; a sys_exit

	mov	eax,1
	mov	ebx,0
	int	80h
What you have now just written is basically "cat", except it only prints the first 8192 bytes.

Portability

In the preceding section, you saw how the call the kernel in Linux with NASM. This is fine if you are never ever going to use another operating system, and you enjoy looking up the system kernel numbers, but is not very practical, and extremely unportable. What to do? There is a great little package called asmutils started by Konstantin Boldyshev, who runs http://www.linuxassembly.org . If you haven't read all of the good documentation on that site, that should be your next step. Asmutils provides an easy to use and portable interface to doing system calls in whichever Unix variant you use ( and even has support for BeOS.) Even if you aren't interesting in using these Unix utilities that are rewritten in assembly, if you want to write portable NASM code, you are better off using it's header files than rolling your own. With asmutils, your code will look like this:
	%include "system.inc"	; all the magic happens here

	CODESEG			; .text section
	
START:			; always starts here

	sys_write STDOUT,[somestring],[strlen]

	END			; code ends here
This is much more readable then doing everything by system call number, and it will be portable across Linux, FreeBSD, OpenBSD, NetBSD, BeOS and a few other lesser known OS's. You can now use system calls by name, and use standard constants like STDOUT or O_RDONLY, just like in C. The "%include" statement works precisely as it does in C, sourcing the contents of that file.

To learn more about how to use asmutils, read the Asmutils-HOWTO, which is in the doc/ directory of the source. Also, to get the latest source, use the following commands:

export CVS_RSH=ssh
cvs -d:pserver:anonymous@cvs.linuxassembly.org:/cvsroot/asm login
cvs -z3 -d:pserver:anonymous@cvs.linuxassembly.org:/cvsroot/asm co asmutils
This will download the newest, bleeding edge source into a subdirectory called "asmutils" of your current directory. Take a look at some of the simpler programs, such as cat, sleep, ln, head or mount, you will see that there isn't anything horrendously difficult about them. head was my first assembly program, I made extra comments on purpose, so that would be a good place to start.

Debugging

Strace will definitely by your friend. It is the easiest tool to use to debug your problem. Most of the time when writing in assembly, other that syntax errors, you will just get a segmentation fault. This provides you with a ZERO useful information. With strace, at least you will see after which system call your program is choking. Example:
	$ strace ./cal2
	execve("./cal2", ["./cal2"], [/* 46 vars */]) = 0
	read(1, "", 0)                          = 0
	--- SIGSEGV (Segmentation fault) ---
	+++ killed by SIGSEGV +++
Now you know to look after your first read system call. But it starts getting tricky when you have lots of pure assembly, which strace cannot show. That's when gdb comes into play. There is some very good information about using gdb and enabling debugging information in NASM in the Asmutils-HOWTO, so I won't reproduce it here. For a quick and dirty solution, you could do something like this: %define notdeadyet sys_write STDOUT,0,__LINE__ Now you can litter the source with notdeadyet's, and hopefully see where things are going astray with the help of strace. Obviously this is not practical for complex bugs or voluminous source, but works great for finding careless mistakes when you are starting out. Example:
	$ strace ./cal2
	execve("./cal2", ["./cal2"], [/* 46 vars */]) = 0
	write(1, NULL, 16)                      = 16
	write(1, NULL, 26)                      = 26
	write(1, NULL, 41)                      = 41
	--- SIGSEGV (Segmentation fault) ---
	+++ killed by SIGSEGV +++
Now we know that we are still going on line 41, and the problem is after that.

Next ?

Now it is your turn to explore the insides of your operating system, and take pride in understanding what's really going on under the covers.

Reference

Places to get more information:

Feedback

Feedback is welcome, hopefully this was of some use to budding Unix assembly programmers.

Availability

The most current version of this document should be available at http://www.leto.net/papers/writing-a-useful-program-with-nasm.txt .

Appendix : Jumps

When I first began looking at assembly source code, I saw all these crazy instructions like "jnz" and the like. It looked like I was going to have to remember the names of a whole slew of inanely named instructions. But after a while it finally clicked what they all were. They are basically just "if statements" that you know and love, that work off of the EFLAGS register. What is the EFLAGS register? Just a register with lots of different bits that are set to zero or one, depending on the previous comparison that the code made.

Some code to set the stage:

	mov	eax,82
	mov	ebx,69

	test	eax,ebx
	jle	some_function
What on earth is "jle"? Why it's "Jump if Less than or Equal." If eax was less than or equal to ebx, code execution will jump to "some_function", if not, it keeps chugging along. Here is a list which will hopefully shed some light on this part of assembly that was mysterious to me when I began. Some of these are logically the same, but are provided because is some situations one will be more intuitive than the other.

JumpMeaningSignedness (S or U)
jaJump if aboveU
jaeJump if above or EqualU
jbJump if belowU
jbeJump if below or EqualU
jcJump if Carry  
jcxzJump if CX is Zero 
jeJump if Equal 
jecxzJump if ECX is Zero 
jzJump if Zero 
jgJump if greaterS
jgeJump if greater or EqualS
jlJump if lessS
jleJump if less or EqualS
jmpUnconditional jump 
jnaJump Not aboveU
jnaeJump Not above or EqualU
jncJump if Not Carry 
jncxzJump if CX Not Zero 
jneJump if Not Equal 
jngJump if Not greaterS
jngeJump if Not greater or EqualS
jnlJump if Not lessS
jnleJump if Not less or EqualS
jnoJump if Not Overflow 
jnpJump if Not Parity 
jnsJump if Not signed 
jnzJump if Not Zero 
joJump if Overflow 
jpJump if Parity 
jpeJump if Parity Even 
jpoJump if Parity Odd 
jsJump if signed 
jzJump if Zero 


Previous | Next

©2004, Gary L. Burt