What is data and what is code? If you know a little about computers, this appears to be an easy question to answer. Code is what we run (programs) and data is what we feed the programs with (images or text files or whatever). This seems obvious and clear-cut.
Turns out it isn’t.
In fact, the dominant computer architecture, the von Neumann architecture, pretty much relies on the fact that code and data are one and the same. The main thing here is that code and data are stored in the same memory space. A major benefit of that is flexibility; if you have small programs and a lot of data, you can use the same computer as when you have a large program without any data.
It also allows programs to create and/or modify code on the fly. Computer viruses employ this trick to evade detection by mutating, but its main impact on computing is something much, much better. Namely, the concept of compilers.
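To make "creating code on the fly" a bit more concrete, here is a minimal sketch in Python. The names make_adder_source, add_n and add_five are made up for this illustration. The point is only that the function starts its life as an ordinary string, which is plain data, and becomes something we can call once it has been handed to Python's built-in compile and exec.

```python
# Build the text of a function as an ordinary string (plain data).
def make_adder_source(n):
    return f"def add_n(x):\n    return x + {n}\n"

source = make_adder_source(5)   # at this point it is just a str
print(type(source).__name__)    # str

# Hand the string to Python's built-in compile/exec: now it is code.
namespace = {}
exec(compile(source, "<generated>", "exec"), namespace)
add_five = namespace["add_n"]

print(add_five(10))             # 15
```

Nothing about the string changed between the two steps; only the way the program treats it did.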
The idea behind a compiler is that computers are annoying as hell to instruct. The instructions have to be presented to the processor in number form (this is called machine code). Since the whole point of a programmable computer is to make it as flexible as possible, these instructions are very, very generic. We’re talking things like “move this byte to this memory cell” and “add this number to this number”, not anything even close to something as seemingly trivial as “write this number on the screen.”
In the beginning, this wasn’t that big a problem. Programs were very short, because computer memory didn’t allow for more. But as programs grew larger and more complex, they became harder and harder to plan and organize, with mistakes growing ever more difficult to find and correct. That’s when programming languages started to gain popularity.
But what is a programming language? Well, it’s sort of a compromise between machine code and something humans can easily understand. It also offers a higher level of abstraction. Instead of saying “move this number here and this number here and then start executing this code there and then go there…” and so on, programmers can just write things like:
    PRINT "What's your name?"
    INPUT NAME$
    PRINT "Hello" + NAME$
The program written in such a human-readable form is called the “source code”, because it is then fed as source data into a program that translates it into machine code. This program is a compiler, and the process is called compiling.
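As a rough illustration of source code being fed into a translator as data, here is a small Python sketch. One caveat: Python's built-in compile produces bytecode for Python's own virtual machine rather than machine code for the processor, so this is only an analogy to a real compiler. The variable names (source, code_obj, raw, env) are invented for the example.

```python
# The "source code" is just text, i.e. data, as far as the compiler is concerned.
source = 'greeting = "Hello " + name'

# Translate it. The result wraps a sequence of plain numbers (bytes).
code_obj = compile(source, "<example>", "exec")
raw = code_obj.co_code

print(type(source).__name__)   # the input is a str
print(type(raw).__name__)      # the output is bytes
print(list(raw[:8]))           # the first few numbers; they vary by Python version

# Only when we run the translated form does it behave like a program.
env = {"name": "Ada"}
exec(code_obj, env)
print(env["greeting"])
```

The exact numbers printed depend on the Python version, which is why none are shown here.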
Besides being easier to write and maintain, source code has an added bonus: it is portable. Being portable here means that it’s easy to move across platforms. A platform is the environment in which the program runs: a combination of operating system and hardware. With the compiler concept you add an abstraction layer, making the program runnable on any system which provides a compiler for it, at the added, one-time cost of having to recompile it for that system.
Code Or Data?
But wait a second. The compiler treats the source code as data. Is it still code? I mean, it’s not runnable in its current form, is it? Well… in fact it kind of is. It depends on what you mean by “run.”
Look at it from the viewpoint of the processor. From there, everything looks like numbers. Everything. Even the programs we run are just long strings of numbers. So one could argue that the only true program in a computer is the hardware. The rest is just data telling the processor what to do.
So imagine if we made a processor that could read the source code of some programming language (like the BASIC I used in the example) and execute it right away. Here the source code would act like code and not data.
This is sometimes done in the software layer. There is a type of application called an interpreter. As the name suggests, it interprets source code of a certain language and runs it, without first compiling it. What is actually happening here is that the interpreter acts as a virtual computer having the selected programming language as its native machine code.
And what if, on that virtual machine, we built another virtual machine that supported C or Java or some other programming language… It soon becomes clear it’s just a matter of perspective as to when the code is data and when it is, in fact, code.
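Here is a toy sketch of such an interpreter in Python, handling just enough to run the three-line BASIC example (with a comma and a space added after "Hello" so the output reads naturally). It is nowhere near a real BASIC: the functions evaluate and run are my own invention, the only statements supported are PRINT and INPUT, and a "+" inside a string literal would confuse it. Input is passed in as a list rather than read from the keyboard, to keep the example self-contained.

```python
def evaluate(expr, variables):
    """Evaluate string literals and variable names joined by '+'."""
    result = ""
    for term in expr.split("+"):
        term = term.strip()
        if term.startswith('"') and term.endswith('"'):
            result += term[1:-1]          # string literal: drop the quotes
        else:
            result += variables[term]     # variable: look up its value
    return result


def run(program, inputs):
    """Read the source line by line and act on it directly, no compiling."""
    variables = {}
    output = []
    pending = iter(inputs)                # stands in for the keyboard
    for line in program.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "PRINT":
            output.append(evaluate(rest, variables))
        elif keyword == "INPUT":
            variables[rest.strip()] = next(pending)
        else:
            raise ValueError(f"unknown statement: {line}")
    return output


program = """
PRINT "What's your name?"
INPUT NAME$
PRINT "Hello, " + NAME$
"""

print(run(program, ["Ada"]))  # ["What's your name?", 'Hello, Ada']
```

From the point of view of run, the BASIC text is data it loops over. From the point of view of whoever wrote the BASIC, it is a program being executed.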
Data As Code
But here we arrive at a very interesting thought. If a program can be said to be data that tells another program what to do, then isn’t any data also a program? Let’s consider a simple image format.
We’ll be using greyscale for simplicity’s sake and further employ an extremely simple encoding scheme; let’s say that instead of one number per pixel, the image is stored as pairs of numbers, the first deciding how many times the value of the second number should be repeated. (This kind of scheme is known as run-length encoding.)
So if the first number is 10 and the second 3 it means we should repeat the number 3 ten times. The compression ratio would really suck if every other pixel was a different shade (the data would actually double in size), so some other mechanism would probably be put in to minimize the worst-case scenario. But for images containing long runs of same-shade pixels, it’s rather nice.
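A minimal sketch of this scheme in Python could look like the following. The function names rle_decode and rle_encode are my own, and pixels are assumed to be plain integers in a flat list. The last call shows the worst case mentioned above, where alternating pixels make the encoded form twice as long as the original.

```python
def rle_decode(data):
    """Expand (count, value) pairs into a flat list of pixel values."""
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of values (count, value pairs)")
    pixels = []
    for i in range(0, len(data), 2):
        count, value = data[i], data[i + 1]
        pixels.extend([value] * count)
    return pixels


def rle_encode(pixels):
    """Collapse runs of equal pixel values into (count, value) pairs."""
    data = []
    for value in pixels:
        if data and data[-1] == value:
            data[-2] += 1            # same value as the current run: bump its count
        else:
            data.extend([1, value])  # start a new run
    return data


print(rle_decode([10, 3]))                     # [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
print(rle_encode([0, 0, 0, 255, 255, 7]))      # [3, 0, 2, 255, 1, 7]
print(rle_encode([0, 255, 0, 255]))            # [1, 0, 1, 255, 1, 0, 1, 255]
```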
Anyhoo. If we now feed the decoder with this compressed data, we get a certain output, the decompressed data. Think about how the program does this. It reads numbers in a certain order, making decisions depending on what it reads. In this case, the only decision is how many times to repeat something, but we could have written a much more complex scheme that would perhaps interpret numbers as instructions to draw shapes in sizes defined by the following ten or so numbers.
The specifics aren’t important. What’s important is that the processor does the exact same thing to the machine code of the decoder program. It reads it as a string of numbers and acts differently depending on the numbers it reads.
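To see how close a decoder can get to being a tiny processor, here is a sketch of the more complex kind of scheme, where some numbers are instructions and the numbers after them are arguments. Everything about this format is invented for the illustration: the opcodes (0 = stop, 1 = horizontal line, 2 = filled rectangle), the 8 by 4 canvas, and the function name draw. There is no bounds checking, so it trusts its input.

```python
WIDTH, HEIGHT = 8, 4

def draw(program):
    """Read a list of numbers and act on them, one instruction at a time."""
    canvas = [[0] * WIDTH for _ in range(HEIGHT)]
    pc = 0  # position of the next number to read, like a program counter
    while pc < len(program):
        opcode = program[pc]
        if opcode == 0:        # stop
            break
        elif opcode == 1:      # horizontal line: row, col, length, shade
            row, col, length, shade = program[pc + 1:pc + 5]
            for c in range(col, col + length):
                canvas[row][c] = shade
            pc += 5
        elif opcode == 2:      # filled rectangle: row, col, height, width, shade
            row, col, height, width, shade = program[pc + 1:pc + 6]
            for r in range(row, row + height):
                for c in range(col, col + width):
                    canvas[r][c] = shade
            pc += 6
        else:
            raise ValueError(f"unknown opcode {opcode} at position {pc}")
    return canvas


# An "image file": a rectangle of shade 9, then a line of shade 5, then stop.
image_data = [2, 1, 1, 2, 3, 9,
              1, 0, 0, 8, 5,
              0]

canvas = draw(image_data)
for row in canvas:
    print(" ".join(str(v) for v in row))
# 5 5 5 5 5 5 5 5
# 0 9 9 9 0 0 0 0
# 0 9 9 9 0 0 0 0
# 0 0 0 0 0 0 0 0
```

The loop in draw is the same read-a-number, decide, act cycle described above, just one layer up from the hardware.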
So you could say that an MP3 file is a program written in a highly specialized language, running on the virtual machine layer of your favourite music player.
I find the thought extremely fascinating.