Main.scm Overview - scm concepts and structure (by Xentor)
The first thing you need to understand is that MAIN.SCM is NOT just a normal script, in the most well-known sense of the word. It is NOT a text file, and the commands are neither structured nor readable. The opcodes in main.scm are closer to assembly language (ASM) than anything else.
So for those of you who don't know much about low-level programming, here's how it works. The code part of the program is basically nothing but a list of opcodes with parameters. Each opcode is a single command or function call. An example might be "x = y" or "jump to x" or any of a thousand other codes. The problem, of course, is that we don't know what every opcode is. When you look at the SCM, all you see are 2-byte integers, each of which corresponds to a particular opcode.
In addition to opcodes, we have a large number of variables. If you've programmed before, you definitely know what a variable is. You probably refer to them as x, y, count, index, etc. Well, since MAIN.SCM is a compiled program, the variables are nothing but memory addresses. We can track a particular variable through the program, but we don't know the original name of it.
So let's recap. We have a lot of integers that correspond to opcodes, but we don't know what many of those opcodes are. We have a lot more integers that correspond to variables, but we don't know what many of those variables mean. This is what has to change. There is a small group of people who are putting a great deal of time into figuring these things out, and more help is always appreciated.
| Blank Space | Object Names | Thread Pointers | Program Code |
As you see, the entire SCM isn't just code. It's divided into four segments, three of which are useful to us. Segment 1 is mainly blank space, most likely for temporary variable storage in memory. Segment 2 is a list of object names. Segment 3 is a list of thread pointers. Segment 4 is the actual program code.
IMPORTANT NOTE: All numeric values are stored little-endian, aka least significant byte first! If your system operates in big-endian, you will have to reverse byte orders before reading!
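If you're reading the file from Python, the byte order can be spelled out explicitly so it doesn't matter what your own system uses. This is only a minimal sketch: the helper names `read_u16` and `read_i32` and the sample bytes are made up for illustration.

```python
import struct

def read_u16(data, offset):
    """Read an unsigned 2-byte integer, least significant byte first."""
    return struct.unpack_from("<H", data, offset)[0]

def read_i32(data, offset):
    """Read a signed 4-byte integer, least significant byte first."""
    return struct.unpack_from("<i", data, offset)[0]

# Made-up sample: a 2-byte value followed by a 4-byte value.
sample = bytes([0x05, 0x00, 0x10, 0x27, 0x00, 0x00])
opcode = read_u16(sample, 0)  # bytes 05 00 are read as &H0005
value = read_i32(sample, 2)   # bytes 10 27 00 00 are read as &H00002710
```

The `<` in the format string forces little-endian, so the same code gives the same answer on a big-endian machine.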
Segment 1: Blank Space
As the description suggests, this section is basically all zeroes. Seems pretty useless, right? Well, for us, it is. Most people think the SCM is loaded into memory as one large block, and that empty space is for temporary variable storage. All we need is the very beginning of the file, which contains a JUMP code with the address of the object segment (Segment 2).
Segment 2: Object Names
This segment, like the first, begins with a jump code. This jump code, when followed, will take you to the thread segment (Segment 3). Of course, there's more here than a jump code! Following the jump code is a single null byte (00), and immediately following that is a 4-byte integer containing a number (almost 190 in the original SCM). This is the number of objects you will find in the segment! Following that, you'll find a list of that many 24-byte strings, each of which is the name of an object. You'll notice the first 24 bytes are zeroes. We speculate that the object array used in the original code was indexed from zero, but they started using elements at index 1, so those first 24 bytes are the unused Object[0]. This IS included in the element count (the preceding 4 bytes), so keep that in mind! Also, notice that the strings that don't fill the entire 24-byte space are terminated with zeroes (they still use the entire field, but the rest is filled with zeroes). Some languages have problems with that, so if you're decompiling, you might have to load the strings in one character at a time until you reach a zero.
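Here is a rough sketch of reading the object list in Python. The function name `read_object_names` and the `count_offset` argument are my own inventions: the caller is assumed to have already found the position of the 4-byte count (just past the jump code and the null byte). The object names in the test data are made up too.

```python
import struct

NAME_SIZE = 24  # every object name occupies a fixed 24-byte field

def read_object_names(data, count_offset):
    """Return the object names as a list, including the empty Object[0]."""
    (count,) = struct.unpack_from("<I", data, count_offset)
    names = []
    pos = count_offset + 4
    for _ in range(count):
        field = data[pos:pos + NAME_SIZE]
        # Keep only the characters before the first zero byte.
        text = field.split(b"\x00", 1)[0]
        names.append(text.decode("ascii", errors="replace"))
        pos += NAME_SIZE
    return names

# Made-up data: a count of 3, the all-zero Object[0], then two names.
demo = (struct.pack("<I", 3)
        + bytes(NAME_SIZE)
        + b"barrel".ljust(NAME_SIZE, b"\x00")
        + b"crate".ljust(NAME_SIZE, b"\x00"))
object_names = read_object_names(demo, 0)
```

Cutting each field at the first zero byte does the same job as loading the string one character at a time.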
Segment 3: Thread Pointers
The threads are organized similarly to the object names. First, you have a JUMP code which will lead you to the code segment, then, again, a single null byte. Following that is a 4-byte integer referencing the location of the intro thread. Next, there is a 4-byte integer that, as was recently deduced, may have something to do with the size of the main code segment (from the beginning of the code segment to the first thread reference; see the Jump Addresses section for more info). After that, you have a 4-byte integer with the number of thread pointers, and then a list of that many 4-byte pointers. Yes, it's another array. We'll discuss the use of this segment later on, in the Jump Addresses section. Interestingly, the "Intro" pointer is duplicated in the first element of the thread array.
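The layout described above could be read along these lines. Treat it as a sketch: `read_thread_header`, the dictionary keys, and the `offset` argument (assumed to point just past the jump code and null byte) are names I picked, and the numbers in the test data are invented.

```python
import struct

def read_thread_header(data, offset):
    """Read the intro pointer, the size-related value, and the thread pointer array."""
    intro, size_hint, count = struct.unpack_from("<III", data, offset)
    # The pointer array starts right after the three 4-byte integers.
    pointers = list(struct.unpack_from("<%dI" % count, data, offset + 12))
    return {"intro": intro, "size_hint": size_hint, "pointers": pointers}

# Invented numbers: intro at 100, size value 5000, three thread pointers.
demo = struct.pack("<III3I", 100, 5000, 3, 100, 200, 300)
threads = read_thread_header(demo, 0)
```

In this made-up data the intro pointer matches the first array element, mirroring the duplication noted above.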
Segment 4: Program Code
Welcome to the segment that most of the hard work has gone into. This is the actual mission script in compiled form. In a way, this is the most straightforward of all the segments, since it's merely a list of opcodes that keeps going until the end of the file. Simple, huh?
| 05 00 | 02 | 08 00 | 06 | 01 00 |
| Opcode: Variable Assignment | P-Type: Variable | VarID: $0008 | P-Type: Fixed-Point | Value: 0.0625 |
| Result: $0008 = 0.0625 |
Looks pretty complicated, huh? Well, it's not that bad once you understand it. As you can see, the first two bytes represent the op-code itself. In this case, it shows &H0005 (Little-endian, so reverse the bytes). Oh, in case you're wondering, "&H" is used to symbolize a hexadecimal number (Base 16). If you're not used to working in hex, you better learn quick!
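To make the example concrete, here are the same eight bytes decoded by hand in Python. One assumption to flag loudly: the fixed-point value is divided by 16 here, which is simply what the example implies (raw value 1 giving 0.0625), so don't take the scale as confirmed. The variable names are mine.

```python
import struct

data = bytes([0x05, 0x00, 0x02, 0x08, 0x00, 0x06, 0x01, 0x00])

(opcode,) = struct.unpack_from("<H", data, 0)  # 05 00 -> &H0005
ptype_a = data[2]                              # 02 -> variable
(var_id,) = struct.unpack_from("<H", data, 3)  # 08 00 -> variable $0008
ptype_b = data[5]                              # 06 -> fixed-point
(raw,) = struct.unpack_from("<h", data, 6)     # 01 00 -> raw value 1

# ASSUMPTION: scale of 1/16, inferred only from this one example.
value = raw / 16.0

line = "$%04X = %s" % (var_id, value)
print(line)  # prints "$0008 = 0.0625"
```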
So how did I take opcode &H0005 and figure out that it stands for variable assignment? Well, that's where SCM.INI comes in. You won't find this in your GTA3 directory, of course, since it was made by the group cracking main.scm. Check the links section at the bottom for places to download this all-important file. The SCM.INI is basically a list of opcode and variable numbers, and what they correspond to. For instance, in the scm.ini, you'll find this line near the beginning:
0005=2,%1d% = %2d%;
So what does this mean? Well, the first four characters, "0005", are obviously the opcode. After that is an equals sign (=), then a number. This number (2 in this case) is the number of parameters this opcode uses (I'll go into more detail on parameters in the next section). After that is a text string with the meaning of the opcode. The %1d% and %2d% are where the variables should be placed when the code is translated. The "d" means a number or variable, so %1d% is replaced by the first parameter, read in this case as a variable.
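A decompiler might handle an opcode line roughly like this. The function names `parse_opcode_line` and `render` are hypothetical, and the pattern assumes every placeholder looks like %1d% (a number, one lowercase letter, wrapped in percent signs); only the "d" form is actually described here.

```python
import re

def parse_opcode_line(line):
    """Split '0005=2,%1d% = %2d%;' into (opcode, parameter count, template)."""
    key, rest = line.split("=", 1)         # split at the first "=" only
    count, template = rest.split(",", 1)   # split at the first "," only
    return int(key, 16), int(count), template

def render(template, params):
    """Replace each %Nx% placeholder with the Nth parameter (counted from 1)."""
    return re.sub(r"%(\d+)[a-z]%",
                  lambda m: params[int(m.group(1)) - 1],
                  template)

opcode, count, template = parse_opcode_line("0005=2,%1d% = %2d%;")
text = render(template, ["$0008", "0.0625"])
print(text)  # prints "$0008 = 0.0625;"
```

The template is kept exactly as it appears in the line, trailing ";" included. Because the count goes through `int()`, a value of "-1" parses without trouble.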
Of course, what is variable $0008? Well, as of when I write this, we don't know yet. When we figure it out, you'll find it in the second section of the scm.ini file. Variable definitions look like this:
0210=player_char
A bit simpler than the opcodes. The part before the "=" is the variable ID (0210 for this variable), and the second part is the name. So whenever we see variable $0210, we replace it with $player_char, so we know what's going on. $0008 is unknown, so we just leave it as-is.
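The lookup itself is about as simple as it sounds. In this sketch the IDs are kept as the 4-character strings found in the file, so nothing is assumed about how they are numbered; `load_variable_names` and `label` are names I made up.

```python
def load_variable_names(lines):
    """Build a lookup table from lines such as '0210=player_char'."""
    names = {}
    for line in lines:
        if "=" in line:
            var_id, name = line.split("=", 1)
            names[var_id.strip()] = name.strip()
    return names

def label(var_id, names):
    """Use the known name if there is one, otherwise keep the raw ID."""
    return "$" + names.get(var_id, var_id)

names = load_variable_names(["0210=player_char"])
known = label("0210", names)    # "$player_char"
unknown = label("0008", names)  # "$0008", left as-is
```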
Keep in mind there is no terminator for most opcodes. After the last parameter, the next code starts immediately!
Of course, there are special cases. When the scm.ini shows the number of parameters as "-1", it means the opcode takes a variable number of parameters. Annoying, isn't it? Well, in this case, all you do is keep reading parameters, one at a time, until you reach a parameter of type 0. That byte is the terminator, and the next opcode starts on the following byte.
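The "keep reading until type 0" loop might look like the sketch below. Big caveat: `PARAM_SIZES` is a stand-in table holding only the two P-Types from the earlier example (02 and 06, two data bytes each). A real decompiler needs the full list of P-Types, and `read_variable_params` is a hypothetical name.

```python
# ASSUMPTION: only the two P-Types seen in the earlier example, 2 data bytes each.
PARAM_SIZES = {0x02: 2, 0x06: 2}

def read_variable_params(data, offset):
    """Read parameters until a type-0 byte; return (params, next opcode offset)."""
    params = []
    pos = offset
    while True:
        ptype = data[pos]
        pos += 1
        if ptype == 0:
            # The terminator byte has been consumed; the next opcode starts here.
            return params, pos
        size = PARAM_SIZES[ptype]
        params.append((ptype, data[pos:pos + size]))
        pos += size

# Two parameters, the terminator, then the start of the next opcode (05 00).
demo = bytes([0x02, 0x08, 0x00, 0x06, 0x01, 0x00, 0x00, 0x05, 0x00])
params, next_opcode_at = read_variable_params(demo, 0)
```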
NOTE: Negative opcodes (&H8xxx) stand for negation. In this case, add 32768 to make it positive, and set a NOT flag. This is used within the comparisons of IF statements to specify "NOT".
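In code, the NOT handling comes down to the top bit. If you read the opcode as an unsigned 16-bit value, clearing that bit gives the same result as adding 32768 to the signed value. The name `split_opcode` and the sample opcode &H8038 are only for illustration.

```python
import struct

def split_opcode(raw):
    """Return (base opcode, NOT flag) for an unsigned 16-bit opcode value."""
    negated = bool(raw & 0x8000)   # top bit set means "negative"
    return raw & 0x7FFF, negated   # clear the top bit to get the real opcode

base, is_not = split_opcode(0x8038)

# The same thing the long way: read as signed, then add 32768.
(signed,) = struct.unpack("<h", bytes([0x38, 0x80]))
long_way = signed + 32768
```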
So, you're probably wondering how parameters are stored, right? Well, this is pretty simple. The parameter type (P-Type in the example in the previous section) is one byte that tells you what kind of data is stored in the following byte(s), and how many bytes this parameter occupies. Here's what the different P-Types mean:
Jump and jump_if_false opcodes have an address pointer as a parameter. This, of course, refers to another location in the file. It simply means that the next instruction (opcode) to process is at that location in the SCM file.
The issue, of course, is negative pointers, and how to process them. I don't yet understand the reasoning behind the following method, so I won't try to explain why it's done this way. This method, courtesy of CyQ (But paraphrased), seems to work for all negative pointers:
The negative jump addresses have now been explained. We now realize that the original mission script was not one long file, but many smaller ones. The code segment of the SCM is divided into a number of small chunks, one for each story mission, RC mission, and 4x4 mission, and one for each sub-mission type (taxi, paramedic, vigilante, fire). These, along with the intro thread, number 80 in the original file. The thread segment is actually a sort of file manifest, listing the original script files and their locations in the code segment. A negative jump is simply a jump within the current segment, designed that way so in-segment jumps would not have to be recalculated when the chunks are moved or resized. Instead, the jumps are recorded relative to the beginning of the current segment, so once negated can be added to the segment header to deduce the actual address. Future decompilers may involve splitting the code segment into its original components, in which case those addresses will become absolute.
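Put into code, the rule from the paragraph above is short. This is a sketch under the assumption that you already know where the current chunk starts (for instance from the thread pointer list); `resolve_jump` and `chunk_start` are hypothetical names and the numbers are invented.

```python
def resolve_jump(address, chunk_start):
    """Turn a jump parameter into an actual position in the SCM file."""
    if address < 0:
        # Negative: negate it and add it to the start of the current chunk.
        return chunk_start + (-address)
    # Zero or positive: used directly as a position in the file.
    return address

inside_chunk = resolve_jump(-120, 50000)  # 50000 + 120
direct = resolve_jump(4096, 50000)        # unchanged
```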
There are other files that the SCM uses to find information. The two used most often are the IDE file (GTA3\data\default.ide) and the GXT file (GTA3\TEXT\yourlanguage.gxt). These are used to translate object names and long text messages, respectively. Note that when you select English as your language, GTA3 actually uses "american.gxt", NOT "english.gxt". The "english" file seems to be nothing but an incomplete version of "american". No offense to the British, but that's just how Rockstar did it.
This reference does not deal with the formats of those files, so if you want more information, visit the File Formats section.
Everyone has their own way of programming, and we use many different languages, so here I'll post different methods of decompiling in different languages, hopefully from different people's points of view. If you have your own method, better or worse, in any language, see the contact section and let me know.