My game just went belly up. What's wrong?

From Victoria 1 Wiki

To start off with, no program to date that actually does something significant is completely free of errors. The games from Paradox are no exception to the rule, and will crash from time to time. What you need to understand here is that today's PCs are highly complex machines with all sorts of combinations of hardware, OS and driver software. It's just not possible (not even for Microsoft itself) to exhaustively test each and every combination. That can only be done (and even then only to a certain extent) on controlled platforms like an Apple Macintosh, a Sony PlayStation or a Microsoft Xbox.

Now to the failures themselves. Basically, there are three different kinds of failures.

  • At some point in the game, the game time stops advancing and 'pauses' indefinitely.
  • All of a sudden, the game stops abruptly, and you end up on the desktop.
  • All of a sudden, the PC locks up completely and becomes unresponsive to everything.

Now, let's examine each case in more detail.

The game clock stops

This is (most likely) caused by an internal routine that enters a loop (to search for something, for example) and never exits again. When this happens in the game engine part of the game (the part that deals with the game rules, updating the provinces, evaluating the AI, that stuff), the current game turn never finishes. Unlike a chess program, where the AI is simply cut off when the time has elapsed, Paradox games extend the game turn until all (AI) processing is complete. If one of the routines during this phase never exits, the current game turn never ends either. And that means that the game clock stops advancing. It should be treated as a game bug and, when reproducible, should be reported in the appropriate bug forum. Normal Windows behaviour is still possible, like using <alt><tab> to switch to the desktop.
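To make the mechanism concrete, here is a minimal sketch in Python. It is not the game's actual code; `find_target`, `run_turn` and the list-based province chain are hypothetical names invented for this illustration. It shows a search loop that never exits when the thing it looks for cannot be reached, and a step limit (an assumption of this sketch, not something the game is known to have) that would turn the endless loop into a reportable error.

```python
def find_target(provinces, wanted, max_steps=None):
    """Follow a chain of province links until `wanted` is reached.

    provinces[i] is the index of the province that follows province i.
    Without max_steps, a chain that cycles without ever reaching `wanted`
    keeps this loop running forever: the 'clock stops' symptom.
    """
    current = 0
    steps = 0
    while current != wanted:
        if max_steps is not None and steps >= max_steps:
            raise RuntimeError("search did not finish; the turn would hang")
        current = provinces[current]  # move on to the next province
        steps += 1
    return steps


def run_turn(provinces, wanted, max_steps):
    """Return True if the turn finished, False if the step limit tripped."""
    try:
        find_target(provinces, wanted, max_steps)
        return True
    except RuntimeError:
        return False
```

With the chain `[1, 2, 0]` and a target of 5, the search circles through provinces 0, 1 and 2 forever; only the step limit lets `run_turn` return at all.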

The game terminates suddenly

This is the most common type of failure. It is also known as a CtD, or Crash to Desktop.

What causes it, you may ask? Well, it can be caused by a lot of things, both under and outside of the game's control. What actually happens is standard Windows behaviour when what is known as an exception occurs. Exceptions are (mostly) fatal occurrences that prevent the application (our game in this case) from continuing. Other applications can also have these kinds of fatal interruptions. How an application responds is ultimately a choice for the programmer. When the application does nothing, the exception ultimately ends up in the Windows kernel. The kernel has only two ways of dealing with it. It can either show you a dialog box with a cryptic looking text and an <Ok> button, or it presents you with the dreaded BSOD (or blue screen of death). Either way, the application is dead and terminated.

For a game like the Paradox games, that is not a good solution. Letting the application die at the hands of Windows itself is a very bad idea. When that happens, none of the resources (except memory) that were claimed by the game are released. That means that DirectX remains active, sound buffers are still allocated, and so forth. Thus, the game engine contains a very rudimentary solution. The game engine captures the exception, and gives itself a more or less orderly way out. Since the game cannot continue running (that is, unfortunately, the nature of an exception), the only thing it can do is release all resources, close down DirectX, and quit. Now, it would have been nice if the game actually produced a (user friendly) message box stating why it closed down, but unless you are an expert or programmer, that would not mean much to you, the user.
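The 'capture the exception, release everything, quit' pattern described above can be sketched as follows. This is an illustration in Python, not the engine's real code: the `Resource` class, the `run_game` function and the resource names are all hypothetical, and a Python exception stands in for a Windows exception.

```python
class Resource:
    """Stand-in for something the game claims, such as a display or sound buffer."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        log.append(f"acquire {name}")

    def release(self):
        self.log.append(f"release {self.name}")


def run_game(main_loop, log):
    """Run main_loop and return an exit code, always releasing resources."""
    resources = [Resource("display", log), Resource("sound buffers", log)]
    try:
        main_loop()
        return 0
    except Exception as exc:
        # The game cannot continue, so all that is left is to note the
        # failure and fall through to the cleanup below.
        log.append(f"fatal: {type(exc).__name__}")
        return 1
    finally:
        # Runs on both the normal and the failure path, in reverse order
        # of acquisition.
        for res in reversed(resources):
            res.release()
```

The point of the `finally` clause is that the cleanup happens no matter how the main loop ends, which is exactly what is lost when Windows kills the process instead.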

The history of exceptions

Some 30 years ago, the first OS for PCs (then called micro computers) appeared, called CP/M. It was designed for computers with a maximum of only 64K of memory. Its OS consisted of three layers: low level hardware drivers (BIOS), management of disk contents (BDOS) and a command interpreter (CCP). Yes, Microsoft nicked the ideas for MS-DOS from CP/M. Now, CP/M contained only very rudimentary error handling. Its only response was 'BDOS error on x'. Then it waited for a keypress from the user, and once it received that, it rebooted the machine.

Later MS-DOS versions improved on that, and added the famous 'Abort, Retry, Ignore' options to a fatal error message. Choosing Abort would no longer reboot the entire machine, but dump you back to a command prompt. Most of the time you would still manually press <ctrl><alt><del> to reboot anyway, as a lot of things were left in an undefined (aka unusable) state after an abort.

One of the key things about this era is that there was absolutely no processor support to detect an application running amok. The code and internal data of the OS were fully exposed in memory, and could be altered at any given time. There was also no support to detect whether memory addresses actually contained RAM or ROM chips. Writing to a nonexistent location would do nothing, and reading from a nonexistent location would yield random data (mostly all 1's, but that was due to the implementation of the memory controller).

In the mid 1980s, Windows entered the arena. The first versions (up to 2.03) ran directly on top of MS-DOS and thus offered no protection against problems either. In fact, the systems became more vulnerable to applications running amok, because there were now larger amounts of RAM containing vital, systemwide data, wide open to random access from runaway applications.

At the same time, Intel released the 80286 processor. This was the first PC processor containing some level of protection (hence the 'protected mode' name). Memory was no longer accessed directly, but routed through so-called segment descriptors. Each application (or task) could get its own, private set of segment descriptors. Manipulation of the memory ranges these descriptors referred to was restricted to tasks running at ring 0 protection level (the most privileged level possible). Applications were intended to run at ring 3. So, if the OS was designed to take advantage of these features, no single application could run amok in memory areas belonging to the kernel or another application. OS/2 2.0 and higher indeed took full advantage of this. MS Windows, however, did not.

By the time Windows 3 came out, the 80386 had been introduced. And with it came support for virtual memory. It also meant that there was finally a way for the processor to detect access to nonexistent memory locations (as this was the basis for the virtual memory logic). Microsoft was also working on Windows NT, which, just like its OS/2 counterpart, would contain full protection on all fronts. However, IBM's programmers appeared to be better than those from Microsoft. OS/2 could perform quite well on the hardware of those days, while the first release of Windows NT was unacceptably slow. That prompted Microsoft to remove most of the protection from later versions of Windows NT and Windows 9x.

What remained was the detection of access to nonexistent memory. That would trigger an 'Unrecoverable Application Error' (or UAE for short) in Windows 3.x, after which the application simply terminated. With the introduction of Windows 95, Microsoft claimed it had eradicated those pesky UAEs. Well, taken literally, that was correct. They had simply replaced it with the term 'General Protection Fault' (or GPF for short), later followed by the term 'Access Violation' (or AV). And for the upcoming Longhorn release (slated for 2007), they once again promise to eliminate the AV. No doubt it will simply be replaced by another term.

Anyway, no matter the term used, an exception is always fatal for the application that caused it. After pressing the <ok> button, the application is terminated. And, depending on what else has been corrupted in our badly protected environment called 'MS Windows', an entire reboot either occurs automatically or is required to regain normal control over the desktop.

However, one thing did change with the introduction of Windows 95 and NT 4.0. Microsoft added support for application level exception handling. When an exception is triggered by the processor which the OS cannot or should not handle internally (for example, regular page faults are dealt with by the kernel's memory manager), it is first routed as a structured exception to the active application. Only when the application does nothing with the exception does Windows fall back to its built-in default behaviour by showing the message box/BSOD, and terminate the application.
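The routing order just described (kernel first, then the application, then the Windows default) can be modelled in a few lines. This is a toy model in Python, not the Windows API: `dispatch`, the fault strings and the handler convention (return True when handled) are all assumptions made for the illustration.

```python
PAGE_FAULT = "page fault"
ACCESS_VIOLATION = "access violation"


def dispatch(fault, app_handler=None):
    """Decide who ends up dealing with a processor exception."""
    # Regular page faults are resolved by the kernel's memory manager and
    # never reach the application.
    if fault == PAGE_FAULT:
        return "kernel: page loaded, program continues"
    # Everything else is offered to the application first.
    if app_handler is not None and app_handler(fault):
        return "application: handled"
    # Nobody took it: the built-in default behaviour applies.
    return "default: message box, application terminated"


def game_handler(fault):
    """Hypothetical handler in the spirit of the game: clean up, then quit."""
    return True
```

An application that installs no handler at all, or whose handler declines the exception, ends up on the last line: the message box and termination.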

Exceptions on modern day PCs

On a modern day Pentium based PC, there are a couple of other exceptions besides the Access Violation that can happen. All of those are essentially treated the same by our game. The game aborts, and returns control to the desktop. The most common ones are discussed below. Since the game does not show you (without the help of an advanced debugger) exactly what kind of exception caused the CtD, you have to guess what it was. The list below is sorted by frequency of occurrence, most common first.

Access Violation

The processor was instructed to access a piece of information at a memory location that either does not exist, or that the current process has no access to (for example, because it belongs to a different process). It invariably means that there is a bug somewhere, because normally this should never happen. Accessing memory through an uninitialized pointer can cause this, as can accessing memory through a pointer to memory that has previously been released back to Windows. Now, the question that remains is whether this is caused by the game code (and thus is a bug in the game) or by a driver. If the problem is reproducible in the game (as in: load a save game, do the same steps over and over again, and each time it crashes at the same spot), then it's most likely a problem in the game, and should be reported in the appropriate bug forum.

Page error

The processor was instructed to access a piece of memory that is registered as being stored in a swap file, but for some reason the virtual memory manager could not load it (back) into main memory. It can indicate either a corruption of the paging tables (a malfunctioning device driver can cause this, for example), or that the system is low on pageable physical RAM. On Windows 9x platforms (because of the not so monkey proof implementation of the memory manager), Page errors can be reduced or even eliminated completely by installing more RAM.

No memory left

An attempt to allocate a chunk of memory has failed, most likely because available RAM has been exhausted. Not having enough physical RAM, combined with an almost full hard disk partition holding the swapfile, can cause this. It can indicate a memory leak in the game or another application, or simply that too many applications are open.

Invalid handle

An attempt was made to call a Windows function with a handle that (no longer) exists. Most Windows API functions perform a (limited) sanity check on the parameters they receive from the calling application. When something doesn't add up here, this exception can follow. It usually indicates a failure in the application that called the Windows function. Again, if this is reproducible, it should be reported in the appropriate bug forum.

If you are using Windows 9x, then there may be a second reason for this type of exception. On a Windows 9x system, there is a limited amount of memory reserved for allocating Windows resources. Those are the things these handles normally refer to. A fixed amount of two times 64 KB (yes, you read that correctly: kilobytes) is set aside systemwide for storing resources like icons, mouse cursors, edit boxes, menu bars and what not. Having lots of applications open will quickly exhaust this limited amount of RAM, causing Windows API functions to fail.

Illegal instruction

An attempt was made to execute an illegal (or nonexistent) processor instruction. Normally, this can never happen. When it does, it usually means the program entered a random piece of memory, thinking that program instructions are stored there. It's usually an indication that some time before this point something has gone wrong, like a processor stack corruption. This can be caused by a function that tries to access a local buffer outside of its defined bounds. This is, by the way, how viruses misuse buffer overflow vulnerabilities in the various operating systems.

Privileged instruction

Some processor instructions are reserved for the so called supervisor mode. This is a processor mode, reserved for OS kernel routines and key device drivers. Normal applications (including games) run in user mode. In this mode, the privileged CPU instructions may not be executed. If a program attempts this anyway, then this exception follows. It usually indicates that program execution has entered a chunk of code that it wasn't supposed to enter. Again, as with the Illegal instruction, stack corruption is the most likely cause.

Stack overflow

This is a simple one. The memory, reserved for the stack, has been exhausted. Usually this happens when a routine calls itself (directly or indirectly) infinitely. It indicates a logic error in the program or a driver.
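A routine that calls itself without ever stopping is easy to demonstrate. The sketch below uses Python, which (unlike a native Windows program) guards its own call depth and raises a catchable `RecursionError` instead of letting the real stack run out. The function names are hypothetical and only serve the illustration.

```python
def count_down(n):
    # Bug: there is no 'if n == 0: return' base case, so every call
    # makes another call and the chain never ends on its own.
    return count_down(n - 1)


def count_down_fixed(n):
    # Same routine with the missing exit condition added.
    if n <= 0:
        return 0
    return count_down_fixed(n - 1)


def survives(func, *args):
    """Return True if func(*args) completes, False if the stack ran out."""
    try:
        func(*args)
        return True
    except RecursionError:
        return False
```

In a native program there is no such safety net: the stack simply runs out and the processor raises the exception described above.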

Floating point failures

This is a collection of related exceptions, all linked to floating point operations: things like division by zero, taking the square root of a negative value, that sort of thing. They usually indicate an error in the program's logic.
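The kinds of operation that belong to this family can be tried out safely in Python, where they surface as ordinary catchable errors rather than processor exceptions. The `classify` helper below is a hypothetical name for this sketch; the mapping from Python error types to failure kinds is a simplification.

```python
import math


def classify(operation):
    """Run a zero-argument callable and name the floating point failure, if any."""
    try:
        operation()
        return "ok"
    except ZeroDivisionError:
        return "division by zero"
    except ValueError:
        # math functions raise ValueError for arguments outside their domain.
        return "invalid operation"
    except OverflowError:
        return "overflow"
```

For example, `classify(lambda: math.sqrt(-1.0))` reports an invalid operation, and `classify(lambda: 1.0 / 0.0)` reports a division by zero.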

The system freezes completely, leaving the PC unusable

This is a very nasty condition. However, it has very little to do with the game itself, and a lot with the current system configuration. The most common cause of a full system freeze is a condition that has been named 'infinite loop' by Microsoft. This is, in fact, a system failure within the AGP section of your mainboard. Let me explain a bit.

How the AGP interface works

A modern day AGP video card is much more than simply an advanced version of the good old VGA card and its predecessors. Those were simply dumb frame buffer cards, and all of their memory contents were manipulated by the CPU. Nowadays, video chips are even more advanced than the main CPU itself. Together with the support chips on the video board they are, in fact, a separate computer all by themselves. Like the main CPU, the video card runs its own, highly specialized operating system and communicates with the rest of the system via the AGP interface. The communication can be initiated both by the video chip and the main CPU, and the AGP interface in the main board's chipset controls this communication.

When all goes well, you will never notice anything of this. You only see the result, which is a great looking image in your game of choice. However, things can, unfortunately, go horribly wrong. When the video card is not used as a dumb frame buffer card (something that the standard PCI VGA driver does), the main CPU does not manipulate the contents of the frame buffer directly. Instead, it tells the video processor what to do. The video processor then executes those commands. For this to work, the CPU must be able to tell the video chip what to do, and the video chip must be able to accept those commands. The AGP interface is what connects these two subsystems. Now, in order to speed up processing on both sides of the AGP interface, the chipset maintains a command queue, which buffers the various instructions until such time as the video chip is ready to process them. The size of this buffer is actually determined by the chipset that is in use on your motherboard.

What causes intermittent freezes

So, what happens if the CPU is stuffing commands into the AGP pipeline faster than the video chip can execute them? Well, sooner or later that buffer fills up. When that happens, the CPU will be stalled by the chipset until such time as the video chip has executed its current command, and retrieves the next pending one from the AGP pipeline. That will free up a slot at the other end. The CPU can now finish putting its command into the AGP pipeline. The stall is lifted, and the CPU is released by the chipset and can finally continue executing program instructions. If the video chip is slow at processing commands for any reason, then this stalling of the CPU by the main board's chipset will be perceived by you, the user, as a temporary system freeze.
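The fill-up-and-stall behaviour can be imitated with a small bounded queue. This is a deliberately crude model in Python, not a description of real chipset internals: the `simulate` function, the tick-based timing and the rates are all assumptions of the sketch. Setting `gpu_rate` to 0 models a video chip that retrieves no commands at all.

```python
from collections import deque


def simulate(ticks, queue_size, cpu_rate, gpu_rate):
    """Return the number of ticks in which the CPU was stalled.

    cpu_rate: commands the CPU tries to submit per tick.
    gpu_rate: commands the video chip retrieves per tick.
    """
    queue = deque()
    stalled_ticks = 0
    for _ in range(ticks):
        # The video chip takes commands out of the queue first.
        for _ in range(min(gpu_rate, len(queue))):
            queue.popleft()
        # The CPU then submits as many commands as still fit.
        submitted = 0
        while submitted < cpu_rate and len(queue) < queue_size:
            queue.append("cmd")
            submitted += 1
        if submitted < cpu_rate:
            stalled_ticks += 1  # queue full: the chipset holds the CPU
    return stalled_ticks
```

With matched rates the CPU never stalls. With a CPU twice as fast as the video chip and a queue of 4, the stalls begin after a few ticks and then recur on every tick. With `gpu_rate=0` the queue fills once and the CPU stays stalled from then on.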

A full system freeze

Things can become even worse if for some reason the video chip stops retrieving commands from the AGP pipeline. Then the temporary CPU stall becomes a permanent one. Since the CPU isn't allowed to execute new program instructions, it cannot respond to keystrokes, mouse clicks and what not. Even the sound card's interrupts won't be honored. That usually causes a sound card to repeat its most recently loaded sound fragment over and over again.

What can cause such a condition to occur? Well, as said previously, a modern video chip is a highly sophisticated mini computer with its own operating system. Like Windows, this OS can crash. When it crashes, it won't execute its program until it gets rebooted. A video reset could do the trick, but it's not easy to let the main CPU issue a reset command if the CPU itself is stalled because the AGP pipeline has filled up as a result of the video chip's crash. So a hard system reset or a power cycle is usually the only viable way out.

Crash caused by insufficient power

The most likely cause of a video card's crash is, believe it or not, insufficient power. Like it or not, modern day PCs are extremely power hungry. What's more, the tolerances for voltage fluctuations are significantly smaller than a couple of years ago. True, the tolerances are still rated as plus or minus 5%, but on today's AGP x8 boards that is 5% of 0.8 volt, and not 5% of 5 volt as it was a mere 5 years ago. This means that today's chips are far less forgiving if you have a power supply that is not completely up to the task. As a rule of thumb, a good power supply used in any Pentium 4 or AMD Athlon system which is paired with a modern AGP video board should be able to deliver at least 300 W. Be advised, this is not 300 W input, but 300 W output. Power supplies, when they operate, incur thermal loss. On a good power supply this is as little as 15%. On a bad one, this can be as high as 50%. As a second rule, the power supply must be able to deliver 21 Amps combined on the 3.3 and 5 volt power rails. This is not the same as simply adding up the separate Amps listings of the 3.3 and 5 volt rails. A good power supply will list the combined Amps as a separate rating.
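The numbers in this paragraph are easy to work out. The Python sketch below does the arithmetic; the function names are hypothetical, and it assumes that the quoted loss percentage is the share of the input power that is lost as heat.

```python
def tolerance_volts(nominal_volts, tolerance=0.05):
    """Absolute voltage swing allowed by a plus-or-minus tolerance."""
    return nominal_volts * tolerance


def input_watts(output_watts, loss_fraction):
    """Power drawn from the wall for a given delivered output.

    Assumption: loss_fraction is the part of the input lost as heat,
    so output = input * (1 - loss_fraction).
    """
    return output_watts / (1.0 - loss_fraction)
```

So 5% of 0.8 volt is a swing of only 0.04 volt, against 0.25 volt for 5% of 5 volt. Under the stated assumption, delivering 300 W takes roughly 353 W of input at 15% loss, and 600 W of input at 50% loss.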

Crash caused by overheating

A second cause of a video board crash is overheating. Modern video processors run hot, even hotter than your main processor. And while the main processor gets a big cooling solution, the video chip usually has nothing more than a large heat spreader and a small fan. What's worse, the mounting location of the AGP card itself in most computer casings is so bad that the tiny little cooling fan cannot suck in (enough) cool air and get rid of the heated air. And this causes the temperature to rise, especially if the chip is working hard, like in a game. When it overheats, a few things can happen. If you're lucky, the card has thermal protection and the chip simply stalls until it has cooled off a bit. As in the filled up AGP pipeline case, you will perceive this as a momentary systemwide freeze that lasts a couple of seconds. If you're unlucky, the chip starts behaving erratically or stops altogether. Again, this causes a permanent system freeze until a hard reset or power cycle.

Crash caused by buggy AGP driver

A third cause for a complete system freeze is the AGP driver software itself. Intel has written the specs for the AGP interface, and these specs allow for the main CPU and the video card to both access main RAM. So, it can happen that both want to access the same location at the same time. Normally this would not be a problem, as only one device can access main memory at any given time, and so either the CPU waits for the video processor or vice versa. However, this does interfere with another portion of the specs, specifically dealing with the CPU side of the communication. Intel has specified that all data transfers should happen in 64 bit chunks, or 8 bytes. Intel also specified that these chunks should always start at multiples of 8 bytes. However, there is a provision that allows access on the uneven 4 byte boundaries, in which case the actual data transfer is split into two separate ones. The first one deals with the lower 4 bytes (scaled up to 8 bytes), and the next one deals with the upper 4 bytes (also scaled to 8 bytes).
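The alignment rule can be illustrated with a few lines of Python. This is only one reading of the paragraph above, not something taken from the AGP specification: the `bus_transfers` function is hypothetical, and it assumes that an 8 byte access starting on an odd 4 byte boundary is served by two aligned 8 byte transfers, one for each chunk it touches.

```python
CHUNK = 8  # transfer size in bytes, as described in the text


def bus_transfers(address):
    """Return the start addresses of the aligned chunks an 8 byte access needs."""
    if address % 4 != 0:
        raise ValueError("only 4 byte aligned addresses are modelled here")
    first = address - (address % CHUNK)  # round down to a multiple of 8
    if address % CHUNK == 0:
        return [first]                   # aligned: one transfer is enough
    return [first, first + CHUNK]        # split: two separate transfers
```

An access at address 16 needs the single transfer `[16]`, while an access at address 20 needs `[16, 24]`. The gap between those two transfers is the window in which the other side can get in the way.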

While a driver is allowed to do this, it is highly discouraged. The reason why is very simple. As stated before, AGP allows the video processor to initiate memory access. What happens if the video processor wants access to the same memory location that the main CPU is dealing with right now, and it does this precisely between the two split up partial data transfers? Well, the mainboard's chipset refuses the video processor access, because part of the memory transfer concerning precisely that location hasn't finished yet. Allowing the video processor to proceed would alter the memory location, and this would corrupt the pending second half of the CPU's data transfer. By the same token, the second half of the CPU's data transfer will be rejected by the video processor. So both data transfers are essentially blocked, and both the video processor and the CPU are stalled, and cannot continue with their respective programs. Again, what we have here is a complete system freeze. This particular variant was the first confirmed case of a system freeze, and because the data transfer requests bounce back and forth between the video chip and the CPU indefinitely, Microsoft called this type of problem 'infinite loop'. To date, it is mostly VIA that is guilty of this type of failure, in its AGP driver, which is part of the VIA 4in1 driver package, later dubbed Hyperion drivers. That's why owners of VIA chipsets (especially the aging KT133 and KT266 models) are hit with the freeze more often than owners of other types of chipsets.

Crash caused through a bug elsewhere

This is something that is extremely hard to trace, and it's probably never fully reproducible. As you know (if you don't, look [here] for an explanation), a modern PC has something that is called the AGP Aperture. The size is controlled through a BIOS setting, and it sets aside a chunk of main RAM for exclusive use by the video card.

The problem is, the Windows OS (especially on Windows 9x platforms) does not really protect the RAM area that is set aside for the video card. In other words, it is possible for an application to access and (even worse) alter the contents of the memory in the Aperture area. When that happens, the video card's processor can start behaving erratically, or even crash completely. Early releases of Windows 2000 even did that themselves through the swapfile logic, in combination with an AMD Athlon processor. It turned out that the Athlon's level 1 and 2 processor caches worked differently from Intel's, and even though they worked as designed and were fully reliable, the OS code relied too much on the Intel method of caching. That could result in the swap file code overwriting parts of the Aperture memory area. Later service packs have corrected this problem.

Note that it's probably not intentional if (and when) an application alters the contents of the Aperture. Most of the time, an application contains a bug, which in turn causes it to access memory outside of its officially allocated memory. When such a location happens to be in the Aperture area, we can run into this kind of trouble. Mind you, a good OS (with proper processor support, naturally) would not let any application run amok like this. Unfortunately, Windows (even Windows XP) is only a mediocre OS in this respect. The Intel processors contain full protection logic all the way back to the 80386. Windows just doesn't make (full) use of the available protection mechanisms.

Possible solutions to AGP/video related crashes

If you get hit with any of these problems, there are a number of things you can do. Ultimately, all these measures seek to slow down the video card and/or the AGP interface. Less speed means less heat and fewer Amps drawn from the power supply.

Lower your AGP bus rating. Usually, the BIOS allows for manual selection between x1, x2, x4 or x8.

Disable sidebanding. Sidebanding is an AGP pipeline feature that implements a sort of passing lane for AGP commands, in which special AGP commands can bypass pending requests in the regular AGP pipeline buffer. Not all video chips implement this feature as robustly as they should.

Disable fast writes. Again, this will make your AGP pipeline slower, thus slowing down the video processor with it.

If your system came with power and temperature monitoring software, start it before running the game. While the game is running, check the readouts of the monitoring software to see if the temperature rises to critical levels, or if the voltages (especially on the 3.3 and 5 volt rails) fluctuate dangerously close to the 5% limit or even exceed it. If the temperature rises too high, you need better cooling. If the voltages fluctuate too much, you either need a better power supply, or the power regulators on your mainboard cannot cope and run too hot. Power regulators are those medium sized vertically mounted chips with a large heat spreader mounted on the back. If those turn out to be the cause of the trouble, there is little else you can do besides buying a new mainboard.

If you are the unlucky owner of a VIA chipset (like me), do not install the VIA 4in1 drivers. Instead, rely on the AGP support that comes by default in Windows itself, together with support from the video driver. I am not sure about ATI or other brands, but I know for a fact that NVidia drivers will correctly activate the AGP support for a VIA chipset if the VIA 4in1 drivers are not installed. Unfortunately, this leaves owners of Windows XP without a solution. Windows XP installs those pesky VIA 4in1 drivers right from the installation CD. In other words, you no longer have the choice of not installing those drivers. Downgrade to Windows 2000 if all else fails.