The Programming Contest Problem
Throughout my studies at Rutgers, I participated in the ACM International Collegiate Programming Contest (ICPC from here on). As with many other programming contests, an ICPC event is structured as a set of problem statements that teams must write programs to solve. The problems range from subtle or not-so-subtle cues to apply standard algorithms, to puzzles that would be hard to solve in a week, much less in the five hours allotted for the contest. Choosing which problems to tackle, and parceling them out to the appropriate team members, is itself a strategic challenge.
Someone unfamiliar with software development might wonder how a handful of judges can examine the programs submitted by dozens of teams. This is of course impossible; it is hard to check code by hand at all, and these hastily written programs, with their convoluted and error-prone algorithms, are particularly inscrutable. So rather than doing anything directly with the source code, the judging system runs each program automatically and checks its behavior on a number of test cases.
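To make the setup concrete, here is a minimal sketch of the kind of harness a judge might run; the directory layout and the convention of pairing foo.in with foo.out are my own assumptions, not anything the ICPC specifies:

    import subprocess
    from pathlib import Path

    def judge(command, test_dir="tests"):
        """Run a submitted program on every test case and compare its output."""
        for case in sorted(Path(test_dir).glob("*.in")):
            expected = case.with_suffix(".out").read_text()
            result = subprocess.run(
                command, stdin=case.open(), capture_output=True, text=True)
            # Compare whitespace-separated tokens; real judges vary in strictness.
            verdict = "OK" if result.stdout.split() == expected.split() else "WRONG ANSWER"
            print(f"{case.name}: {verdict}")

    judge(["python3", "solution.py"])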
Automated testing has its own pile of problems. Programming errors can cause a program to run forever or consume all the memory on the system. Malicious programs may attempt to damage the judge's system or interfere with the running of other programs. But these are all technical issues, and there are standard ways of dealing with them: programs that run too long are terminated and treated as incorrect, while memory use and access to external resources are restricted by the execution environment. The interesting problem, which comes up in real-world software testing as well, is how to design the test cases.
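Enforcing those limits is similarly routine on a Unix-like judging host; something along these lines, with placeholder limits of my own choosing, is the general idea (preexec_fn and RLIMIT_AS are Unix-only):

    import resource
    import subprocess

    TIME_LIMIT = 5                      # seconds; a placeholder, not an official value
    MEMORY_LIMIT = 256 * 1024 * 1024    # bytes

    def limit_memory():
        # Runs in the child process just before the submitted program starts (Unix only).
        resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT, MEMORY_LIMIT))

    def run_limited(command, input_text):
        try:
            result = subprocess.run(
                command, input=input_text, capture_output=True, text=True,
                timeout=TIME_LIMIT, preexec_fn=limit_memory)
        except subprocess.TimeoutExpired:
            return "TIME LIMIT EXCEEDED", ""
        if result.returncode != 0:
            return "RUNTIME ERROR", ""
        return "RAN", result.stdout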
Once the test suite is assembled, programs can be run automatically against each test case in turn. A tricky point is that some problems may have many valid outputs for a given input. If so, the judges may need to provide a program that checks the answers. Even outside of contests this can be helpful; a program that checks an answer is often much simpler than one that generates it, although this is far from universally true. Having a checking program also makes it possible to generate test cases automatically, which allows for much more thorough testing than a hand-built set of tests. In my experience with the ICPC this was generally not the practice; test cases were usually of the form "input" and "expected output", and any meaningful deviation from the expected output was counted as a failure.
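When a checker is used, though, it can be remarkably small. As an invented example, suppose the problem asks for any topological ordering of a dependency graph: many orderings may be valid, but the checker only has to confirm that every node appears once and every edge points forward:

    def check_topological_order(nodes, edges, claimed_order):
        """Accept any ordering that lists each node once with all edges pointing forward."""
        if sorted(claimed_order) != sorted(nodes):
            return False
        position = {node: i for i, node in enumerate(claimed_order)}
        return all(position[u] < position[v] for u, v in edges)

Writing this takes a few lines; writing the solver, or enumerating every acceptable output by hand, is considerably more work.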
Now to the problem, at least as it relates to learning games. In the ICPC, the judging system does not tell you which test cases failed, and indeed most of the test cases are kept secret from the contestants. Partly this is because the contest is meant as an assessment, not a learning exercise, so giving useful feedback is not an intended part of the process (in fact, incorrect submissions are explicitly penalized). But the more fundamental limitation is that it is possible to create a program that passes all the test cases without being a general solution to the problem. When the input space is small, every possible input can be tested, but more commonly the set of test cases simply does not completely specify the problem.
It is hard to design test cases that test adequately when you are trying to spot bugs, but a good semi-random spread of test cases plus well-chosen special cases has a good chance of catching semi-random errors and common mistakes. People, on the other hand, are very good at picking out patterns in your semi-random data. If the test cases were available and admitted a simpler solution, corners could be cut to satisfy only the tests. With computer programs this is trivially possible by simply hard-coding a special case for each input! Thus the judges choose obscurity: if contestants know not where the tests will fall, they must cover every possibility suggested by the problem statement. (This is not to say that the problem statement always completely defines the problem! Judges are only human, clarifications are often necessary, and unexpected solutions are occasionally accepted manually.)
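To see how trivial the hard-coding trick is, suppose the publicly visible tests for a "compute n!" problem used n = 3, 7, and 10 (an invented example). A contestant who had seen them could pass all three without computing anything:

    # A "solution" that only knows the public test cases.
    CANNED_ANSWERS = {"3": "6", "7": "5040", "10": "3628800"}

    n = input().strip()
    print(CANNED_ANSWERS[n])   # Wrong (or crashing) on every hidden input.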
It sounds like this is solved, then: simply give the player a clear problem statement and hide the test cases. So why do I call it a problem? If we want to use a "programming contest" arrangement for learning, we absolutely must provide useful feedback, and showing how the test cases fail is essential to that. However, if we show the player the case that doesn't work, we run the risk of him treating each case specially and ignoring the problem statement (for which I don't blame him; English problem statements seem completely out of place in games). One solution is to limit the code size, in an attempt to restrict the player to the least code necessary to handle the problem. A limited code space, if it is sufficiently small, also gives us the benefit of exhaustively testing every program expressible in that space, to ensure that there is no unexpected solution that misses the point of the problem.
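For a sufficiently tiny language that exhaustive sweep really is feasible. A sketch, assuming a toy instruction set and a reference solution to compare against (probe_inputs stands in for whatever slice of the input space we can afford to cover):

    from itertools import product

    # A toy instruction set; a program is a short sequence of these.
    INSTRUCTIONS = {
        "inc": lambda x: x + 1,
        "dec": lambda x: x - 1,
        "double": lambda x: x * 2,
        "square": lambda x: x * x,
    }

    def run_program(program, x):
        for name in program:
            x = INSTRUCTIONS[name](x)
        return x

    def unexpected_solutions(reference, public_inputs, probe_inputs, max_length=3):
        """List programs that pass the public tests without matching the reference everywhere."""
        found = []
        for length in range(1, max_length + 1):
            for program in product(INSTRUCTIONS, repeat=length):
                passes_public = all(run_program(program, x) == reference(x) for x in public_inputs)
                fully_correct = all(run_program(program, x) == reference(x) for x in probe_inputs)
                if passes_public and not fully_correct:
                    found.append(program)
        return found

If the search turns up nothing, we have some assurance that any program fitting in the space and passing the tests is the intended kind of solution.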
The excessive restrictions needed to make this work suggest a more radical approach, which is nevertheless completely in line with gaming. What if we do away with the tyranny of the problem statement? The player's pattern-finding ability deserves exercise; if he can find a pattern in the data that lets him take a shortcut, then bravo, he is playing a more interesting game! A scoring system based on code size could discourage degenerate case-by-case programs without necessarily imposing a hard limit. The problem statement could be treated as a helpful suggestion, possibly not even provided initially if the test cases can be presented coherently. It is interesting to consider the connection between impasse-driven learning and test-driven development; if building programs is like building knowledge, then perhaps we can help our players do both at once.
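One rough guess at how such a code-size scoring rule might look (the particular weight is entirely a tuning knob I invented):

    def score(tests_passed, total_tests, code_length, weight=0.5):
        # Correctness dominates, but every extra token of code costs something,
        # so a program that hard-codes each case grows (and loses points)
        # linearly with the number of cases it memorizes.
        return 100.0 * tests_passed / total_tests - weight * code_length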
[update 12/16] I am only just becoming familiar with Kurt VanLehn's work through his 1988 paper Toward a Theory of Impasse-Driven Learning. I linked to an abstract of that paper above, but it would have been better to refer directly to his repair theory, which is the big idea here.