Help! My Practice Test Score Seems Wrong!

So you’ve taken your GMAT practice test, looked at your score, and investigated a little further. If you’re like many GMAT candidates, you’ve tried to determine how your score was calculated by:

  • Looking at the number you answered correctly vs. the number you answered incorrectly, and comparing that to other tests you’ve taken.
  • Analyzing your “response pattern” – how many correct answers did you have in a row? Did you have any strings of consecutive wrong answers?

And if you’ve taken at least a few practice tests, you’ve probably encountered at least one exam for which you looked at your score, looked at those dimensions above, and thought “I think my score is flawed” or “I think the test is broken.” If you’re taking a computer-adaptive exam powered by Item Response Theory (such as the official GMAT Prep tests or the Veritas Prep Practice Tests), here’s why your perception of your score may not match up with your actual, valid score:

The number of right/wrong answers is much less predictive than you think.
Your GMAT score is not a function of the number you answered correctly divided by the number you answered overall. Its adaptive nature is more sophisticated than that – essentially, its job is to serve you questions that help it narrow in on your true score. And to do so, it has to test your upper threshold by serving you questions that you’ll probably get wrong. For example, say your true score is an incredibly-high 790. Your test might look something like:

Are you better than average?  (You answer a 550-level question correctly.)

Ok, are you better than a standard deviation above average? (You answer a 650-level question correctly.)

Ok, you’re pretty good. But are you better than 700 good? (You answer a 700-level question correctly.)

Wow, you’re really good. But are you 760+ good? (You answer a 760-level question correctly.)

If you’re 760+ level are you better or worse than 780? (You answer a 780-level question correctly.)

Well, here goes…are you perfect? (You answer an 800-level question incorrectly.)

Ok, so maybe one or more of those earlier questions was a fluke. Are you better than 760? (You answer a 760-level question correctly.)

Are you sure you’re not an 800-level student? (You answer an 800-level question incorrectly.)

Ok, but you’re definitely better than 780, right? (You answer a 780-level question correctly.)

Are you sure you’re not 800-level? (You answer an 800-level question incorrectly.)

And this goes on, because it has to ask you 37 Quant and 41 Verbal questions. As the test continues and you keep answering questions at your own ability level correctly, it has to serve a question from the next level up to see whether it should increase its estimate of your ability.

The point being: because the system is designed to home in on your ability level, just about everyone misses several questions along the way. The percentage of questions you answer correctly is not a good predictor of your score, because factors like the difficulty level of each question carry substantial weight. So don’t simply count rights/wrongs on the test, because that practice omits the crucial IRT factor of difficulty level.
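To see why counting rights and wrongs tells you so little, here is a deliberately oversimplified Python sketch of the back-and-forth described above. It is not the GMAT’s algorithm: the 550 starting level, the fixed 40-point step, the 20-question length, and the rule that a test-taker always gets a question right when its level is at or below their true score are all assumptions made purely for illustration.

```python
def simulate_staircase(true_score, n_questions=20, start=550, step=40):
    """Toy adaptive test: step up after a right answer, down after a wrong one.

    Assumption (real test-takers are not this predictable): a question is
    answered correctly exactly when its level is at or below true_score.
    """
    level = start
    history = []
    for _ in range(n_questions):
        correct = level <= true_score
        history.append((level, correct))
        if correct:
            level = min(800, level + step)  # probe the next level up
        else:
            level = max(200, level - step)  # back off and re-check
    return history


def count_correct(history):
    """Number of right answers in a list of (level, correct) pairs."""
    return sum(1 for _, correct in history if correct)


for score in (790, 590):
    history = simulate_staircase(score)
    print(score, count_correct(history), "of", len(history))
```

In this toy version the 790-level test-taker answers 13 of 20 correctly and the 590-level test-taker answers 11 of 20. A 200-point gap in ability shows up as a difference of just two questions in the raw count, because both people spend most of the test bouncing around their own ceiling.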

Now, savvier test-takers will then often take this next logical step: “I looked at my response pattern of rights/wrongs and based on that it looks like the system should give me a higher score than it did.” Here’s the problem with that:

Of the “ABCs” of Item Response Theory, Difficulty Level is Only One Element (B)…
…and even at that, it’s not exactly “difficulty level” that matters, per se. Each question in an Item Response Theory exam carries three metrics along with it, the A-parameter, B-parameter, and C-parameter. Essentially, those three parameters measure:

A-parameter: How heavily should the system value your performance on this one question?

Like most things with “big data,” computer adaptive testing deals in probabilities. Each question you answer gives the system a better sense of your ability, but each comes with a different degree of certainty.  Answering one item correctly might tell the system that there’s a 70% likelihood that you’re a 700+ scorer while answering another might only tell it that there’s a 55% likelihood. Over the course of the test, the system incorporates those A-parameters to help it properly weight each question.

For example, consider that you were able to ask three people for investment advice: “Should I buy this stock at $20/share?” Your friend who works at Morgan Stanley is probably a bit more trustworthy than your brother who occasionally watches CNBC, but you don’t want to totally throw away your brother’s opinion either. Then, if the third person is Warren Buffett, you probably don’t care at all what the other two had to say; if it’s your broke uncle, though, you’ll weight him at zero and rely more on the opinions of the other two. The A-parameter acts as a statistical filter on “which questions should the test listen to most closely?”

B-parameter: This is essentially the “difficulty” metric but technically what it measures is more “at which ability level is this problem most predictive?”

Again, Item Response Theory deals in probabilities, so the B-parameter is essentially measuring the range of ability levels at which the probability of a correct answer jumps most dramatically. So, for example, on a given question, 25% of all examinees at the 500-550 level get it right; 35% of all those at the 550-600 level get it right; but then 85% of users between 600 and 650 get it right. The B-parameter would tell the system to serve that to examinees that it thinks are around 600 but wants to know whether they’re more of a 580 or a 620, because there’s great predictive power right around that 600 line.

Note that you absolutely cannot predict the B-parameter of a question simply by looking at the percentage of people who got it right or wrong! What really matters is who got it right and who got it wrong, which you can’t tell by looking at a single number. If you could go under the hood of our testing system or another CAT, you could pretty easily find a question that has a “percent correct” statistic that doesn’t seem to intuitively match up with that item’s B-parameter. So, save yourself the heartache of trying to guess the B-parameter, and trust that the system knows!

C-parameter: How likely is it that a user will guess the correct answer? Naturally, with 5 choices this metric is generally close to 20%, but since people often don’t guess quite “randomly” this is a metric that varies slightly and helps the system, again, determine how to weight the results.
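For readers who like to see the math, the textbook way to combine these three parameters is the three-parameter logistic (3PL) model. The sketch below is that standard formula, not something this article confirms as the exact form used by the GMAT or by our practice tests. The name `p_correct`, the standardized ability scale `theta` (where 0 is average, rather than the 200-800 score scale), and the sample parameter values are all assumptions for illustration.

```python
import math


def p_correct(theta, a, b, c):
    """Textbook 3PL probability of a correct answer.

    theta: test-taker ability (standardized scale, assumed here)
    a: how sharply the item separates abilities near b
    b: ability level where the probability climbs fastest
    c: floor probability from guessing
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Two hypothetical items with the same b and c but different a values.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    low_a = p_correct(theta, a=0.5, b=0.0, c=0.2)
    high_a = p_correct(theta, a=2.0, b=0.0, c=0.2)
    print(f"theta={theta:+.1f}  low-a item: {low_a:.2f}  high-a item: {high_a:.2f}")
```

When ability sits exactly at b and the guessing floor is 20%, this formula gives a 60% chance of a correct answer regardless of a. What a changes is how quickly that probability falls toward the guessing floor below b and rises toward 100% above it, which is why a high-a item tells the system more.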

With that mini-lesson accomplished, what does that mean for you? Essentially, you can’t simply look at the progression of right/wrong answers on your test and predict how that would turn into a score. You simply don’t know the A value and can only start to predict the “difficulty levels” of each problem, so any qualitative prediction of “this list of answers should yield this type of score” doesn’t have a high probability of being accurate.  Furthermore, there’s:

Question delivery values “content balance” more than you think.
If you followed along with the A/B/C parameters, you may be taking the next logical step which is, “But then wouldn’t the system serve the high A-value (high predictive power) problems first?” which would then still allow you to play with the response patterns for at least a reasonable estimate. But that comes with a bit more error than you might think, largely because the test values a fair/even mix of content areas a bit more than people realize.

Suppose, for example, that you’re not really all that bright, but you had the world’s greatest geometry teacher in high school and have enough of a gambling addiction that you’re oddly good with probability. If your first several high A-value problems are Geometry, Probability, Geometry, Geometry, Geometry, Probability… you might get them all right and have the test considering you a genius with such predictive power that it never actually figures out that you’re a fraud.

To make sure that all subject areas are covered and that you’re evaluated fairly, the test is programmed to put a lot of emphasis on content balancing, even though it means you’re not always presented with the single question that would give the system the most information about you.

If you have already seen a lot of Geometry questions and no Probability questions, and the best (i.e., highest A-value) question at the moment is another Geometry question, then the system may very well choose a Probability question. The people who program the test don’t give the system a lot of leeway in this regard: all topics need to be covered at about the same rate from one test taker to the next.
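A rough sketch of that trade-off in Python follows. The actual content-balancing rules are not described in this article, so the rule used here (serve from whichever content area is furthest below its target share, then take the most informative item within that area) is a hypothetical stand-in. The function names, item ids, and parameter values are also made up; the information formula is the standard one for a three-parameter logistic item.

```python
import math


def item_information(theta, a, b, c):
    """Standard Fisher information of a 3PL item at ability theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def pick_next(pool, theta, served, targets):
    """Hypothetical rule: most informative item from the most under-served area."""
    total = sum(served.values())

    def shortfall(area):
        share = served.get(area, 0) / total if total else 0.0
        return targets[area] - share

    neediest = max(targets, key=shortfall)
    # Fall back to the whole pool if the neediest area has no items left.
    candidates = [item for item in pool if item["area"] == neediest] or pool
    return max(
        candidates,
        key=lambda item: item_information(theta, item["a"], item["b"], item["c"]),
    )


pool = [
    {"id": "G7", "area": "geometry", "a": 1.8, "b": 0.5, "c": 0.2},
    {"id": "P3", "area": "probability", "a": 0.9, "b": 0.5, "c": 0.2},
]
served = {"geometry": 4, "probability": 0}
targets = {"geometry": 0.5, "probability": 0.5}

chosen = pick_next(pool, theta=0.5, served=served, targets=targets)
print(chosen["id"])
```

Here the Geometry item is the more informative of the two, yet the sketch serves the Probability item (`P3`) because four Geometry questions have already been delivered and no Probability questions have.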

So simply put: Some questions count more than others, and they may come later in the test as opposed to earlier, so you can’t quite predict which problems carry the most value.
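To make “some questions count more than others” concrete, the toy Python example below scores two test-takers who each get exactly one of two questions right. Both questions have the same B-parameter, but one has a much higher A-parameter. Everything here is an assumption for illustration: the textbook three-parameter logistic formula, the standardized `theta` ability scale, the two-item “test,” and the simple grid-search estimate, which stands in for whatever estimation method a real CAT uses.

```python
import math


def p_correct(theta, a, b, c):
    """Textbook 3PL probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-probability of the observed right/wrong pattern at ability theta."""
    total = 0.0
    for (a, b, c), correct in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p if correct else 1.0 - p)
    return total


def estimate_theta(items, responses):
    """Grid-search maximum-likelihood estimate over theta in [-4, 4]."""
    grid = [step / 100.0 for step in range(-400, 401)]
    return max(grid, key=lambda theta: log_likelihood(theta, items, responses))


# Two hypothetical items as (a, b, c): same b and c, very different a.
items = [(2.0, 1.0, 0.2), (0.5, 1.0, 0.2)]

got_high_a_right = estimate_theta(items, [True, False])
got_low_a_right = estimate_theta(items, [False, True])
print(got_high_a_right, got_low_a_right)
```

Both test-takers went one for two on equally “difficult” questions, but the one who answered the high-A question correctly ends up with the higher ability estimate. A right/wrong tally cannot show that difference.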

Compounding that is:

Some questions don’t count at all.
On the official GMAT and on the Veritas Prep Practice Tests, some questions are delivered randomly for the express purpose of gathering information to determine the A, B, and C parameters for use in future tests. These problems don’t count at all toward your score, so your run of “5 straight right answers” may only be a run of 3 or 4 straight.

And then of course there is the fact that:

Every test has a margin of error.
The official GMAT suggests that your score is valid with a margin of error of +/- 30 points, meaning that if you score a 710 the test is extremely confident that your true ability is between 680 and 740, but also that it wouldn’t be surprised if tomorrow you scored 690 or 720. That 710 represents the best estimate of your ability level for that single performance, but not an absolutely precise value.

Similarly, any practice test you take will give you a good prediction of your ability level but could vary by even 30-40 points on either side and still be considered an exceptionally good practice test.
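If it helps to think of a score as a band rather than a point, the arithmetic is simple enough to sketch in Python. The helper names are made up, the 200-800 limits assume the standard total-score scale, and treating the margin as a hard cutoff is a simplification, since a real margin of error is a statement about probability rather than a wall.

```python
def score_band(score, margin=30, floor=200, ceiling=800):
    """Range of scores treated as consistent with one observed score."""
    return max(floor, score - margin), min(ceiling, score + margin)


def within_band(observed, other, margin=30):
    """True if `other` falls inside the band around `observed`."""
    low, high = score_band(observed, margin)
    return low <= other <= high


print(score_band(710))
print(within_band(710, 690), within_band(710, 720))
```

With a 30-point margin, a 710 corresponds to a band of 680 to 740, so a follow-up score of 690 or 720 is entirely consistent with the same underlying ability.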

So for the above reasons, a test administered using Item Response Theory is difficult to score by eye: IRT involves several metrics and nuances that you just can’t see. And, yes, some outlier exams will not seem to pass the “sniff test” – the curriculum & instruction team here at Veritas Prep headquarters has seen its fair share of those, to be sure.

But time and time again the data demonstrates that Item Response Theory tests provide very reliable estimates of scores; a student whose “response pattern” and score seem incompatible typically follows up that performance with a very similar score amidst a more “believable” response pattern a week later.

What does that mean for you?

  • As hard as it is to resist, don’t spend your energy and study time trying to disprove Item Response Theory. The only score that really matters is the score on your MBA application, so use your time/energy to diagnose how you can improve in preparation for that test.
  • Look at your practice tests holistically. If one test doesn’t seem to give you a lot to go on in terms of areas for improvement, hold it up against the other tests you’ve taken and see what patterns stand out across your aggregate performance.
  • View each of your practice test scores more as a range than as an exact number. If you score a 670, that’s a good indication that your ability is in the 650-690 range, but it doesn’t mean that somehow you’ve “gotten worse” than last week when you scored a 680.

A personal note from the Veritas Prep Academics team:
Having worked with Item Response Theory for a few years now, I’ve seen my fair share of tests that don’t look like they should have received the score that they did. And, believe me, the first dozen or more times I saw that my inclination was, “Oh no, the system must be flawed!” But time and time again, when we look under the hood with the psychometricians and programmers who consulted on and built the system, Item Response Theory wins.

If you’ve read this far and are still angry/frustrated that your score doesn’t seem to match what your intuition tells you, I completely understand and have been there, too. But that’s why we love Item Response Theory and our relationship with the psychometric community: we’re not using our own intuition and insight to try to predict your score, but rather using the scoring system that powers the actual GMAT itself and letting that system assess your performance.

With Item Response Theory, there are certainly cases where the score doesn’t seem to precisely match the test, but after dozens of my own frustrated/concerned deep dives into the system I’ve learned to trust the system.  Don’t try to know more than IRT; just try to know more than most of the other examinees and let IRT properly assign you the score you’ve earned.

Getting ready to take the GMAT? We have free online GMAT seminars running all the time. And as always, be sure to follow us on Facebook, YouTube, Google+, and Twitter!

By Brian Galvin and Scott Shrum.