The Precision of Exams

Back when I was at IIT Bombay, I often heard the refrain that the JEE was much tougher than the SAT, and this fact was offered as evidence of the high standard of the IITs and their students.

I do not want to discuss the standards of the IITs here, but I do want to argue against the “toughness” of exams as a measure of how good they are. An exam is a measuring instrument. A measuring instrument should be graded on how well it measures what it is expected to measure, and on whether it delivers the precision required of it. It is obviously a bad idea to use a weighing pan to measure the width of a road, but it is also a bad idea to use vernier callipers for the purpose, or to expect the width of a road to be accurate to a tenth of a millimetre.

Exams too should be evaluated against similar criteria. I can think of three types of exams, each with a distinct purpose, and I am sure there are more that I have not thought of. I’ll call them Types 1, 2 and 3.

Type 1 exams, exemplified by the JEE or the CAT, are those whose purpose is to select the very best of the applicants and to work out fine gradations among them. The JEE assigns ranks to around 2% of those who take it. It has to reliably distinguish between the person who obtained the 10th rank and the one who obtained the 100th. It achieves this purpose by being an extremely tough exam. But the price it pays for its difficulty is that almost everyone below the top 2% scores close to zero, and the exam therefore fails to distinguish someone who performed at the 95th percentile from someone who performed at the 85th.

That is where Type 2 exams come in. The SAT and the GRE are of this type. Our board exams should belong here too, but I’ll have more to say about them later. The objective of these exams is to grade the entire population that takes them along a curve. Type 2 exams cannot be as tough as Type 1 exams. They must have a mixture of questions of varying levels of difficulty. They can reliably distinguish between the 90th and the 60th percentile, but they shouldn’t be used to draw fine distinctions between test-takers at the 99.2nd and 99.5th percentiles. The margin of error is too large for that.
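The contrast can be made concrete with a toy calculation (an illustration of my own, not part of the original argument): under a simple logistic item-response model, a paper made up entirely of very hard questions leaves both the 85th- and 95th-percentile candidates scoring near zero, while a paper with a spread of difficulties separates them cleanly.

```python
# Toy item-response model (my own sketch, with made-up difficulty numbers):
# the probability of answering an item correctly is a logistic function of
# (ability - difficulty), and the expected score is the mean over all items.
from math import exp
from statistics import NormalDist

def expected_score(ability: float, difficulties: list[float]) -> float:
    """Expected fraction of items answered correctly."""
    return sum(1 / (1 + exp(-(ability - d))) for d in difficulties) / len(difficulties)

hard_paper  = [4.5] * 30                  # every item very hard (Type 1 style)
mixed_paper = [-2, -1, 0, 1, 2, 3] * 5    # difficulties spread out (Type 2 style)

for pct in (85, 95, 99):
    ability = NormalDist().inv_cdf(pct / 100)   # ability at that percentile
    print(f"{pct}th pct: hard {expected_score(ability, hard_paper):.2f}, "
          f"mixed {expected_score(ability, mixed_paper):.2f}")
```

On the hard paper, the 85th- and 95th-percentile candidates both expect to get only a question or two right out of thirty, so any gap between them is mostly noise; on the mixed paper, the same pair is separated by a clearly measurable gap in expected score.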

When elite American colleges use the SAT or the GRE in their admission decisions, they do not use the exam score as the only criterion. They use the score as a cut-off and then make the final decision based on other parameters, or use it as one criterion among many. That is the correct thing to do. The scores simply do not provide enough precision to be used on their own.

I often hear proposals to expand the scope of exams like the JEE and turn them into entrance exams for more engineering colleges than just the IITs. When you do that, you are imbuing what was formerly a Type 1 exam with more Type 2-ness. That is not a problem if you do it consciously and carefully. But you also need to be aware that the JEE will then be less able to satisfy Type 1 requirements: it will no longer be possible to make fine-grained distinctions between the top performers using JEE results. I am concerned that our policymakers are too casual about such trade-offs and do not consider them when making decisions.

Type 3 exams are the easiest of the three. They are like the assessments I have to pass to prove that I have paid attention to the mandatory web-based trainings I take every year at my workplace. I work for an American bank. As part of a compliance requirement, all employees undergo trainings related to Know-Your-Customer (KYC) and Anti-Money Laundering (AML) rules. The tests I have to take at the end are easy, and the passing score is 80%. The objective of these tests is to validate that everyone at the bank has a baseline knowledge of the contents of the courses, not to select the best among them. Getting 100% of the answers right won’t get you an automatic job offer from the compliance department. The objective isn’t even to grade everyone on a curve. Everyone eventually scores between 80 and 100 percent, and there’s really no material difference between the capabilities of someone who scored 80 and someone who scored 100.

I hope I’ve made the case for why the toughness of an exam isn’t a good measure of its quality, and why we should instead ask what an exam is meant to measure and how well it measures it. How do our board exams measure up?

Board exams should be Type 2 exams. Their purpose is to grade everyone who takes them on a curve. A student who has achieved what we consider the lowest acceptable level of competence in a subject must get passing marks. The best students should get the highest marks. Between the passing and the highest scores, there should be a wide spread that enables us to reliably distinguish one test-taker’s performance from another’s.

Unfortunately, that is not how things are. Our board exams are too easy. I would estimate that a minimally competent student can easily score 75% on them. The boards set the bar this low because most schools have abysmal standards of education; if the exams were calibrated to a reasonable level of difficulty, failure rates would be too high, triggering a political backlash. Once in a while, we see reports in newspapers that a particular CBSE exam had too many questions that were “out of syllabus”. I suspect that most of the time, “out of syllabus” just means that the paper didn’t conform to the pattern that students expected it to follow.

Why do these exams follow “patterns”? Apart from the political reason already referred to, there is the expense of calibration. There aren’t enough qualified teachers to correct these papers, and preparing a key that enforces consistency in marking is harder for a tough paper. It is far easier to set a limited pool of questions with model answers provided, so that examiners can correct by pattern recognition.

The upshot of all this is that these exams, which should be Type 2, are closer to Type 3. Anyone with a baseline level of knowledge should score very high, and there is really not much difference between someone who scored 85% and someone who scored 95%.

Yet in the hype created around the results, and in the impact they have on students’ prospects, these exams are treated as Type 1. Ranks are assigned over tenths of percentage points, and admissions to courses at premier educational institutions are won or lost over them.

So how do you “crack” an exam with predictable questions, one that is corrected by pattern recognition? You mug up the model answers. You go to coaching classes whose goal is not to supplement the knowledge you’ve gained at school, but to replace what schools are supposed to teach you with a database of questions and answers in your mind.

I’ve seen people defend rote learning as one of the valid ways of gaining knowledge. The point, though, is that while there may very well be a case for learning some facts or multiplication tables or formulas by rote, learning model questions and answers by rote is not something anyone can reasonably defend.

These exams aren’t the primary cause of what’s wrong with our educational system, but fixing them is where you should start: if you fix the exams, the incentive to mug things up will diminish, and change will follow down the line.