The GMAN wrote:
The test changed sometime in the late '90s, when the math/quant part became adaptive in nature. So the test isn't as standardized as it had been before, when everyone got a test booklet and more or less took the same/similar test. The test difficulty now depends on how well each test taker is doing on the test. So someone who scored a 760 took a much harder test than someone who scored a 560, because each correct answer yields a more difficult following question.
Reminds me of a bunch of test and measurement work I did in grad school. The early/mid-90s was about the time computerized adaptive testing (CAT) was coming into play. An adaptive test is still a standardized test. Although candidates see different questions, the eventual score (e.g. a 760 or 560) is still the score they would have gotten if they had all seen the same set of questions.
Adaptive testing is designed to be more efficient: instead of a test with 100 questions (or some number), you can reliably and validly assess a candidate after 30 questions (or some other number). In test and measurement terms, reliability means the measures are consistent; validity means the test accurately measures what it purports to measure (e.g. smarts). You need both reliability and validity in the test design process.
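To make the adaptive mechanics concrete, here's a minimal sketch of how a CAT engine might work, assuming a Rasch (1PL) item response model. All the names and the item bank are made up for illustration; real tests like the GMAT use more elaborate models and calibrated item pools, not this toy grid-search estimator.

```python
import math
import random

def p_correct(theta, b):
    """Rasch (1PL) model: probability of answering correctly,
    given ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses):
    """Crude maximum-likelihood ability estimate over a grid.
    responses: list of (item difficulty, answered correctly) pairs."""
    grid = [t / 10.0 for t in range(-40, 41)]  # theta from -4.0 to 4.0
    best, best_ll = 0.0, float("-inf")
    for t in grid:
        ll = 0.0
        for b, correct in responses:
            p = p_correct(t, b)
            ll += math.log(p if correct else 1.0 - p)
        if ll > best_ll:
            best, best_ll = t, ll
    return best

def run_cat(true_theta, item_bank, n_items, rng):
    """Administer an adaptive test: each step picks the unused item
    whose difficulty is closest to the current ability estimate
    (which maximizes Fisher information under the Rasch model),
    so a strong candidate quickly ends up seeing harder items."""
    remaining = list(item_bank)
    responses = []
    theta_hat = 0.0  # start everyone at the bank average
    for _ in range(n_items):
        b = min(remaining, key=lambda d: abs(d - theta_hat))
        remaining.remove(b)
        correct = rng.random() < p_correct(true_theta, b)
        responses.append((b, correct))
        theta_hat = estimate_theta(responses)
    return theta_hat

if __name__ == "__main__":
    rng = random.Random(42)
    bank = [b / 4.0 for b in range(-12, 13)]  # difficulties -3.0 to 3.0
    print(run_cat(true_theta=1.5, item_bank=bank, n_items=20, rng=rng))
```

This is why two candidates can see different questions but land on comparable scores: the items are calibrated on a common difficulty scale, and the score is the ability estimate, not the raw number correct.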
In tests like the GMAT, MCAT, GRE, LSAT, and the rest of the alphabet soup of tests, these calibrations are built in and usually on the mark. In a typical survey, often none of this is designed in at all, which can wildly skew results.