Modern AI tools have taken the world by storm, with countless models available to accomplish a cornucopia of tasks. From the lowly classification problem all the way up to prompt-based text, video, and audio generation, AI seems capable of just about anything.
The key phrase in the previous sentence is "seems capable."
Recently, researchers at MIT found that earlier reports of ChatGPT's ability to ace the Bar Exam were overstated. OpenAI (the developer of ChatGPT) had originally claimed that ChatGPT scored in the 90th percentile on the examination. An analysis by an MIT team found that while ChatGPT scored in the 90th percentile compared to people who had to retake the exam because they failed, its performance fell below the 50th percentile when compared to people who took the exam for the first time. The AI also earned its lowest marks in the generation of legal essays - an essential task of effective law practice.
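The mechanism behind this gap can be illustrated with a toy simulation. The score distributions below are entirely invented for illustration; the only assumption built in is that repeat examinees, having failed before, tend to score lower than first-time examinees:

```python
import random

# Invented, purely illustrative score distributions:
# repeat examinees (prior failures) are centered lower than first-timers.
random.seed(0)
first_time = [random.gauss(75, 10) for _ in range(10_000)]
repeat = [random.gauss(62, 10) for _ in range(5_000)]

def percentile_rank(score, pool):
    """Percent of the pool scoring below `score`."""
    return 100 * sum(s < score for s in pool) / len(pool)

candidate = 80
rank_first_time = percentile_rank(candidate, first_time)
rank_combined = percentile_rank(candidate, first_time + repeat)

# The identical score earns a noticeably higher percentile rank when the
# comparison pool is diluted with lower-scoring repeat examinees.
```

The same raw score looks far more impressive against the combined pool than against first-time examinees alone, which is exactly the kind of overstatement the MIT analysis identified.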
The finding by MIT researchers helps highlight some of the reasons that we at the ARRT report statistics for first-time examinees only. For one, people who are retaking the exam have previously failed it - adding a pool of lower scores that shifts the whole distribution downward. Comparing someone's score to a distribution that includes repeat examinees can overestimate that individual's ability. Second, first-time examinees have not previously been exposed to the contents of a given examination. As a result, there is no risk of a testing effect - a phenomenon in which scores rise on repeated attempts simply because of prior exposure to the test. Finally, analyzing repeat test takers introduces complexities of its own: researchers must answer several methodological questions before the analysis can even begin. Just a few are:
· Is it better to look at each individual score for the same person, or take an aggregation statistic of the multiple scores?
· If we do choose to use an aggregation statistic, which one (e.g., mean, median, etc.)?
· If we are using individual scores, will all test administrations or only some be used in the analysis?
Each of these questions needs its own justification, and different researchers will make different choices - reducing the reproducibility of findings. By contrast, when only first-time examinees are analyzed, every examinee has exactly one score, so none of these questions arise and findings are easier to reproduce.
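The research questions above are not hypothetical quibbles - each choice yields a different number. A minimal sketch, using an invented score history for one hypothetical repeat examinee:

```python
from statistics import mean, median

# Hypothetical score history for one repeat examinee (values invented).
attempts = [68, 74, 71, 79]

# Each defensible analytic choice summarizes the same examinee differently.
summaries = {
    "mean of all attempts": mean(attempts),
    "median of all attempts": median(attempts),
    "first attempt only": attempts[0],
    "most recent attempt only": attempts[-1],
}
```

Four reasonable choices give four different summary scores for the same person, so two studies of the same data can reach different conclusions purely through methodology. First-time examinees sidestep the problem entirely: one person, one score.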
It is important to be a skeptical inquirer about the science journalism we see. The first time you hear a finding from the science media, it is wise to suspend judgement until more studies are conducted. A first study often misses nuance that subsequent studies provide. One of the best ways of learning "what the science says" is to wait for a meta-analysis - a synthesis of many studies into a more generalizable set of statistics. Above all, be cautious when reading science literature; wait for the truth to put on its running shoes.