Mark D. Shermis, Ben Hamner

This study compared the results from nine automated essay scoring (AES) engines on eight essay scoring prompts drawn from six states that annually administer high-stakes writing assessments.

H1: Human Rater 1
H2: Human Rater 2
AIR: American Institutes for Research
CMU: Carnegie Mellon University
CTB: CTB McGraw-Hill
ETS: Educational Testing Service
MI: Measurement, Inc.
MM: MetaMetrics
PKT: Pearson Knowledge Technologies
PM: Pacific Metrics
VL: Vantage Learning

The results demonstrated that overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items.


  1. shinichi Post author

    In the first national study comparing vendors… PEG Software Wins Automated Essay Scoring Competition

    http://www.urbanplanetmobile.com/news/in-the-first-national-study-comparing-vendors-peg-software-wins-automated-essay-scoring-competition

    In a study released today at the annual meeting of the National Council on Measurement in Education (NCME), MI’s Project Essay Grade (PEG) software delivered the most accurate scores in eight separate evaluations of student writing. PEG’s ability to predict scores outpaced eight other vendors of automated scoring software and proved even more reliable than two professional human judges. “We are extremely gratified that PEG software performed so well,” said Michael Bunch, Ph.D., Senior Vice President of Research and Development at Measurement Incorporated. “Although PEG research dates back more than 40 years to the original work of Ellis Batten Page, this is the first study that actually compared the relative performance of commercial scoring engines.”

    About the Study

    In January, the Hewlett Foundation invited Measurement Incorporated (MI) and eight other vendors of artificial intelligence (AI) scoring of student essays to participate in the Automated Scoring Assessment Prize (ASAP) competition. The competition included essays written to eight different prompts by students in various grade levels.

    Each essay had been scored by two professionally trained readers. Their scores were used to “train” the AI scoring programs. All such programs “learn” by reading digitally entered essays and their associated scores and drawing conclusions about how essay content leads to a given score.

    After providing sample essays for vendors to train their programs, the Hewlett Foundation sent 4,000 more essays for the vendors to score within three days. Dr. Mark Shermis, a recognized authority on automated essay scoring, then checked the work of each vendor, comparing their results to scores previously given to the same essays by trained (human) readers. He also computed agreement indices for pairs of human readers.

    The agreement index Dr. Shermis used was more stringent than simple agreement in that it did not give credit for chance agreement. For example, on a four-point scale, two readers assigning scores at random with equal probability would agree about one-fourth of the time by chance alone. The index has been used for the past 50 years and is well documented. After computing an agreement index for each of the eight essay prompts for each vendor, it is also possible to average over all eight prompts to produce a single score for each vendor. The results are shown in the figure below.
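The release does not name the agreement index, so the following is only a minimal sketch of one chance-corrected statistic of this kind, assuming a kappa-style index (Cohen's kappa, with an optional quadratic weighting that penalizes large score differences more than small ones). The function names `weighted_kappa` and `mean_agreement` are hypothetical, and the simple arithmetic mean over prompts follows the description above rather than any confirmed detail of the study.

```python
from statistics import fmean


def weighted_kappa(scores_a, scores_b, min_score, max_score, quadratic=True):
    """Chance-corrected agreement between two raters on an integer scale.

    With quadratic=False this is plain Cohen's kappa (any disagreement
    counts fully); with quadratic=True disagreements are weighted by the
    squared distance between the two scores.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("scale must have at least two score points")
    n = len(scores_a)

    # observed[i][j] = number of essays scored i by rater A and j by rater B
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(scores_a, scores_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            if quadratic:
                weight = (i - j) ** 2 / (k - 1) ** 2
            else:
                weight = 0.0 if i == j else 1.0
            # Count expected by chance if the two raters were independent
            expected = hist_a[i] * hist_b[j] / n
            observed_disagreement += weight * observed[i][j]
            chance_disagreement += weight * expected

    if chance_disagreement == 0.0:
        # Both raters gave every essay the same single score: undefined
        return float("nan")
    return 1.0 - observed_disagreement / chance_disagreement


def mean_agreement(per_prompt_indices):
    """Average the per-prompt indices into one figure per vendor."""
    return fmean(per_prompt_indices)
```

For instance, `weighted_kappa([1, 1, 2, 2], [1, 2, 1, 2], 1, 2, quadratic=False)` gives 0.0 even though the two raters agree on half of the essays, because half is exactly the agreement expected by chance for those score distributions.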

    Measurement Incorporated’s PEGScores achieved the highest agreement index, as the figure shows. In addition, five vendors (MI and vendors A-D) had higher agreement indices than human readers had with each other (as shown by the dashed line at .75). The dashed line near the bottom of the graph shows the agreement that could be achieved simply by estimating each essay’s score based on the number of words it contained (word count).
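The release does not say how the word-count estimate was produced. As a rough illustration only, the sketch below assumes a least-squares line fitted from word count to score on the training essays, with predictions rounded and clipped to the score range seen in training. The function name `fit_word_count_baseline` and the whitespace-based word count are assumptions, not details from the study.

```python
def fit_word_count_baseline(essays, scores):
    """Fit score ~ intercept + slope * word_count and return a predictor."""
    if len(essays) != len(scores) or not essays:
        raise ValueError("need non-empty essays and scores of equal length")

    # Word count here is a simple whitespace split (an assumption)
    counts = [len(essay.split()) for essay in essays]
    n = len(counts)
    mean_x = sum(counts) / n
    mean_y = sum(scores) / n

    sxx = sum((x - mean_x) ** 2 for x in counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(counts, scores))
    # If every essay has the same length, fall back to the mean score
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x

    lowest, highest = min(scores), max(scores)

    def predict(essay):
        raw = intercept + slope * len(essay.split())
        # Round to a whole score and keep it inside the observed scale
        return max(lowest, min(highest, round(raw)))

    return predict
```

A baseline like this uses no information about what an essay says, which is why its agreement with human readers serves as a floor against which the scoring engines can be compared.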

    The ASAP competition represents the first time these widely used programs have been compared to one another under controlled, objectively refereed conditions. The results demonstrate conclusively the viability of AI scoring in general and MI’s leadership in particular.

    About PEG Software

    Project Essay Grade (PEG) software is an automated essay scoring solution based on more than 40 years of research by Ellis Batten Page, Ed.D., whose pioneering work in the fields of artificial intelligence (AI), computational linguistics and natural language processing has distinguished him as the father of computer-based essay scoring. In 2003, MI acquired the PEG technology and continues to develop and extend Page’s pioneering research. Coupled with MI’s 25+ years of experience in performance assessment (human scoring of essays), today’s PEG software delivers valid and reliable scoring that is unrivaled in the industry.

    About Measurement Incorporated

    MI is an employee-owned company headquartered in Durham, North Carolina, and has been a leader in educational assessment since 1980. The company provides a wide range of educational and testing services and specializes in the development, scoring and reporting of large-scale educational tests.
