# Glossary and suggestions

This glossary provides definitions, ranges, and suggestions for the primary statistics used in EAC Visual Data. For a well-written, concise paper on point biserial correlations and p-values, we encourage you to read Seema Varma's "Preliminary Item Statistics Using Point-Biserial Correlations and P-Values."

##### Actual item scores

Actual item scores equal Possible item scores less the total number of unanswered (i.e., skipped) questions. In the example below, there are 139 actual item scores: 140 possible item scores less 1 skipped question.

##### Cronbach Alpha with Deletion

Cronbach Alpha with Deletion helps assess test question reliability.

How? It asks whether the exam as a whole is more reliable if you simply delete the question under review. The Cronbach Alpha with Deletion recomputes the exam's KR(20) with the question under review left out. If the exam as a whole is more reliable without it, there's probably something wrong with that question.

The Cronbach Alpha with Deletion generally ranges between 0.0 and +1.0, but it can fall below 0.0 with smaller sample sizes.

More important than its range is how the Cronbach Alpha with Deletion compares to the exam's KR(20). If a question's Cronbach Alpha with Deletion is greater than the exam's KR(20), it means the exam as a whole is more reliable without it. For example, take a look at Question No. 2 in the picture below. If the exam's KR(20) is 0.64, then we know Question No. 2 is "suspect" because its Cronbach Alpha with Deletion of 0.78 is greater than 0.64.

EAC suggestion: Look out for questions with a Cronbach Alpha with Deletion greater than the exam's KR(20). These questions decrease overall test reliability and should be considered suspect.
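The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not EAC's actual implementation: the function names `kr20` and `kr20_with_deletion`, the sample data, the use of population variance for total scores, and the choice to report 0.0 when there is no score spread are all assumptions made for the example.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of per-student rows of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])  # needs at least 2 items
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # assumption: population variance
    if total_variance == 0:
        return 0.0  # assumption: report 0.0 when scores have no spread
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


def kr20_with_deletion(responses, item):
    """Recompute KR-20 with one question (0-based index) left out."""
    reduced = [row[:item] + row[item + 1:] for row in responses]
    return kr20(reduced)


# Hypothetical data: 5 students x 5 questions. The last question is
# answered correctly mostly by the lower-scoring students.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

full = kr20(responses)
suspect = [
    j for j in range(len(responses[0]))
    if kr20_with_deletion(responses, j) > full
]
```

Here `suspect` collects the questions whose deletion raises the exam's KR(20), which is the condition the suggestion above tells you to look out for.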

##### Distractor point biserial correlation

To have confidence in a test question, we assess its reliability using a point biserial correlation and a Cronbach Alpha with Deletion among other statistics. If it turns out to be an unreliable question, then it would help to have additional information that might tell us why. That's where the distractor point biserial correlation comes in.

The distractor point biserial correlation digs deeper than the item statistics and measures the reliability of each answer choice presented to students.

How? It correlates whether students selected each answer choice with their scores on the test as a whole.

The driving assumption is simple: Students who score well on the test as a whole should on average select the correct answer choice for each question. Students who struggle on the test as a whole should on average select an incorrect answer choice for each question. If an answer choice deviates from this assumption, the distractor point biserial correlation lets us know.

The distractor point biserial correlation ranges from a low of -1.0 to a high of +1.0.

The closer a correct answer choice's distractor point biserial correlation is to +1.0 the more reliable the answer choice is considered because it discriminates well among students who mastered the test material and those who did not. This answer choice "works well."

By the same token, the closer an incorrect answer choice's distractor point biserial correlation is to -1.0 the more reliable the answer choice is considered because it discriminates well among students who did not master the test material and those who did.

EAC suggestion: Consider changing answer choices that aren't "working" as expected and also those that students don't select at all. If no student selected an answer choice, that answer choice isn't really a "distractor" after all.
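As a rough sketch of the idea, the Python below correlates a 0/1 "selected this choice" indicator with each student's total score, one answer choice at a time. The helper names, the sample data, and the decision to return 0.0 for a choice nobody selected are assumptions for illustration; EAC's own calculation may differ in detail.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no spread."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat "no variation" as no discrimination
    return sxy / sqrt(sxx * syy)


def distractor_point_biserials(choices, totals, options):
    """Correlate a 0/1 'picked this option' flag with total test scores."""
    return {
        opt: pearson([1 if c == opt else 0 for c in choices], totals)
        for opt in options
    }


# Hypothetical question with key "A": each student's selected choice
# and that student's score on the test as a whole.
choices = ["A", "A", "B", "C", "B"]
totals = [9, 8, 4, 3, 5]
options = ["A", "B", "C", "D"]

result = distractor_point_biserials(choices, totals, options)
never_selected = [opt for opt in options if opt not in choices]
```

In this made-up data the correct choice has a positive correlation and the selected incorrect choices have negative ones, while `never_selected` lists the choice that no student picked and that may be worth replacing.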

##### Highest score

The highest number of correct answers on any one test submission. The highest score may not correspond to Grade Center "points" unless each question is equal to 1 point.

##### KR(20)

The KR(20), or Kuder-Richardson Formula 20, measures overall test reliability.

It lets you know whether the exam as a whole discriminated among students who mastered the subject matter and those who did not.

The KR(20) generally ranges between 0.0 and +1.0, but it can fall below 0.0 with smaller sample sizes. The closer the KR(20) is to +1.0 the more reliable an exam is considered because its questions do a good job consistently discriminating among higher and lower performing students. A KR(20) of 0.0 means the exam questions didn't discriminate at all. Imagine a test where all 20 students answered all 40 questions correctly. The test didn't discriminate among any of them, and its KR(20) of 0.0 makes perfect sense.

EAC suggestion: The interpretation of the KR(20) depends on the purpose of the test. Most high stakes exams are intended to distinguish those students who have mastered the material from those who have not. For these, shoot for a KR(20) of +0.50 or higher. A KR(20) of less than +0.30 is considered poor no matter the sample size. If the purpose of the exam is to ensure that ALL students have mastered essential skills or concepts or the test is a "confidence builder" with intentionally easy questions, look for a KR(20) close to 0.00.
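For readers who want to see the calculation, here is a minimal Python sketch of the standard Kuder-Richardson Formula 20, KR(20) = k/(k-1) × (1 - Σpq / variance of total scores), where k is the number of questions, p is each question's proportion correct, and q = 1 - p. The function name, the sample data, the use of population variance, and returning 0.0 when every student has the same score are assumptions for illustration, not a description of EAC's code.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of per-student rows of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])  # needs at least 2 items
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # assumption: population variance
    if total_variance == 0:
        # Assumption: mirror the glossary's "everyone answered everything
        # correctly" case by reporting 0.0 instead of dividing by zero.
        return 0.0
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical data: 5 students x 4 questions, ordered from the
# strongest student to the weakest.
graded = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(graded)

# Every student answers every question correctly: no discrimination.
all_correct = [[1, 1, 1, 1] for _ in range(5)]
no_spread = kr20(all_correct)
```

With only a handful of students and questions the number is not meaningful in practice; the data is kept tiny so the arithmetic can be checked by hand.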

##### Lowest score

The lowest number of correct answers on any one test submission. The lowest score may not correspond to Grade Center "points" unless each question is equal to 1 point.

##### Questions

The total number of questions on the exam.

EAC suggestion: Shoot for at least 40 questions to get "good" reliability statistics.

##### p-Value

In the branch of statistics dealing with test reliability, and unlike other branches of statistics, the p-Value is a simple measure of question difficulty: the proportion of students who answered the question correctly. The p-Value ranges from a low of 0.0 to a high of +1.0.

The closer the p-Value is to 0.0 the more difficult the question. For example, a p-Value of 0.0 means that no student answered the question correctly and therefore it's a really hard question. If an item's p-Value is unexpectedly close to 0.0, be sure to check the answer key.

The closer the p-Value is to +1.0 the easier the question. For example, a p-Value of +1.0 means that every student answered the question correctly and therefore it's a really easy question.

EAC suggestion: On high stakes exams, shoot for p-Values between +0.50 and +0.85 for most test questions. A p-Value less than +0.50 may mean the question is too difficult or that the answer key is wrong, so double-check the key. A p-Value greater than +0.85 may mean the question is too easy or that most students have mastered that concept.
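A short Python sketch of the calculation and of the suggested range follows. The function names, the wording of the flags, and the sample data are hypothetical; only the +0.50 and +0.85 thresholds come from the suggestion above.

```python
def p_values(responses):
    """Proportion of students answering each question correctly."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[j] for row in responses) / n_students
        for j in range(n_items)
    ]


def flag(p, low=0.50, high=0.85):
    """Label a p-Value against the suggested +0.50 to +0.85 range."""
    if p < low:
        return "possibly too difficult; check the answer key"
    if p > high:
        return "possibly too easy"
    return "in suggested range"


# Hypothetical data: 5 students x 4 questions of 0/1 item scores.
graded = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
difficulty = p_values(graded)
labels = [flag(p) for p in difficulty]
```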

##### Point biserial correlation

The point biserial correlation measures item reliability.

How? It correlates student scores on one particular question with their scores on the test as a whole.

The driving assumption is simple: Students who score well on the test as a whole should on average score well on the question under review. Students who struggle on the test as a whole should on average struggle on the question under review. If a question deviates from this assumption (making it a "suspect" question), the point biserial correlation lets us know.

The point biserial correlation ranges from a low of -1.0 to a high of +1.0. The closer the point biserial correlation is to +1.0 the more reliable the question is considered because it discriminates well among students who mastered the test material and those who did not.

A point biserial correlation of 0.0 means the question didn't discriminate at all. Imagine a test where all 20 students answered Question 1 correctly. Since Question 1 doesn't discriminate among any of the students relative to how they performed on the rest of the test, its point biserial correlation of 0.0 makes perfect sense.

A negative point biserial correlation means that students who performed well on the test as a whole tended to miss the question under review and students who didn't perform as well on the test as a whole got it right. It's a red flag, and there are a number of possible things to check. Is the answer key correct? Is the question clearly worded? If it's multiple choice, are the choices too similar?

EAC suggestion: For high stakes exams intended to distinguish students who mastered the material from those who did not, shoot for questions with point biserial correlations greater than +0.30. They're considered very good items. Questions with point biserial correlations between +0.09 and +0.30 are considered acceptable to reasonably good. Questions with point biserial correlations less than +0.09 are considered poor.
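The correlation described in this entry can be sketched in Python as a Pearson correlation between a question's 0/1 scores and the students' total scores. The names and sample data are hypothetical, the total score here includes the question under review (following the "test as a whole" wording above), and returning 0.0 when a question has no variation is an assumption that mirrors the all-correct example in this entry.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no spread."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: no variation means no discrimination
    return sxy / sqrt(sxx * syy)


def point_biserial(responses, item):
    """Correlate one question's 0/1 scores with total test scores."""
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]  # total includes this item
    return pearson(item_scores, totals)


# Hypothetical data: 5 students x 5 questions. The last question is
# answered correctly mostly by the lower-scoring students.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
first_question = point_biserial(responses, 0)
last_question = point_biserial(responses, 4)

# A question every student answers correctly cannot discriminate.
everyone_right = point_biserial([[1, 1], [1, 0], [1, 1]], 0)
```

In this made-up data the first question has a positive correlation and the last question a negative one, which is the red flag described above.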

##### Possible item scores

The product of Scored responses × Questions. In the example below, there are 140 possible item scores: 14 Scored responses × 10 Questions.
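The arithmetic behind Possible item scores and Actual item scores, using the numbers from the glossary's examples, can be written out as follows. The variable names are made up for illustration.

```python
# Numbers taken from the glossary's examples.
scored_responses = 14   # test submissions that were scored
questions = 10          # questions on the exam
skipped = 1             # unanswered questions across all submissions

possible_item_scores = scored_responses * questions
actual_item_scores = possible_item_scores - skipped
```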