Educational assessment
From Wikipedia, the free encyclopedia
Educational assessment or educational evaluation[1] is the systematic process of documenting and using empirical data on knowledge, skills, attitudes, aptitudes, and beliefs to refine programs and improve student learning.[2] Assessment data can be obtained by examining student work directly to assess the achievement of learning outcomes, or from data from which one can make inferences about learning.[3] "Assessment" is often used interchangeably with "test," but assessment is not limited to tests.[4] Assessment can focus on the individual learner, the learning community (class, workshop, or other organized group of learners), a course, an academic program, the institution, or the educational system as a whole (a dimension known as granularity). The word "assessment" came into use in an educational context after the Second World War.[5]
As a continuous process, assessment establishes measurable student learning outcomes, provides sufficient learning opportunities to achieve these outcomes, implements a systematic way of gathering, analyzing, and interpreting evidence to determine how well student learning matches expectations, and uses the collected information to give feedback on the improvement of students' learning.[6] Assessment is an important aspect of the educational process, as it determines the level of students' accomplishments.[7]
The final purpose of assessment practices in education depends on the theoretical framework of practitioners and researchers, their assumptions and beliefs about the nature of the human mind, the origin of knowledge, and the process of learning.
Types
The term assessment generally refers to all activities teachers use to help students learn and gauge student progress.[8][9] Assessment can be divided for the sake of convenience using the following categorizations:
- Placement, formative, summative and diagnostic assessment
- Objective and subjective
- Referencing (criterion-referenced, norm-referenced, and ipsative (forced-choice))
- Informal and formal
- Internal and external
Placement, formative, summative, and diagnostic
Assessment is often divided into initial, formative, and summative categories to address different objectives of assessment practices.
(1) Placement assessment – Placement evaluation may be used to place students, according to prior achievement, level of knowledge, or personal characteristics, at the most appropriate point in an instructional sequence, in a unique instructional strategy, or with a suitable teacher.[10] It is conducted through placement testing, i.e., the tests that colleges and universities use to assess college readiness and place students into their initial classes. Placement evaluation, also referred to as pre-assessment, initial assessment, or threshold knowledge test (TKT), is conducted before instruction or intervention to establish a baseline from which individual student growth can be measured. This type of assessment is used to determine a student's skill level in the subject and can also help the teacher explain the material more effectively. These assessments are generally not graded.[11]
(2) Formative assessment – This is generally carried out throughout a course or project. Also referred to as "educative assessment," it is used to aid learning. In an educational setting, a formative assessment might be conducted by a teacher (or peer) or by the learner (e.g., through self-assessment[12][13]), providing feedback on a student's work, and would not necessarily be used for grading purposes. Formative assessments can take the form of diagnostic tests, standardized tests, quizzes, oral questions, or draft work. Formative assessments are conducted concurrently with instruction, and the results may count toward a grade. Formative assessment aims to determine whether students understand the instruction before a summative assessment is conducted.[11]
(3) Summative assessment – This is generally carried out at the end of a course or project. In an educational setting, summative assessments are typically used to assign students a course grade and are evaluative in nature. Summative assessments summarize what students have learned and determine whether they understand the subject matter well. This type of assessment is typically graded (e.g., pass/fail, 0–100) and can take the form of tests, exams, or projects. Summative assessments are ultimately used to determine whether a student has passed or failed a class. A criticism of summative assessments is that they are reductive, and that learners discover how well they have acquired knowledge too late for it to be of use.[11]
(4) Diagnostic assessment – Finally, diagnostic assessment focuses on the full range of difficulties that occurred during the learning process.
Jay McTighe and Ken O'Connor proposed seven practices for effective learning.[11] One is to share the criteria of the evaluation with students before the test; another is the use of pre-assessment to establish a student's skill levels before giving instruction. Providing ample feedback and encouragement is another of these practices.
Educational researcher Robert Stake[14] explains the difference between formative and summative assessment with the following analogy:
When the cook tastes the soup, that's formative. When the guests taste the soup, that's summative.[15]
Summative and formative assessment are often referred to in a learning context as assessment of learning and assessment for learning, respectively. Assessment of learning is generally summative, intended to measure learning outcomes and report them to students, parents, and administrators. Assessment of learning mostly occurs at the conclusion of a class, course, semester, or academic year, while assessment for learning is generally formative and used by teachers to consider teaching approaches and next steps for individual learners and the class.[16]
A common form of formative assessment is diagnostic assessment. Diagnostic assessment measures a student's current knowledge and skills to identify a suitable program of learning. Self-assessment is a form of diagnostic assessment in which students assess themselves.
Forward-looking assessment asks those being assessed to consider themselves in hypothetical future situations.[17]
Performance-based assessment is similar to summative assessment in that it focuses on achievement. It is often aligned with the standards-based education reform and outcomes-based education movement. Though ideally significantly different from a traditional multiple-choice test, performance-based assessments are most commonly associated with standards-based assessment, which uses free-form responses to standard questions scored by human scorers on a standards-based scale: meeting, falling below, or exceeding a performance standard rather than being ranked on a curve. A well-defined task is identified, and students are asked to create, produce, or do something, often in settings that involve real-world application of knowledge and skills. Proficiency is demonstrated by providing an extended response. Performance formats are further classified into products and performances. The performance may result in a product, such as a painting, portfolio, paper, or exhibition, or it may consist of a performance, such as a speech, athletic skill, musical recital, or reading.
Objective and subjective
Assessment (either summative or formative) is often categorized as either objective or subjective. Objective assessment is a form of questioning that has a single correct answer. Subjective assessment is a form of questioning that may have more than one correct answer (or more than one way of expressing the correct answer). There are various types of objective and subjective questions. Objective question types include true/false, multiple-choice, multiple-response, and matching questions, while subjective question types include extended-response and essay questions. Objective assessment is well-suited to the increasingly popular computerized or online assessment format.
Some have argued that the distinction between objective and subjective assessments is neither useful nor accurate because, in reality, there is no such thing as "objective" assessment. In fact, all assessments are created with inherent biases built into decisions about relevant subject matter and content, as well as cultural (class, ethnic, and gender) biases.[18]
Basis of comparison
Test results can be compared against an established criterion, or against the performance of other students, or against previous performance:
(5) Criterion-referenced assessment, typically using a criterion-referenced test, as the name implies, occurs when candidates are measured against defined (and objective) criteria. Criterion-referenced assessment is often, though not always, used to establish a person's competence (i.e., whether they can do something). The best-known example of criterion-referenced assessment is the driving test, in which learner drivers are measured against a range of explicit criteria (such as "Not endangering other road users").
(6) Norm-referenced assessment (colloquially known as "grading on the curve"), typically using a norm-referenced test, does not measure against defined criteria. This type of assessment is relative to the student body taking it; it is effectively a way of comparing students. The IQ test is the best-known example of norm-referenced assessment. Many entrance tests (to prestigious schools or universities) are norm-referenced, permitting a fixed proportion of students to pass ("passing" in this context means being accepted into the school or university rather than reaching an explicit level of ability). This means that standards may vary from year to year depending on the quality of the cohort, whereas criterion-referenced assessment does not vary (unless the criteria change).[19]
(7) Ipsative assessment is self-comparison, either within the same domain over time or across different domains within the same student.
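The contrast between criterion- and norm-referenced scoring can be sketched in a few lines of code (a hypothetical illustration; the names, cutoff, and quota are invented for the example, not drawn from any real test):

```python
scores = {"Ana": 82, "Ben": 74, "Chloe": 68, "Dev": 91, "Eli": 59}

# Criterion-referenced: passing depends only on meeting a fixed standard.
CUTOFF = 70
criterion_pass = {name for name, s in scores.items() if s >= CUTOFF}

# Norm-referenced: passing depends on rank within the cohort (top 40% here).
quota = round(len(scores) * 0.4)
ranked = sorted(scores, key=scores.get, reverse=True)
norm_pass = set(ranked[:quota])

print(sorted(criterion_pass))  # ['Ana', 'Ben', 'Dev']
print(sorted(norm_pass))       # ['Ana', 'Dev']
```

Note that the same cohort yields different pass sets under the two schemes, and the norm-referenced set would change with a stronger or weaker cohort even if an individual's score stayed the same.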
Informal and formal
Assessment can be either formal or informal. Formal assessment usually implies a written document, such as a test, quiz, or paper. A formal assessment is given a numerical score or grade based on student performance, whereas an informal assessment does not contribute to a student's final grade. An informal assessment usually occurs more casually and may include observation, inventories, checklists, rating scales, rubrics, performance and portfolio assessments, participation, peer and self-evaluation, and discussion.[20]
Internal and external
Internal assessment is set and marked by the school (i.e., by teachers), and students receive the mark and feedback regarding the assessment. External assessment is set by a governing body and is marked by unbiased personnel; some external assessments give much more limited feedback in their marking. However, in tests such as Australia's NAPLAN, students receive detailed feedback on the criteria they addressed, so that their teachers can assess and compare students' learning achievements and plan for the future.
Standards of quality
In general, high-quality assessments are those with high reliability and validity. Other general principles are practicality, authenticity, and washback.[21][22]
Reliability
Reliability relates to the consistency of an assessment. A reliable assessment is one that consistently achieves the same results with the same (or similar) cohort of students. Various factors affect reliability—including ambiguous questions, too many options within a question paper, vague marking instructions, and poorly trained markers. Traditionally, the reliability of an assessment is based on the following:
- Temporal stability: Performance on a test is comparable on two or more separate occasions.
- Form equivalence: Performance among examinees is equivalent on different forms of a test based on the same content.
- Internal consistency: Responses on a test are consistent across questions. For example: In a survey that asks respondents to rate attitudes toward technology, consistency would be expected in responses to the following questions:
- "I feel very negative about computers in general."
- "I enjoy using computers."[23]
The reliability of a measurement x can also be defined quantitatively as R_x = V_T / V_X, where R_x is the reliability of the observed (test) score x, and V_T and V_X are the variability in the 'true' (i.e., the candidate's innate performance) and measured test scores, respectively. R_x can range from 0 (completely unreliable) to 1 (completely reliable).
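To illustrate this ratio (a minimal simulation sketch, not drawn from any psychometric library; the score means and standard deviations are assumptions), observed scores can be modeled as true scores plus independent measurement error, so that V_X = V_T + V_error:

```python
import random

random.seed(42)

# Simulate candidates' 'true' scores and noisy observed test scores.
true_scores = [random.gauss(70, 10) for _ in range(100_000)]  # V_T about 100
observed = [t + random.gauss(0, 5) for t in true_scores]      # V_error about 25

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# R_x = V_T / V_X: the share of observed-score variance due to true scores.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # close to 100 / (100 + 25) = 0.8
```

Factors such as ambiguous questions or inconsistent marking add to the error variance V_error, which is precisely how they reduce R_x.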
There are four types of reliability: student-related, which can include personal problems, sickness, or fatigue; rater-related, which includes bias and subjectivity; test administration-related, which concerns the conditions of the test-taking process; and test-related, which is basically related to the nature of a test.[24][21][25]
Validity
Valid assessment measures what it is intended to measure. For example, it would not be valid to assess driving skills through a written test alone. A more valid way of assessing driving skills would be through a combination of tests that help determine what a driver knows, such as a written test of driving knowledge, and what a driver can do, such as a performance assessment of actual driving. Teachers frequently complain that some examinations do not properly assess the syllabus on which they are based; they are effectively questioning the validity of the exam.
Validity of an assessment is generally gauged through examination of evidence in the following categories:
- Content validity – Does the content of the test measure stated objectives?
- Criterion validity – Do scores correlate to an outside reference? (e.g., Do high scores on a 4th-grade reading test accurately predict reading skill in future grades?)
- Construct validity – Does the assessment correspond to other significant variables? (e.g., Do ESL students consistently perform differently on a writing exam than native English speakers?)[26]
- Consequential validity
- Face validity
A good assessment has both validity and reliability, along with the other quality attributes noted above, for a specific context and purpose. In practice, an assessment is rarely totally valid or totally reliable. A ruler that is marked wrongly will always give the same (wrong) measurements. It is very reliable, but not very valid. Asking random individuals to tell the time without looking at a clock or watch is sometimes used as an example of an assessment that is valid, but not reliable. The answers will vary between individuals, but the average answer is probably close to the actual time. In many fields, such as medical research, educational testing, and psychology, there will often be a trade-off between reliability and validity. A history test written for high validity will have many essay and fill-in-the-blank questions. It will be a good measure of mastery of the subject, but it will be difficult to score completely accurately. A history test designed for high reliability will be entirely multiple-choice. It isn't as good at measuring historical knowledge, but it can be scored with great precision. We may generalize from this. The more reliable our estimate is of what we purport to measure, the less certain we are that we are actually measuring that aspect of attainment.
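The ruler-and-clock contrast above can be simulated (a hypothetical sketch; the bias and noise magnitudes are invented for illustration):

```python
import random

random.seed(0)
TRUE_VALUE = 30.0  # the quantity every "assessment" below tries to measure

def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Mis-marked ruler: reliable (tiny spread) but not valid (consistent bias).
ruler = [TRUE_VALUE + 2.0 + random.gauss(0, 0.05) for _ in range(1000)]

# Individuals guessing without a clock: valid on average, but not reliable.
guesses = [TRUE_VALUE + random.gauss(0, 5.0) for _ in range(1000)]

print(f"ruler:   mean={mean(ruler):.1f}, spread={stdev(ruler):.2f}")
print(f"guesses: mean={mean(guesses):.1f}, spread={stdev(guesses):.2f}")
```

The ruler's readings cluster tightly around the wrong value, while the guesses scatter widely around the right one, mirroring the reliability/validity trade-off described above.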
It is well to distinguish between "subject-matter" validity and "predictive" validity. The former, widely used in education, predicts the score a student would get on a similar test with different questions. The latter, used widely in the workplace, predicts performance. Thus, a subject-matter-valid test of knowledge of driving rules is appropriate, while a predictive test would assess whether the potential driver could follow those rules.
Practicality
This principle refers to the time and cost constraints involved in constructing and administering an assessment instrument.[21] The test should be economical to deliver, its format should be simple to understand, and it should be solvable within a suitable time frame. It should also be simple to administer, and its scoring procedure should be specific and time-efficient.[25]
Authenticity
The assessment instrument is authentic when it is contextualized, contains natural language and meaningful, relevant, and interesting topics, and replicates real-world experiences.[21]
Washback
This principle concerns the consequences of assessment for teaching and learning in classrooms.[21] Washback can be positive and negative. Positive washback refers to the desired effects of a test, while negative washback refers to the negative consequences of a test. To achieve positive washback, instructional planning can be used.[27]
Evaluation standards
In the field of evaluation, and in particular educational evaluation in North America, the Joint Committee on Standards for Educational Evaluation has published three sets of standards for evaluations. The Personnel Evaluation Standards were published in 1988,[28] The Program Evaluation Standards (2nd edition) were published in 1994,[29] and The Student Evaluation Standards were published in 2003.[30]
Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing, and improving the identified evaluation method. Each standard has been placed in one of four fundamental categories to promote educational evaluations that are appropriate, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic. For example, the student accuracy standards help ensure that student evaluations will provide sound, accurate, and credible information about student learning and performance.
In the UK, an award in Training, Assessment and Quality Assurance (TAQA) is available to assist staff in learning and developing good practice in educational assessment across adult, further, and work-based education and training contexts.[31]
Grade inflation
Grade inflation (also known as grading leniency) is the general awarding of higher grades for the same quality of work over time, which devalues grades.[32] However, higher average grades in themselves do not prove grade inflation. For this to be grade inflation, it is necessary to demonstrate that the quality of work does not deserve the high grade.[32]
Due to grade inflation, standardized tests can have greater validity than unstandardized exam scores.[33] Recently increasing graduation rates can be partially attributed to grade inflation.[34]
Summary table of the main theoretical frameworks
The following table summarizes the main theoretical frameworks underpinning almost all theoretical and research work, as well as instructional practices in education (one of which, of course, is the practice of assessment). These different frameworks have given rise to interesting debates among scholars.
| Topics | Empiricism | Rationalism | Socioculturalism |
|---|---|---|---|
| Philosophical orientation | Hume: British empiricism | Kant, Descartes: Continental rationalism | Hegel, Marx: cultural dialectic |
| Metaphorical orientation | Mechanistic/Operation of a Machine or Computer | Organismic/Growth of a Plant | Contextualist/Examination of a Historical Event |
| Leading theorists | B. F. Skinner (behaviorism)/ Herb Simon, John Anderson, Robert Gagné: (cognitivism) | Jean Piaget/Robbie Case | Lev Vygotsky, Luria, Bruner/Alan Collins, Jim Greeno, Ann Brown, John Bransford |
| Nature of mind | Initially, a blank device that detects patterns in the world and operates on them. Qualitatively identical to lower animals, but quantitatively superior. | Organ that evolved to acquire knowledge by making sense of the world. Uniquely human, qualitatively different from lower animals. | Unique among species for developing language, tools, and education. |
| Nature of knowledge (epistemology) | Hierarchically organized associations present an accurate but incomplete representation of the world. Assumes that the sum of the components of knowledge is the same as the whole. Because knowledge is accurately represented by components, one who demonstrates those components is presumed to know. | General and/or specific cognitive and conceptual structures, constructed by the mind according to rational criteria. Essentially, these are higher-level structures constructed to assimilate new information to existing structures and to accommodate the structures to new information. Knowledge is represented by the ability to solve new problems. | Distributed across people, communities, and the physical environment. It represents the culture of a community that continues to create it. To know means to be attuned to the constraints and affordances of systems in which activity occurs. Knowledge is represented in the regularities of successful activity. |
| Nature of learning (the process by which knowledge is increased or modified) | Forming and strengthening cognitive or S-R associations. Generation of knowledge by (1) exposure to patterns, (2) efficiently recognizing and responding to patterns, and (3) recognizing patterns in other contexts. | Engaging in an active process of making sense of ("rationalizing") the environment. Mind applies existing structure to new experience to rationalize it. You don't really learn the components; you only learn the structures needed to handle them later. | Increasing ability to participate in a particular community of practice. Initiation into the life of a group, strengthening the ability to participate by becoming attuned to constraints and affordances. |
| Features of authentic assessment | Assess knowledge components. Focus on mastery of many components and fluency. Use psychometrics to standardize. | Assess extended performance on new problems. Credit varieties of excellence. | Assess participation in inquiry and social practices of learning (e.g., portfolios, observations). Students should participate in the assessment process. Assessments should be integrated into a larger environment. |
Controversy
Concerns over how best to apply assessment practices across public school systems have largely focused on the use of high-stakes and standardized tests, often used to gauge student progress, teacher quality, and school-, district-, or statewide educational success.
No Child Left Behind
For most researchers and practitioners, the question is not whether tests should be administered at all; there is a consensus that, when administered in useful ways, tests can provide information about student progress and curriculum implementation, as well as serve formative purposes for learners.[35] The real issue, then, is whether testing practices as currently implemented can provide these services for educators and students.
President Bush signed the No Child Left Behind Act (NCLB) on January 8, 2002. The NCLB Act reauthorized the Elementary and Secondary Education Act (ESEA) of 1965. President Johnson signed the ESEA to help fight the War on Poverty and helped fund elementary and secondary schools. President Johnson's goal was to emphasize equal access to education and establish high standards and accountability. The NCLB Act required states to develop assessments in basic skills. To receive federal school funding, states had to give these assessments to all students at select grade levels.
In the U.S., the No Child Left Behind Act mandates standardized testing nationwide. These tests align with the state curriculum and link teacher, student, district, and state accountability to their results. Proponents of NCLB argue that it offers a tangible method of gauging educational success, holding teachers and schools accountable for failing scores, and closing the achievement gap across class and ethnicity.[36]
Opponents of standardized testing dispute these claims, arguing that holding educators accountable for test results leads to "teaching to the test." Additionally, many argue that the focus on standardized testing encourages teachers to equip students with a narrow set of skills that enhance test performance without actually fostering a deeper understanding of subject matter or key principles within a knowledge domain.[37]
High-stakes testing
The assessments that have caused the most controversy in the U.S. are high school graduation examinations, which are used to deny diplomas to students who have attended high school for four years but cannot demonstrate on exams that they have learned the required material. Opponents say that no student who has put in four years of seat time should be denied a high school diploma merely for repeatedly failing a test, or even for not knowing the required material.[38][39][40]
High-stakes tests have been blamed for causing sickness and test anxiety in students and teachers, and for teachers narrowing the curriculum to what they believe will be tested. In an exercise designed to make children comfortable about testing, a Spokane, Washington newspaper published a picture of a monster that feeds on fear.[41] The published image is purportedly the response of a student who was asked to draw a picture of what she thought of the state assessment.
Other critics, such as Washington State University's Don Orlich, question the use of test items far beyond standard cognitive levels for students' age.[42]
Compared to portfolio assessments, simple multiple-choice tests are much less expensive, less prone to scorer disagreement, and can be scored quickly enough to be returned before the end of the school year. Standardized tests (in which all students take the same test under the same conditions) often use multiple-choice questions for these reasons. Orlich criticizes the use of expensive, holistically graded tests, rather than inexpensive multiple-choice "bubble tests", to measure the quality of both the system and individuals for very large numbers of students.[42] Other prominent critics of high-stakes testing include Fairtest and Alfie Kohn.
The use of IQ tests has been banned in some states for educational decisions, and norm-referenced tests, which rank students from "best" to "worst", have been criticized for bias against minorities. Most education officials support criterion-referenced tests (each student's score depends solely on whether he answered the questions correctly, regardless of whether his neighbors did better or worse) for making high-stakes decisions.
21st century assessment
It has been widely noted that with the emergence of social media and Web 2.0 technologies and mindsets, learning is increasingly collaborative, and knowledge is increasingly distributed among many members of a learning community. Traditional assessment practices, however, focus largely on the individual and fail to account for knowledge building and learning in context. As researchers in the field of assessment consider the cultural shifts arising from the emergence of a more participatory culture, they will need to find new methods for applying assessments to learners.[43]
Large-scale learning assessment
Large-scale learning assessments (LSLAs) are system-level assessments that provide a snapshot of learning achievement for a group of learners in a given year across a limited number of domains. They are often categorized as national or cross-national assessments and draw attention to issues related to levels of learning and determinants of learning, including teacher qualification; the quality of school environments; parental support and guidance; and social and emotional health in and outside schools.[44]
Assessment in a democratic school
Schools following the Sudbury model of democratic education do not perform assessments or offer evaluations, transcripts, or recommendations. They assert that they do not rate people and that school is not a judge; comparing students to each other, or to some set standard, is for them a violation of students' rights to privacy and self-determination. Students decide for themselves how to measure their progress as self-starting learners, through a process of self-evaluation: real lifelong learning and, they allege, the proper educational assessment for the 21st century.[45]
According to Sudbury schools, this policy does not harm their students as they move on to life outside the school. They admit, however, that it makes the process more difficult, and hold that such hardship is part of the students' learning to make their own way, set their own standards, and meet their own goals.
The no-grading and no-rating policy helps create an atmosphere free of competition among students or battles for adult approval, and encourages a positive, cooperative environment among the student body.[46]
The final stage of a Sudbury education, should the student choose to take it, is the graduation thesis. Each student writes about how they have prepared themselves for adulthood and for entering the community at large. This thesis is submitted to the Assembly for review. The final stage of the thesis process is an oral defense by the student, during which they open the floor to questions, challenges, and comments from all Assembly members. At the end, the Assembly votes by secret ballot on whether or not to award a diploma.[47]
Assessing ELL students
A major concern with the use of educational assessments is their overall validity, accuracy, and fairness when assessing English language learners (ELLs). The majority of assessments within the United States have normative standards based on English-speaking culture, which does not adequately represent ELL populations.[citation needed] Consequently, it would in many cases be inaccurate and inappropriate to draw conclusions from ELL students' normative scores. Research shows that most schools do not appropriately modify assessments to accommodate students from diverse cultural backgrounds.[citation needed] This has resulted in the over-referral of ELL students to special education, leading to their disproportionate representation in special education programs. Although some may see this inappropriate placement in special education as supportive and helpful, research has shown that inappropriately placed students actually regressed in their progress.[citation needed]
It is often necessary to use a translator to administer the assessment in an ELL student's native language; however, there are several issues with translating assessment items. One issue is that translations can frequently suggest a correct or expected response, changing the difficulty of the assessment item.[48] Additionally, the translation of assessment items can sometimes distort the original meaning of the item.[48] Finally, many translators are not qualified or properly trained to work with ELL students in an assessment situation.[citation needed] All of these factors compromise the validity and fairness of assessments, rendering the results unreliable. Nonverbal assessments are less discriminatory for ELL students; however, some still exhibit cultural biases in their items.[48]
When considering an ELL student for special education, the assessment team should integrate and interpret all of the information collected to ensure an unbiased conclusion.[48] The decision should be based on multidimensional sources of data, including teacher and parent interviews, as well as classroom observations.[48] Decisions should take the student's unique cultural, linguistic, and experiential backgrounds into consideration and should not be based strictly on assessment results.
Universal screening
Assessment can be associated with disparity when students from traditionally underrepresented groups are excluded from tests required for access to certain programs or opportunities, as is the case with gifted programs. One way to combat this disparity is universal screening, which involves testing all students (e.g., for giftedness) rather than testing only some based on teachers' or parents' recommendations. Universal screening results in large increases in traditionally underserved groups (such as Black, Hispanic, poor, female, and ELLs) identified for gifted programs, without the standards for identification being modified in any way.[49]
See also
- Academic equivalency evaluation
- Computer aided assessment
- Concept inventory
- Confidence-based learning accurately measures a learner's knowledge quality by measuring both the correctness of their knowledge and the learner's confidence in that knowledge.
- E-scape, a technology and approach that looks specifically at the assessment of creativity and collaboration.
- Educational aims and objectives
- Educational evaluation deals specifically with evaluation as it applies to an educational setting. As an example, it may be used in the No Child Left Behind (NCLB) government program instituted by the government of the U.S.
- Electronic portfolio is a personal digital record containing information such as a collection of artifacts or evidence demonstrating what one knows and can do.
- Evaluation is the process of looking at what is being assessed to make sure the right areas are being considered.
- Grading is the process of assigning a (possibly mutually exclusive) ranking to learners.
- Health impact assessment looks at the potential health impacts of policies, programs, and projects.
- Macabre constant is a theoretical bias in educational assessment
- Educational measurement is a process of assessment or an evaluation in which the objective is to quantify the level of attainment or competence within a specified domain. See the Rasch model for measurement for elaboration on the conceptual requirements of such processes, including those of grading and use of raw scores from assessments.
- Program evaluation is essentially a set of philosophies and techniques to determine if a program "works".
- Progress testing
- Psychometrics, the science of measuring psychological characteristics.
- Psychological testing
- Rubrics for assessment
- Science, technology, society and environment education
- Social impact assessment looks at the possible social impacts of proposed new infrastructure projects, natural resource projects, or development activities.
- Standardized testing is any test that is used across a variety of schools or other situations.
- Standards-based assessment
- Robert E. Stake is an educational researcher in the field of curriculum assessments.
- Writing assessment
- Metric fixation