The competent Computational Thinking test (cCTt): a valid, reliable and gender-fair test for longitudinal CT studies in grades 3-6

Laila El-Hamamsy, María Zapata-Cáceres, Estefanía Martín-Barroso, Francesco Mondada, Jessica Dehler Zufferey, Barbara Bruno, Marcos Román-González

TL;DR

This paper addresses the lack of longitudinal, developmentally appropriate CT assessments in primary school by validating the competent CT test (cCTt) across Grades 3–6. It combines Classical Test Theory and Item Response Theory analyses with measurement invariance and differential item functioning to establish validity, reliability, and gender fairness, while introducing normalised scoring and proficiency profiles to enable cross-grade comparability. Key contributions include grade-specific validity and reliability evidence, gender fairness confirmation, and the development of proficiency profiles and Wright maps to track cognitive maturation, plus normalised scoring to bridge cCTt with CTt and related instruments. The findings support using the cCTt for multi-year CT studies and provide practical tools for researchers, educators, and practitioners, while highlighting areas for item enrichment in higher grades and opportunities for cross-country validation and instrument transitions.

Abstract

The introduction of computing education into curricula worldwide requires multi-year assessments to evaluate the long-term impact on learning. However, no single Computational Thinking (CT) assessment spans primary school, and no group of CT assessments provides a means of transitioning between instruments. This study therefore investigated whether the competent CT test (cCTt) could evaluate learning reliably from grades 3 to 6 (ages 7-11) using data from 2709 students. The psychometric analysis employed Classical Test Theory, Item Response Theory, Measurement Invariance analyses which include Differential Item Functioning, normalised z-scoring, and PISA's methodology to establish proficiency levels. The findings indicate that the cCTt is valid, reliable and gender-fair for grades 3-6, although more complex items would be beneficial for grades 5-6. Grade-specific proficiency levels are provided to help tailor interventions, with a normalised scoring system to compare students across and between grades, and help establish transitions between instruments. To improve the utility of CT assessments among researchers, educators and practitioners, the findings emphasise the importance of i) developing and validating gender-fair, grade-specific, instruments aligned with students' cognitive maturation, and providing ii) proficiency levels, and iii) equivalency scales to transition between assessments. To conclude, the study provides insight into the design of longitudinal developmentally appropriate assessments and interventions.
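
As a rough illustration of the normalised z-scoring mentioned in the abstract, the sketch below standardises raw scores within each grade so that students can be compared relative to their own cohort. The function name `z_scores_by_grade`, the sample scores, and the use of the population standard deviation are assumptions made for illustration only; they are not the paper's procedure or data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_scores_by_grade(records):
    """Return (grade, z) pairs, standardising each score within its grade.

    records: list of (grade, raw_score) tuples.
    Assumption: population standard deviation; a sample SD would be an
    equally defensible choice. Every grade needs at least two distinct
    scores, otherwise the standard deviation is zero.
    """
    by_grade = defaultdict(list)
    for grade, score in records:
        by_grade[grade].append(score)
    # Per-grade mean and standard deviation.
    stats = {g: (mean(s), pstdev(s)) for g, s in by_grade.items()}
    return [(g, (x - stats[g][0]) / stats[g][1]) for g, x in records]


# Hypothetical raw scores, not taken from the study.
records = [(3, 10), (3, 14), (3, 18), (6, 18), (6, 20), (6, 22)]
for grade, z in z_scores_by_grade(records):
    print(f"grade {grade}: z = {z:+.2f}")
```

In this made-up sample, the same raw score of 18 lies above the grade 3 mean (14) but below the grade 6 mean (20), so its z-score changes sign between the two grades.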

Paper Structure (46 sections, 1 equation, 10 figures, 20 tables)

Figures (10)

  • Figure 1: Two main question formats of cCTt: grid (left) and canvas (right) (Figure taken from elhamamsy_competent_2022).
  • Figure 2: Distribution of scores across grades
  • Figure 3: IRT theory plots, taken from el-hamamsy_comparing_2022. (A - top left) Item Characteristic Curves (ICCs) for four items of equal discrimination (slope) and varying difficulty (using a 1-PL model on the cCTt test data). An item's difficulty ($b_i$) is the x-value ($\theta$) at which its ICC reaches a $y=.5$ probability of answering correctly; it expresses how many standard deviations from the mean the item's difficulty lies. Items to the left of the graph are considered easier while items on the right are considered harder. (B - top right) ICCs for four items (blue, red, green, purple) of varying difficulty and discrimination (using a 2-PL model on cCTt test data). In this example, the blue and red items are of equal difficulty $b_i$ ($y=0.5$ crossing) and relatively similar discrimination $a_i$, while the green and purple items are of equal difficulty and varying discrimination. As the blue item is steeper, it has a higher discrimination than the red, green and purple items. (C - bottom left) Item Information Curves (IICs) for the items in (B). The bell-shaped curves represent the amount of information $I_i$ provided by each of the test's items according to the student's ability $\theta$. These IICs vary in both maximum value (dependent on the item's discriminability, i.e. the ICC slope) and the x-value at which they reach it (the item's difficulty). Here, the blue and red curves share one difficulty and the green and purple curves share another (the two pairs reach their maxima around $x=-2$ and $x=0$ respectively), but the items differ in discriminability: the blue item discriminates more than the red, the red more than the green and the green more than the purple (steeper ICC slope, and higher maximum IIC value). (D - bottom right) Test Information Function (TIF, in blue) for the four items from Fig. 3(B) and (C), and the standard error of measurement (SEM, in red). The TIF is the sum of the IICs of the items from Fig. 3(B) and (C), and the SEM is the square root of the variance. The TIF shows that the instrument displays maximum information around $\theta=-2$ and provides more information in the low-medium ability range than in the high ability range. The SEM is at its lowest where the test provides the most information (maximum of the TIF) and at its highest where the test provides the least information (minimum of the TIF).
  • Figure 4: Classical Test Theory - Item Difficulty Index (left) and Point-biserial correlation (right). Please note that items with a difficulty index above the $.85$ threshold are considered too easy while items below the $.25$ threshold are considered too difficult. Similarly, items with a point-biserial correlation above the $.2$ threshold are considered acceptable while those above $.25$ are considered good.
  • Figure 5: 2-PL IRT Item Characteristic Curves per grade.
  • ...and 5 more figures