I have been working with GPTs since 2022 and with Claude since 2023. Today I use Claude many times a week as a collaborator and thought partner, including Claude Code inside an IDE and Cowork for agentic workflows. I am an AI consumer, a developer, a researcher, an academic instructor, and an AI trainer. I share this because what follows is a close look at Anthropic's AI Fluency Report: its methodology, its research standards, its data governance, what fluency actually measures, and the broader policy context education leaders should consider.

AI is moving fast. That is not new information. But in education, adoption has been slow, and the gap between what the technology can do and what institutions are preparing students for continues to widen. That gap is exactly where fluency lives.

Over the past year, through conversations with leadership and colleagues across multiple universities, I have been making the case for AI fluency adoption, not awareness training. Not literacy in the general sense of knowing AI exists. Fluency: the kind of understanding that moves people from knowing about AI to actually knowing how to think with it, evaluate it, direct it, and maintain meaningful oversight as it becomes more capable and more autonomous. I had been building toward that argument since 2021.

On February 23, 2026, Anthropic published their AI Fluency Report. I had been waiting for something like it, and I was excited to see the needle move in that direction. I read it carefully, alongside the primary documents it draws from. And I have concerns I think are worth naming in public, especially for the educators, administrators, and researchers who will be asked to take this report seriously.

How These Findings Fit the Existing Literature

The two main findings in this report are worth examining against what the learning sciences already tell us. The finding that iterative, back-and-forth engagement with AI is associated with higher rates of every other fluency behavior is consistent with decades of research on dialogic learning and productive struggle. Vygotsky’s work on the zone of proximal development, Bereiter and Scardamalia’s knowledge building framework, and Kapur’s (2016) research on productive failure all point in the same direction: sustained cognitive engagement produces better outcomes than transactional exchange. Anthropic has not discovered something new here. They have confirmed an existing pattern in a new context, which is genuinely useful but also means the finding should not surprise educators or be treated as novel evidence. Holmes, Bialik, and Fadel (2019) made this same case years ago when they outlined why educational AI measurement matters and situated it within a longer history of promises about technology in learning.

The finding that users become less evaluative when AI produces polished-looking outputs is equally consistent with existing research on automation bias, the well-documented tendency for people to reduce their own scrutiny and over-rely on automated outputs when something appears finished and functional (Parasuraman and Manzey, 2010). The fact that this pattern appears in AI interactions is important to document. The fact that it appears there should not be surprising to anyone familiar with the cognitive science literature. This concern is amplified by emerging evidence on cognitive offloading. Gerlich (2025) found that sustained AI tool use was associated with reduced engagement and diminished independent reasoning, while a Microsoft Research study (Lee et al., 2025) documented measurable declines in critical thinking among users who relied heavily on AI assistance. These findings give additional weight to the automation bias pattern and suggest the problem may extend beyond evaluation into deeper cognitive processes.

What this means is that the AI Fluency Report’s descriptive findings land in a place the literature already anticipated. The more pressing question is whether the measurement instrument is sophisticated enough to move beyond confirming what we already know and start generating genuinely new insight about how fluency develops over time. That is where the concerns begin. As Selwyn (2022) has argued in a broader critical perspective on AI in education, cautionary analysis of this kind is essential precisely because the stakes of getting educational AI measurement wrong are high and the enthusiasm often outpaces the evidence.

But a baseline is only useful if you trust the measurement behind it.

When Methods and Language Diverge

The report uses the word “correlation” throughout but never actually reports a correlation coefficient. There is no Pearson r, no Spearman’s rho, no chi-square statistic, nothing. What is actually being presented is subgroup prevalence comparisons: 8,424 conversations where users iterated and refined their work versus 1,406 conversations where they did not. Comparing percentages across two groups is cross-tabulation. It is a useful descriptive tool, but it is not correlation analysis (Dancey and Reidy, 2017; Mukaka, 2012), and that distinction matters considerably when making claims about relationships between behaviors.
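To make the distinction concrete, here is what the missing analysis would look like. This is a minimal sketch, not a reanalysis: the cell counts inside the 2x2 table are invented, and only the two group sizes (8,424 and 1,406) come from the report. For two binary variables, Pearson's r reduces to the phi coefficient, which falls directly out of the same cross-tabulation, so reporting it would have cost almost nothing.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: iteration present / absent. Columns: some other fluency behavior
# present / absent. Cell counts are hypothetical; only the row totals
# (8,424 iterating vs. 1,406 non-iterating conversations) are from the report.
table = np.array([[5200, 3224],
                  [400,  1006]])

# What the report does: compare prevalence across the two groups.
p_iter = table[0, 0] / table[0].sum()      # ~61.7% among iterating users
p_no_iter = table[1, 0] / table[1].sum()   # ~28.4% among the rest

# What the word "correlation" implies: an actual coefficient. For binary
# variables, Pearson's r is the phi coefficient, recoverable from chi-square.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
phi = np.sqrt(chi2 / table.sum())

print(f"prevalence comparison: {p_iter:.1%} vs {p_no_iter:.1%}")
print(f"chi-square = {chi2:.1f} (p = {p_value:.2g}), phi = {phi:.2f}")
```

A percentage gap and a coefficient are not interchangeable: the coefficient carries effect size, which is what readers need to judge whether a relationship is strong or merely present.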

This is not a technical oversight. The Clio paper, published in December 2024 and describing the privacy-preserving analysis tool used in this very study, reports a Pearson r of 0.71 when comparing its own classifier outputs. Anthropic knows how to run and report correlations. The decision not to do so in the February 2026 fluency index is a design choice, and education researchers will notice it.

There is also a logical problem in how the main finding is framed. Iteration and refinement is itself one of the 11 fluency behaviors being studied. So when the report finds that conversations showing iteration also show higher rates of all the other fluency behaviors, the relationship may simply reflect that some users are more engaged overall rather than something specific about iteration causing the other behaviors to appear. The analysis cannot distinguish between those two explanations, and the report does not try to.
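The confound is easy to demonstrate. The simulation below uses entirely invented parameters (only the sample size of 9,830 echoes the report): a single latent engagement level drives every behavior independently, with no direct link between iteration and anything else, and the prevalence comparison still reproduces the headline pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 9830  # matches the report's sample size; all parameters below are invented

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# A latent per-user engagement level drives each behavior independently.
engagement = rng.normal(size=n)
iterates = rng.random(n) < sigmoid(engagement - 0.5)
other_behavior = rng.random(n) < sigmoid(engagement - 1.0)

# The report's comparison reproduces the headline pattern anyway.
print(f"other behavior among iterating users:     {other_behavior[iterates].mean():.1%}")
print(f"other behavior among non-iterating users: {other_behavior[~iterates].mean():.1%}")
```

The gap appears even though iteration causes nothing in this simulation, which is precisely the ambiguity the report's analysis cannot resolve.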

Important Questions About Research Standards

This is a question worth raising carefully rather than definitively, because it touches on something the education and research community genuinely needs to work through together.

Anthropic could reasonably position the AI Fluency Report as internal product analytics. They are studying their own platform, using their own tool, for the purpose of understanding how people use it. Under that framing, it requires no institutional review board approval, no formal peer review, and no detailed methods documentation. That is a legitimate and defensible position for a company to take.

But this report also brings in named academic researchers to build the measurement framework, publishes its findings under the formal label of an Education Report, and makes claims that university administrators and instructional designers will cite when making decisions about curriculum, policy, and institutional AI adoption. The Belmont Report (National Commission, 1979), which remains the foundational document governing human subjects research in the United States, is clear that when the purpose of an activity is to generate generalizable knowledge about human behavior, that activity constitutes research and requires ethical oversight. The Common Rule (U.S. Department of Health and Human Services, 2018), the federal regulatory standard codifying those principles, applies to federally funded research and sets the professional standard that IRB review follows even when research is not federally funded. The American Educational Research Association’s Code of Ethics (AERA, 2011) applies those same principles specifically to education research.

The randomized controlled trial on coding skills, published in January 2026 alongside this report, studied 52 human participants. The report does not mention IRB approval or equivalent ethics review, and that is a question the field should sit with as AI education research expands.

The intent here is not to suggest bad faith. It is to highlight the need for clearer norms around what constitutes human subjects research in the context of AI studies, and around how those norms should evolve as companies increasingly publish education-focused research.

The Data Governance Question Institutions Are Not Asking

Here is a structural issue every university and school district adopting Claude needs to understand. When an institution purchases Claude at the enterprise or team level, the organization negotiates its own data terms. At that tier, Anthropic does not by default use conversations to train future models. But who at the institution actually made that decision? Was it the IT department? Legal counsel? The provost's office? Did faculty know that decision had been made on their behalf? Did students? An institution can opt an entire community in or out of training data use without individuals having any awareness that a choice was made for them.

At the individual consumer level, which is where many faculty and students were using Claude before their institutions adopted it formally, the default is different. Free and Pro tier users' conversations are used to train future models unless users actively opt out, a default detailed in Anthropic's privacy policy as updated in August 2025. The murkiest territory is the overlap between the two: when a university provides institutional access but a faculty member also holds a personal account, which conversations fall under which terms likely depends on how that user accessed Claude in a given session and may not be clear even to the user.

This is not an accusation directed at Anthropic. It is a governance question that higher education has not yet answered, and one that institutions and AI providers need to work through together with transparency. The February 2026 AI Fluency Report was built on 9,830 conversations drawn from a single week in January 2026. The users in those conversations did not know their interactions might contribute to a published behavioral study. Whether that is acceptable depends entirely on which category this report belongs to. That question remains open.

Who Exactly Is Being Studied

The tool used to conduct this analysis, Clio, is designed to protect user privacy by aggregating and anonymizing conversations so that individual users cannot be identified by researchers. That privacy protection is genuinely important and worth acknowledging. But it also means the same researchers cannot verify that any given user is a student or an educator. Classification by role depends entirely on inference from conversation content, and inference is not the same as identification.

Previous reports in this series studied university students in April 2025 and educators in August 2025, characterizing them as distinct populations worth studying separately. The educator report itself acknowledged this challenge directly, noting that the platform does not collect self-reported occupational data and that distinguishing student from educator conversations is genuinely difficult. The data infrastructure does not fully support the population distinctions being drawn, and that matters when findings are framed in educational terms and used to justify adoption in educational institutions.

What Does AI Fluency Actually Measure

The 4D AI Fluency Framework at the heart of this report identifies four distinct competency areas: Delegation, Description, Discernment, and Diligence. Each is described as an interconnected collection of skills, knowledge, insights, and values. The framework implies that these four competencies are meaningfully distinct and together provide a complete picture of AI fluency.

However, the fluency index report does not analyze findings by competency area. The visualizations color-code each of the 11 behaviors by competency, but the four-part structure ends there. There is no analysis of whether users are developing these competencies evenly, whether certain areas lag, or whether strengths and weaknesses cluster in meaningful ways.

A university administrator reading this report cannot determine whether their community is competent in Discernment but underdeveloped in Delegation. The framework promises that level of insight, but the report does not deliver it.

Examining the behavioral indicator chart shows that of the four competencies, three are directly observable in the measurement instrument. Description accounts for seven of the 11 observable behaviors, Delegation accounts for two, and Discernment accounts for three. Diligence, which captures taking responsibility for AI use and its consequences, cannot be directly measured through conversation data. Thirteen of the 24 framework behaviors are unobservable. For educators, this distinction is important, as it clarifies which aspects of fluency the instrument can assess and which require other approaches for evaluation.

I also wanted to understand how the 11 behaviors were assigned to their competency areas, and the report does not say. There is no methods section describing the assignment criteria, no inter-rater reliability statistics, and no factor analysis or structural modeling demonstrating that the behaviors cluster empirically. The assignments appear to rest on face validity alone, which is the weakest form of construct validity in measurement science (Messick, 1989; Kane, 2013).

If Discernment is genuinely distinct from Description, then the behaviors assigned to it should correlate more strongly with each other than with Description behaviors. Without testing this structure, it is unclear whether the four-quadrant framework reflects an underlying pattern in AI fluency or merely an appealing conceptual organization.
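The check itself is not exotic. Here is a minimal sketch of the discriminant-validity test the report omits, under stated assumptions: the data matrix and the behavior-to-competency mapping below are placeholders for illustration, not the report's actual coding scheme.

```python
import itertools
import numpy as np

def discriminant_check(X, labels):
    """Compare mean within- vs. between-competency item correlations.
    X: (n_conversations, n_behaviors) binary matrix.
    labels: competency assignment for each behavior column."""
    r = np.corrcoef(X, rowvar=False)  # Pearson r equals phi for binary items
    within, between = [], []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        (within if labels[i] == labels[j] else between).append(r[i, j])
    return np.mean(within), np.mean(between)

# Placeholder data and an arbitrary 11-behavior mapping for illustration.
rng = np.random.default_rng(0)
X = (rng.random((9830, 11)) < 0.4).astype(float)
labels = ["Description"] * 5 + ["Delegation"] * 3 + ["Discernment"] * 3

w, b = discriminant_check(X, labels)
print(f"mean within-competency r = {w:.3f}, between-competency r = {b:.3f}")
```

If the four-part structure is real, the within-competency average should clearly exceed the between-competency one. With random placeholder data like this, both sit near zero, which is exactly the null the report never tests against.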

The same concern applies to the 11 fluency behaviors themselves. Each is coded as present or absent in a conversation. If AI fluency is a genuine skill that develops over time, these behaviors should form a difficulty hierarchy with theoretically meaningful and empirically testable ordering (Bond and Fox, 2015; Wright and Masters, 1982).

Approaches like Rasch modeling were designed for exactly this purpose (Rasch, 1960). They can determine whether a set of items measures a coherent construct, identify the most diagnostic behaviors, and track growth over time. Binary prevalence percentages cannot answer any of these questions, so the measurement model does not support the developmental claims being made.
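To be concrete about what that would involve, here is a compressed sketch of a Rasch fit using joint maximum likelihood on simulated data. Everything here is illustrative: a real analysis would use the actual coded conversations and would prefer conditional or marginal maximum likelihood, but the structure of the question is the same.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)
n_persons, n_items = 400, 11  # small simulated sample, 11 behaviors

# Simulate data that genuinely follow a Rasch model.
theta_true = rng.normal(0, 1, n_persons)      # person "fluency" levels
b_true = np.linspace(-1.5, 1.5, n_items)      # item (behavior) difficulties
X = (rng.random((n_persons, n_items))
     < expit(theta_true[:, None] - b_true[None, :])).astype(float)

def neg_loglik(params):
    theta, b = params[:n_persons], params[n_persons:]
    # Clip probabilities for numerical stability in the log-likelihood.
    p = np.clip(expit(theta[:, None] - b[None, :]), 1e-9, 1 - 1e-9)
    return -(X * np.log(p) + (1 - X) * np.log(1 - p)).sum()

res = minimize(neg_loglik, np.zeros(n_persons + n_items), method="L-BFGS-B")
b_hat = res.x[n_persons:]
b_hat -= b_hat.mean()  # fix the model's location indeterminacy

print("estimated item difficulties:", np.round(b_hat, 2))
```

A genuine difficulty hierarchy would show up as a stable ordering of those estimated difficulties across samples, and standard fit statistics would flag any behavior that does not belong to the same underlying construct. No volume of binary prevalence percentages can provide either.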

This concern is not new. Long and Magerko (2020) and Ng and colleagues (2021) emphasize that AI literacy and fluency are multidimensional constructs that require careful definition and validated measurement instruments. Laupichler and colleagues (2022) reached similar conclusions for higher education specifically. The infrastructure for rigorous measurement exists; this report does not leverage it.

The Broader Policy Context

On February 24, 2026, one day after the AI Fluency Report published, Anthropic released Version 3.0 of its Responsible Scaling Policy. The RSP, which Anthropic introduced in September 2023, is the voluntary safety framework the company has used to govern how it develops increasingly capable AI models.

The new version is worth reading in full rather than relying on headlines about it. Anthropic is candid in the document about what worked and what did not, and that transparency deserves acknowledgment. They write that their theory of change for influencing the broader industry only partially succeeded, that pre-set capability thresholds proved far more ambiguous than anticipated, and that government action on AI safety has moved slowly while the political environment has shifted toward prioritizing competitiveness over safety. The original hard commitment not to release models unless adequate safety measures were already in place has been replaced with a Frontier Safety Roadmap, whose goals Anthropic describes explicitly as publicly declared but nonbinding targets.

This followed an earlier and less visible change. In May 2025, journalist Garrison Lovely reported that when Anthropic released Claude 4 Opus and classified it as ASL-3, meaning it had potential to assist in producing biological weapons, the company had quietly dropped a specific 2023 commitment to fully define ASL-4 standards before releasing any ASL-3 model. That change appeared in a redline PDF and was not publicly announced.

I want to be clear: I think Anthropic is making a genuine effort to navigate an extraordinarily difficult landscape, and the transparency in RSP v3 reflects that. But the education community should understand the full timeline. Between September 2023 and February 2026, a period during which Anthropic significantly expanded its education products and partnerships, the company’s safety commitments moved from hard pre-conditions to public but nonbinding goals. That context matters when institutions are being asked to make long-term adoption decisions based in part on Anthropic’s reputation as the responsible AI lab.

Where the Research Could Go Next

What would a stronger version of this work look like? Cohort analyses that track how fluency behaviors develop in the same users over time. Reported statistical methods that actually match the language used to describe the findings. Documented ethics review processes for work involving human subjects data. A structural validity analysis demonstrating that the four competency areas are empirically distinct and that the behavior assignments to those areas are defensible. A plan for assessing Diligence behaviors through qualitative or mixed methods, since they cannot be observed through conversation data alone. Measurement models that go beyond binary classification and can support the developmental claims being made. Findings broken down by competency area so practitioners can identify specific gaps. And explicit, plain-language guidance for institutions about what data decisions are being made on behalf of their communities.

I have been developing an evaluation framework that takes these measurement questions seriously, combining behavioral signals with more rigorous psychometric modeling. I will be sharing more on that through this site.

In the meantime, if you are a university administrator, faculty member, or instructional designer planning to use this report to inform decisions about AI adoption or curriculum, read the limitations section carefully. Then ask the questions that section does not ask for you.

As I mentioned, Claude is a part of most of my workflows. I think it is a remarkable tool and I want this line of research to succeed. That is precisely why I want the work to be held to the highest standards. The conversation about what AI fluency really means, how to measure it honestly, and who gets to define it is just getting started. It is a conversation education needs to lead, not simply receive.

References

American Educational Research Association. (2011). Code of ethics. https://www.aera.net/Portals/38/docs/About_AERA/CodeOfEthics(1).pdf

Anthropic. (2023, September). Anthropic’s responsible scaling policy. https://www.anthropic.com/news/anthropics-responsible-scaling-policy

Anthropic. (2024, December 12). Clio: A system for privacy-preserving insights into real-world AI use. https://www.anthropic.com/research/clio

Anthropic. (2025, April 8). Anthropic education report: How university students use Claude. https://www.anthropic.com/news/anthropic-education-report-how-university-students-use-claude

Anthropic. (2025, August 27). Anthropic education report: How educators use Claude. https://www.anthropic.com/news/anthropic-education-report-how-educators-use-claude

Anthropic. (2025). Introducing Claude for Education. https://www.anthropic.com/news/introducing-claude-for-education

Anthropic. (2025, August). Privacy policy. https://www.anthropic.com/legal/privacy

Anthropic. (2026, January 29). How AI assistance impacts the formation of coding skills. https://www.anthropic.com/research/AI-assistance-coding-skills

Anthropic. (2026, February 23). AI Fluency Report. https://www.anthropic.com/research/AI-fluency-index

Anthropic. (2026, February 24). Responsible Scaling Policy Version 3.0. https://www.anthropic.com/news/responsible-scaling-policy-v3

Bond, T. G., and Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). Routledge.

Dakan, R., Feller, J., and Anthropic. (2025). The AI Fluency Framework. Released under CC BY-NC-SA 4.0. https://www-cdn.anthropic.com/334975cdec18f744b4fa511dc8518bd8d119d29d.pdf

Dancey, C. P., and Reidy, J. (2017). Statistics without maths for psychology. Pearson.

Gerlich, M. (2025). AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), 6.

Holmes, W., Bialik, M., and Fadel, C. (2019). Artificial intelligence in education: Promises and implications for teaching and learning. Center for Curriculum Redesign.

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.

Kapur, M. (2016). Examining productive failure, productive success, unproductive failure, and unproductive success in learning. Educational Psychologist, 51(2), 289–299.

Laupichler, M. C., Aster, A., Schirch, J., and Raupach, T. (2022). Artificial intelligence literacy in higher and adult education: A scoping literature review. Computers and Education: Artificial Intelligence, 3, 100101.

Lee, H., et al. (2025). The impact of AI assistance on critical thinking and cognitive effort. Microsoft Research.

Long, D., and Magerko, B. (2020). What is AI literacy? Competencies and design considerations. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–16.

Lovely, G. (2025, May 23). Anthropic is quietly backpedalling on its safety commitments. LessWrong / Obsolete. https://www.lesswrong.com/posts/HE2WXbftEebdBLR9u/anthropic-is-quietly-backpedalling-on-its-safety-commitments

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education.

Mukaka, M. M. (2012). A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24(3), 69–71.

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1979). The Belmont Report. U.S. Department of Health and Human Services. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html

Ng, D. T. K., Leung, J. K. L., Chu, S. K. W., and Qiao, M. S. (2021). Conceptualizing AI literacy: An exploratory review. Computers and Education: Artificial Intelligence, 2, 100041.

Parasuraman, R., and Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.

Selwyn, N. (2022). The future of AI and education: Some cautionary notes. European Journal of Education, 57(4), 620–631.

TIME. (2026, February 25). Exclusive: Anthropic drops flagship safety pledge. https://time.com/7380854/exclusive-anthropic-drops-flagship-safety-pledge/

U.S. Department of Health and Human Services. (2018). Federal Policy for the Protection of Human Subjects (the Common Rule). 45 CFR 46.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. MESA Press.