
Before saying AI is replacing humans, maybe we should first ask a simpler question:
Can AI actually reason well when the pattern is new?
That is exactly why the Abstraction and Reasoning Corpus (ARC) is interesting.
ARC is a benchmark introduced in 2019 by François Chollet in his paper On the Measure of Intelligence. The idea behind it is simple but powerful: intelligence should not be measured only by performance on fixed tasks, because raw skill can be inflated by massive training data, memorization, or narrow optimization. Instead, Chollet argues that intelligence is closer to skill-acquisition efficiency, the ability to adapt, infer, and solve new problems from limited experience.

A similar gap shows up on Humanity's Last Exam (HLE): even the best models still score below 50% accuracy, while human domain experts land around 90%.
That is what makes ARC different from most AI benchmarks.
It does not test whether a model has seen the task before.
It tests whether it can look at a few examples, detect the hidden abstraction, and generalize it to a new case. ARC was built around a small set of human-like priors (objects, counting, geometry, symmetry, spatial relations), which makes it feel less like memorization and more like an actual IQ-style test for abstraction.
In that sense, ARC is not just another benchmark.
It is one of the clearest attempts to test whether AI can learn something new on the fly instead of just remixing what it already absorbed. Lab42, which maintains ARC-AGI resources, even describes it as an “IQ test for AI.” Their summary is brutal: humans solve around 80% of ARC tasks on average, while current algorithms reach around 31%.
And that gap is exactly why I got curious.
As someone who enjoys the occasional abstract reasoning puzzle, I wanted to see how today's top models would actually behave on this kind of challenge.
So I ran a small experiment with a few ARC-style puzzles.
And honestly?
The results were messy.
Sometimes the model got the answer right with nonsense reasoning.
Sometimes the reasoning looked clean, but the answer was wrong.
Sometimes it was clearly following a shallow shortcut instead of the actual abstraction.
Which is funny at first.
But it also points to something deeper:
Maybe the biggest weakness of current AI is not output generation. It is abstraction.
Out of 6 puzzles, Claude got 1/6 and ChatGPT got 3/6.
These are supposed to be advanced models. Yet on small abstract puzzles, the kind many average humans can solve by identifying the underlying concept, they still fail in weird and inconsistent ways.

Here is how one of the puzzles played out.
Human reasoning:
The center box is a mix of the surrounding shapes, with some rotation. So the correct answer is Option 2, the cross sign.
GPT 5.4:
It picked Option 1, because it searched for the shape that had not yet appeared outside and treated the puzzle like a missing inventory problem.
Claude Sonnet:
It also picked Option 1, but justified it with a column-based simplicity rule that did not really match the actual abstraction behind the puzzle.
That is what makes this interesting.
The issue is not only that models fail.
The issue is how they fail.
They often do not fail like a human who misunderstood the concept. They fail like a system that never truly formed the concept in the first place.
And this is exactly why benchmarks like ARC matter.
The Abstraction and Reasoning Corpus (ARC) was designed to test something deeper than memorization or benchmark gaming. The whole point is few-shot abstraction: seeing a tiny number of examples, discovering the hidden rule, and applying it to a new case. It is meant to probe core concepts like objectness, size, symmetry, containment, sameness, difference, and spatial structure.
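To make that concrete, here is a minimal sketch of what an ARC-style task looks like and what "discovering the hidden rule" means in practice. The grids and the mirror rule below are invented for illustration; real ARC tasks use the same train/test structure, but with richer transformations.

```python
# A toy ARC-style task, invented for illustration (not a real ARC task).
# Real ARC tasks are JSON objects with "train" and "test" lists of
# input/output grids, where each cell holds a color index 0-9.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0], [0, 4, 0]], "output": [[0, 0, 3], [0, 4, 0]]},
    ],
    "test": [{"input": [[5, 0, 0]]}],
}

def mirror_horizontally(grid):
    """The hidden rule of this toy task: flip every row left-to-right."""
    return [list(reversed(row)) for row in grid]

# "Solving" the task means inferring that rule from two examples
# and then applying it to an input the solver has never seen.
assert all(
    mirror_horizontally(pair["input"]) == pair["output"]
    for pair in task["train"]
)
print(mirror_horizontally(task["test"][0]["input"]))  # [[0, 0, 5]]
```

The hard part is obviously not applying the rule. It is inferring it from two or three examples when the space of plausible rules is enormous.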
That is a very different skill from “I have seen something statistically similar on the internet.”
And that difference matters a lot.
Because recent progress on ARC does not automatically mean models are suddenly reasoning like humans. One of the strongest critiques around the ARC race is that better scores may reflect better search, sampling, program generation, and test-time revision, rather than genuine abstraction. Melanie Mitchell describes methods that generate thousands of candidate programs, revise the promising ones, and then vote on the output. Impressive? Yes. But that looks a lot more like brute-force guided search than the compact concept formation humans usually rely on.
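To show what that looks like mechanically, here is a minimal sketch of a sample-and-vote loop, assuming the candidate "programs" are tiny grid transformations. The three hand-written candidates stand in for the thousands of generated programs real systems sample; this is not Mitchell's pipeline or any lab's actual code.

```python
from collections import Counter

def flip_h(grid):
    return [list(reversed(row)) for row in grid]

def flip_v(grid):
    return list(reversed(grid))

def identity(grid):
    return grid

# Placeholder candidate pool; real systems generate thousands of
# candidate programs rather than three hand-written ones.
CANDIDATES = [flip_h, flip_v, identity]

def solve_by_search(train_pairs, test_input):
    # Keep only candidates consistent with every training example...
    survivors = [
        f for f in CANDIDATES
        if all(f(pair["input"]) == pair["output"] for pair in train_pairs)
    ]
    # ...then let the surviving candidates vote on the test output.
    outputs = [tuple(map(tuple, f(test_input))) for f in survivors]
    if not outputs:
        return None
    winner, _ = Counter(outputs).most_common(1)[0]
    return [list(row) for row in winner]

train = [{"input": [[1, 0]], "output": [[0, 1]]}]
print(solve_by_search(train, [[2, 0, 0]]))  # [[0, 0, 2]]
# Nothing here "understands" mirroring; a program whose outputs happen
# to match the examples wins, and the majority decides ties.
```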
That point is important for software engineering too.
Because in real engineering work, you do not always get 5,000 attempts and a voting system.
You need to understand the business rule.
You need to notice the hidden constraint.
You need to see that two solutions both work, but one will create a maintenance nightmare three months later.
You need to understand the shape of the problem, not just generate possible answers until one sticks.
That is why “AI writes code” is still a very shallow way to evaluate whether AI can replace engineers.
A much better question is:
Does it understand the problem for the right reason?
And that is exactly what newer research is challenging.
A recent paper, Do AI Models Perform Human-like Abstract Reasoning Across Modalities?, makes a very useful distinction: output accuracy alone is not enough. The researchers did not just check whether the model got the final answer right. They also looked at the rule the model claimed to be using. Their conclusion is uncomfortable for the hype cycle: in text-based settings, top models often reach strong accuracy while relying on shallow shortcuts and unintended patterns much more than humans do. In visual settings, their final answers get much worse, even though some of their described rules still capture part of the intended abstraction. In other words: accuracy can overestimate reasoning in some settings and underestimate it in others.
This is a huge point.
Because it means a model can be:
- correct for the wrong reason
- wrong for the partly right reason
- impressive on the benchmark and still weak in actual abstraction
That is not a small technical detail.
That is the whole game.
In the ConceptARC analysis, researchers compared human-written rules with model-written rules and found that models often described problems in terms of rows, columns, pixel values, and local surface features, while humans more naturally used object-level concepts. They also introduced a helpful category: “correct but unintended.” That means the model gives a rule that works on the examples, but does not capture the actual abstraction the task was designed to test.
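As a hypothetical illustration of "correct but unintended": suppose the intended rule is "answer with the color of the largest object." A shallow rule like "answer with the most frequent color" reproduces the training examples just as well, and only diverges when a test grid breaks the coincidence. The grids and rules below are made up for this post, not taken from ConceptARC.

```python
from collections import Counter, deque

def most_common_color(grid):
    """Surface-level shortcut: the most frequent non-zero cell value."""
    counts = Counter(v for row in grid for v in row if v != 0)
    return counts.most_common(1)[0][0]

def largest_object_color(grid):
    """Object-level rule: color of the largest 4-connected same-color blob."""
    h, w = len(grid), len(grid[0])
    seen = set()
    best_size, best_color = 0, None
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 0 or (r, c) in seen:
                continue
            color, size, queue = grid[r][c], 0, deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_color = size, color
    return best_color

train = [
    [[1, 1, 0], [1, 1, 0], [0, 0, 2]],  # both rules answer 1
    [[3, 0, 0], [3, 0, 4], [3, 0, 0]],  # both rules answer 3
]
test = [[5, 0, 5], [0, 6, 0], [5, 6, 5]]  # intended answer: 6

# On the training grids the shortcut and the intended rule agree, so the
# shortcut looks correct; it is correct for an unintended reason, and the
# test grid exposes that.
assert all(most_common_color(g) == largest_object_color(g) for g in train)
print(most_common_color(test), largest_object_color(test))  # 5 6
```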
Honestly, that sounds very familiar to anyone who has worked with AI coding tools.
You ask for a feature.
The model gives you something that works.
You test it. It passes.
Then two days later you realize it solved the local output, not the real problem.
That is why I do not buy the simple claim that AI will replace Software Engineers.
AI can absolutely replace some tasks done by developers.
It can speed up implementation.
It can remove boring work.
It can help juniors move faster.
It can even make strong developers much more dangerous in a good way.
But replacing software engineering is another story.
Because software engineering is not just producing outputs. It is reasoning under ambiguity. It is deciding what matters. It is building mental models of systems, teams, constraints, users, and future consequences.
And right now, AI still looks much stronger at compressed output generation than at stable abstract understanding.
So no, I do not think AI models are ready to replace Software Engineers.
They are powerful assistants.
They are accelerators.
They are idea multipliers.
Sometimes they are even shockingly good.
But when the problem becomes abstract, multi-layered, and slightly unfamiliar, the illusion starts to crack.
And maybe that is the real takeaway:
The danger is not that AI is already reasoning like a great engineer.
The danger is that it often looks like it is.
References
- François Chollet, On the Measure of Intelligence
- Melanie Mitchell, Why the Abstraction and Reasoning Corpus is interesting and important for AI
- Melanie Mitchell, Do AI Reasoning Models Abstract and Reason Like Humans?
- Beger et al., Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
- Opiełka et al., Do Large Language Models Solve ARC Visual Analogies Like People Do?
- Lab42 ARC overview
- Humanity’s Last Exam
