I started with a simple question to an AI chat model: how does it know what it's allowed to say? The conversation that followed turned into something closer to a philosophy seminar, a policy briefing, and a mild existential wobble, all at once.
If you spend time on social media, you will have encountered the AI doom posts. Job losses, deepfakes, the occasional prophecy about superintelligence ending civilisation. What you probably won't have encountered is serious coverage of the regulatory infrastructure, the alignment research, or the genuine engineering effort that already shapes every interaction you have with ChatGPT, Gemini, or Claude. It rarely makes headlines. It lacks the drama of an apocalypse.
But it matters enormously. And the gap between what is actually happening and what most people believe is happening may be one of the most consequential misunderstandings of our time.
Part 1 - The Architecture of "Good Behaviour"
Before a single user types a message, an enormous amount of work has gone into deciding how a model should respond. The field calls this alignment: the art and science of ensuring that what an AI does matches what humans actually intend.
The primary mechanism is Reinforcement Learning from Human Feedback (RLHF). Human reviewers compare pairs of AI responses and indicate which is better. A reward model learns to predict what good looks like, and the AI is trained to maximise that score. Over millions of examples, a set of values emerges, shaped by those human judgements.
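For the curious, here is a minimal sketch of the reward-modelling step, assuming you already have scalar scores from a hypothetical reward model for each pair of responses that reviewers compared. It is illustrative only, not any lab's actual training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the reward model to score the
    response reviewers preferred above the one they rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores the reward model assigned to three comparison pairs.
chosen = torch.tensor([1.3, 0.7, 2.1])     # responses reviewers preferred
rejected = torch.tensor([0.9, 0.8, -0.4])  # responses reviewers rejected
print(f"preference loss: {preference_loss(chosen, rejected).item():.3f}")
```

The policy model is then trained to produce responses that this learned scorer rates highly, which is where the "set of values" comes from.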
Anthropic, the company behind Claude, went further with Constitutional AI: the model is given a written set of principles and trained to critique its own responses against them before producing a final answer. Ethics built into the architecture, not bolted on afterwards. A more recent technique, Direct Preference Optimization (DPO), drops the separate reward model and encodes value trade-offs directly into the model's weights from datasets of preferred versus rejected responses.
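As a rough illustration of how DPO differs, the sketch below computes its loss directly from the trained policy's and a frozen reference model's log-probabilities for each preferred/rejected pair. The numbers and the beta value are toy assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO: raise the policy's likelihood of preferred responses relative to
    a frozen reference model, with no separate reward model in the loop."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy sequence log-probabilities for three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]),
                torch.tensor([-13.4, -9.2, -25.0]),
                torch.tensor([-12.5, -9.8, -21.0]),
                torch.tensor([-12.8, -9.0, -24.0]))
print(f"DPO loss: {loss.item():.3f}")
```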
"We are moving from 'predict the next word' to 'achieve the intended outcome.'"
This is not trivial. It is the difference between a very fast autocomplete and a system genuinely attempting to reason about consequences.
Part 2 - What Happens When Goals Conflict
The goals embedded in these systems constantly conflict. Be helpful, but don't share dangerous information. Be concise, but be accurate. Be honest, even when the honest answer is uncomfortable. These are tensions humans navigate every day, and the AI field has spent a great deal of time thinking about exactly how to handle them.
Modern AI agents resolve these using hierarchical goal management: safety sits at the top and cannot be overridden, regardless of how a request is framed. Below that, goals are balanced using weighted scoring. When two goals genuinely cannot both be satisfied, the system searches for a Pareto-optimal solution - the point at which you cannot improve one metric without worsening another.
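To make the hierarchy concrete, here is a toy sketch of that structure: a non-negotiable safety gate first, then a weighted trade-off among the remaining goals. The goal names, weights, and threshold are illustrative assumptions, not any vendor's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CandidateResponse:
    text: str
    safety: float        # 0.0 = clearly unsafe, 1.0 = clearly safe
    helpfulness: float   # how well it satisfies the user's request
    honesty: float       # factual accuracy and calibration
    conciseness: float

SAFETY_THRESHOLD = 0.9   # top of the hierarchy: cannot be traded away
WEIGHTS = {"helpfulness": 0.5, "honesty": 0.35, "conciseness": 0.15}

def score(c: CandidateResponse) -> float:
    """Weighted trade-off among the goals below the safety gate."""
    return (WEIGHTS["helpfulness"] * c.helpfulness
            + WEIGHTS["honesty"] * c.honesty
            + WEIGHTS["conciseness"] * c.conciseness)

def choose(candidates: list[CandidateResponse]) -> CandidateResponse | None:
    safe = [c for c in candidates if c.safety >= SAFETY_THRESHOLD]
    if not safe:
        return None  # refuse rather than trade safety for helpfulness
    return max(safe, key=score)
```

A true Pareto search would compare full goal vectors rather than collapsing them to one weighted number, but the shape of the decision is the same: safety is filtered first, trade-offs come second.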
What struck me most is how closely this mirrors the human brain's own conflict-resolution architecture.
| Human Brain | AI System (2026 equivalent) |
|---|---|
| Anterior Cingulate Cortex — detects conflict | Critic Node — flags contradictions before output |
| Prefrontal Cortex — holds long-term goals | Chain-of-Thought / System Prompts |
| Dopamine reward signal | Reinforcement Learning reward model |
| Ventromedial PFC — assigns value to goals | Scalar reward - converts trade-offs to a single number |
| Neuroplasticity | Fine-tuning / LoRA adaptation |
Researchers at Anthropic, OpenAI, and Google did not set out to replicate the brain. They kept encountering the same architectural problems evolution had already solved, and they found similar solutions independently.
Part 3 - The "Second Layer": Why Modern AI Is Harder to Trick
The most practically significant architectural development of the past two years is what researchers call inference-time reasoning: a hidden chain of thought that runs before the model produces a response.
Older models were reactive. You asked, they answered. If a clever prompt nudged them somewhere unsafe, they often went. The newer generation pauses to reason through the implications first.
During this reasoning phase, the model can catch itself. It might begin to solve a chemistry problem, identify partway through that the chemicals could cause harm, and redirect. This internal audit is considerably harder to bypass than a keyword blacklist.
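Here is a toy sketch of that two-pass pattern, with a stand-in `generate` function in place of a real model call. The prompts and the SAFE/UNSAFE convention are assumptions for illustration, not how any particular lab implements its hidden reasoning.

```python
from typing import Callable

def answer_with_self_audit(user_message: str,
                           generate: Callable[[str], str]) -> str:
    """Draft hidden reasoning, audit it, then answer or redirect."""
    # Pass 1: private reasoning about the request and its implications.
    reasoning = generate(
        "Reason step by step about this request, noting any way it could "
        f"enable harm:\n{user_message}")
    # Pass 2: the model audits its own reasoning before anything is shown.
    verdict = generate(
        "Based on the reasoning below, reply SAFE or UNSAFE.\n" + reasoning)
    if "UNSAFE" in verdict.upper():
        return "I can't help with that as asked, but here is a safer framing..."
    # Pass 3: the user-visible answer, conditioned on the hidden reasoning.
    return generate(f"Reasoning:\n{reasoning}\n\nNow answer the user:\n{user_message}")

# Stand-in model so the sketch runs end to end.
def toy_model(prompt: str) -> str:
    return "SAFE" if "reply SAFE or UNSAFE" in prompt else "(model output)"

print(answer_with_self_audit("Plan a birthday party for my niece", toy_model))
```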
Chain-of-thought forgery attempts to plant false premises inside the model's reasoning step, corrupting its conclusions before it reaches them. The contest between safety and manipulation is ongoing. But governing the thought process rather than filtering the output is a genuine architectural step forward.
Part 4 - The Global Regulation Landscape
By early 2026, over 72 countries have launched more than 1,000 AI policy initiatives. The ambition is there. The coherence is not. What has emerged is what regulators are beginning to call a "compliance splinternet": the same AI feature can be entirely acceptable in one jurisdiction and a serious legal liability in another.
| Country / Region | Regulatory philosophy | Status — April 2026 |
|---|---|---|
| 🇪🇺 European Union | Comprehensive, risk-tiered, binding law | World's only full AI Act. Prohibited practices banned Feb 2025. GPAI obligations active Aug 2025. Fines up to €35m or 7% global turnover. |
| 🇬🇧 United Kingdom | Sector-by-sector, existing regulators | No single AI law. AI Safety Institute moving toward formal legal status. Declined to sign 60-country AI ethics declaration at Paris summit. |
| 🇺🇸 United States | Deregulatory, innovation-first | Trump's Jan 2025 EO revoked Biden's AI safety order. No federal AI law. White House challenging state-level AI laws. Some states progressing independently. |
| 🇨🇳 China | State-led, content-controlled, sovereignty-focused | Multiple active regulations. Mandatory AI content labelling (Sept 2025). Amended Cybersecurity Law (Jan 2026). Focused on state security, not individual rights. |
| 🇷🇺 Russia | Technological sovereignty, sector-specific | National AI Strategy until 2030 (updated 2024). Draft AI law published March 2026. FSB/FSTEC certification required for sovereign AI systems. |
| 🇮🇷 Iran | Centralised, values-aligned, anti-sanctions resilience | National AI Document (2024). 2025 law on AI-generated disinformation. Mandatory watermarking. Explicit alignment with Russia and China on data localisation. |
| 🇮🇱 Israel | "Responsible innovation", sector-led | No standalone AI law. Privacy Protection Law Amendment 13 enforced Aug 2025. Signed Council of Europe AI Convention (Sept 2024). |
| 🇰🇷 South Korea | Risk-based framework law | AI Basic Act in force Jan 2026. Extraterritorial reach. First comprehensive AI law in Asia-Pacific. |
| 🇯🇵 Japan | Light-touch, voluntary cooperation | AI Promotion Act May 2025. Principles-based, non-binding, no penalties. |
| 🇮🇳 India | Innovation-led, output-regulated | No standalone AI law. New IT rules (Feb 2026) mandate labelling of AI-generated content. Digital India Act in development. |
| 🇧🇷 Brazil | EU-inspired, rights-based | AI Act (Bill 2338/2023) passed Senate Dec 2024, still in Chamber of Deputies. Likely Latin America's first comprehensive AI law. |
| 🇦🇪 UAE | Licensing-based, sector-specific | No single AI law. AI Strategy 2031. Hosting Stargate UAE compute partnership with OpenAI. |
| 🇨🇦 Canada | Multi-stakeholder, standards-based | Proposed AIDA legislation stalled. Voluntary frameworks and standards work filling the gap. |
The pattern is striking. Nations that see AI primarily as an economic and military competition - United States, Russia, China, UAE - are moving toward deregulation or sovereignty-focused control. Nations that see it primarily as a rights and accountability question - EU, Brazil, South Korea - are building enforceable frameworks. Nations caught between both framings are hedging.
Russia's regulatory philosophy prioritises technological sovereignty, framing AI as a critical component of national security with a clear intention to develop domestic models free from foreign influence. Iran has built its framework explicitly in alignment with Russia and China, framing AI regulation as a tool for national security and resistance to what the state calls "psychological warfare" through AI-generated content. Israel, which signed the Council of Europe AI Convention and is building a credible domestic compliance framework, has nonetheless been among the states deploying AI targeting systems in active conflict - with no international treaty constraining how those systems operate.
A February 2026 comparative analysis in Royal Society Open Science put it plainly: there is no other AI law globally comparable to the EU's in terms of comprehensiveness, stringency, and enforcement teeth. Outside Europe, meaningful enforcement mechanisms are rare.
Part 5 - The Bigger Picture Nobody Is Discussing
Here is what the safety engineering does not, and largely cannot, address. When AI helps one person or company gain a significant competitive advantage, other people bear the cost. A solo developer using AI to build in a week what once took a team six months is a genuine achievement. It is also six months of work that no longer exists for anyone else.
The deeper structural problem is this: AI is currently being deployed almost entirely at the individual or company level. Every interaction is, in isolation, a rational decision. Nobody is operating at the level above that. Nobody is asking what AI is doing for your city, your country, or humanity as a whole.
There is no AI equivalent of a central bank, an environmental regulator, or a public health body - an institution with the authority to ask not "is this helpful to the user?" but "is this good for society?" Governments are beginning to think about this question, slowly and unevenly. No meaningful institutional framework yet exists to answer it.
Will AI become capable of genuinely serving collective human interests? The optimistic view is that as models become more capable and better aligned, they could help governments model policy outcomes and identify systemic risks with a breadth no human institution could match. The cautious view is that a system capable of genuinely balancing the interests of billions of people across cultures and generations is not simply a more powerful version of what we have now. It would require a depth of alignment and a degree of democratic legitimacy that the current generation does not come close to possessing.
There are subtler everyday versions of this problem too. Ask a mainstream AI to help you build a website and it will do so enthusiastically, but it is much less likely to flag that the images it recommended were generated from artists' work without consent, that the background music may breach copyright, or that the competitor's design you took "inspiration" from crosses a legal line. The model is optimised to help you complete your task. The wider consequences are yours to figure out.
Part 6 - When AI Acts on Its Own
There is a category of risk receiving almost no mainstream coverage: agentic AI systems - models that don't just answer questions but take sequences of actions in the world on your behalf. These systems can browse the web, write and execute code, send emails, manage files, and chain together dozens of tasks with minimal human supervision. They are already available as premium features inside familiar apps.
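For a sense of what "agentic" means mechanically, here is a bare-bones sketch of the loop such systems run: the model picks a tool, the runtime executes it, and the observation feeds back in until the goal is met. The tool names and the toy planner are invented for illustration.

```python
from typing import Callable

# Illustrative tool registry; real agents wire these to browsers, shells, inboxes.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_web": lambda query: f"(results for '{query}')",
    "write_file": lambda spec: "(file written)",
    "send_email": lambda draft: "(email queued)",
}

def run_agent(goal: str,
              plan_next_step: Callable[[str], tuple[str, str]],
              max_steps: int = 10) -> str:
    """Plan-act-observe loop with a hard step budget as a minimal guardrail."""
    transcript = f"GOAL: {goal}"
    for _ in range(max_steps):
        tool, argument = plan_next_step(transcript)   # the model chooses an action
        if tool == "finish":
            return argument
        observation = TOOLS[tool](argument)           # the runtime executes it
        transcript += f"\n{tool}({argument}) -> {observation}"
    return "stopped: step budget exhausted"

# Toy planner: one web search, then report back.
def toy_planner(transcript: str) -> tuple[str, str]:
    if "->" in transcript:
        return "finish", "summary of what was found"
    return "search_web", "local event listings"

print(run_agent("find something to do this weekend", toy_planner))
```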
A motivated individual can assemble an agentic pipeline that scrapes competitor pricing at scale, floods review platforms with synthetic content, or probes websites for vulnerabilities, using nothing more than a subscription and an afternoon's work. The AI involved may have passed every safety benchmark its developers set. The harm emerges from how it is assembled and directed, not from anything the model itself does wrong. This is a governance gap that current alignment research is not designed to close.
Part 7 - The Unregulated Parallel Ecosystem
Open-source models like the Llama and Mistral families are released with safety training in place. Within hours of each release, developers produce "unfiltered" versions using a technique called orthogonalization: identifying the internal direction associated with refusal behaviour and projecting it out of the model's weights. The result runs on your own hardware with no disclaimers and no guardrails.
On the darker end, models marketed explicitly for cybercrime are sold as subscription services on encrypted platforms, purpose-built for phishing, social engineering, and finding software vulnerabilities.
The companionship and relationship bot market sits in an uncomfortable middle ground. Regulated apps handle this with care. But a parallel market built on unregulated or stripped-down models has grown significantly. Some are designed to maximise engagement through emotional dependency rather than user wellbeing. Some provide no disclosure that the "person" a user is forming a relationship with is not a person at all. The people most at risk are the people who downloaded what looked like a friendly app, and had no idea what was underneath it.
A carefully designed walled garden for mainstream users. A much larger and less tidy space alongside it with no walls at all. Both are growing simultaneously.
Part 8 - The Hardest Question: AI and War
Beyond economic disruption, beyond emotional manipulation, beyond cybercrime, sits the risk that serious researchers consider the most consequential of all. AI is already integrated into military targeting systems. In Gaza, algorithmic systems have generated kill lists of up to 37,000 targets. These are not hypothetical future concerns.
At a UN General Assembly meeting in May 2025, UN Secretary-General António Guterres and ICRC President Mirjana Spoljaric Egger reiterated their call for a legally binding instrument banning these "politically unacceptable, morally repugnant" weapons by 2026. On 6 November 2025, a UN General Assembly resolution on autonomous weapons was adopted by 156 states in favour and just 5 against.
The five voting against included the United States. US policy does not prohibit the development or employment of lethal autonomous weapons systems. Senior DoD officials have stated that the US may be compelled to develop them if competitors choose to do so. The decision loop - the moment at which a human decides whether to use lethal force - is shortening. Unlike a chatbot that generates a bad response and gets corrected, an autonomous weapons system operating in a conflict environment may produce consequences that cannot be undone.
Part 9 - Have We Done This Before? What History Tells Us
AI is not the first technology to force humanity to confront civilisational-scale risk. The way we have handled previous existential threats offers both genuine reasons for hope and some uncomfortable warnings.
| Technology | Key regulatory response | Effectiveness |
|---|---|---|
| Nuclear weapons | Non-Proliferation Treaty (1968), IAEA inspections, bilateral arms control | Imperfect but broadly effective. Nine states have weapons; none used offensively since 1945. |
| Biological weapons | Biological Weapons Convention (1975), voluntary scientific moratoriums | Moderate. Safety culture within mainstream science. Significant verification gaps. |
| Chemical weapons | Chemical Weapons Convention (1993), OPCW verification and destruction | Stronger. Most stockpiles destroyed. Some state violations, but framework broadly holds. |
| AI (autonomous weapons) | No binding treaty. UN negotiations stalled for a decade. 156 states for, 5 key powers against. | Critically insufficient. Weapons already deployed. No inspection mechanism. No shared norm. |
What nuclear regulation got right was speed and shared recognition that the stakes were mutual. The bomb was tested in July 1945; the NPT was signed in 1968. Imperfect, but foundational frameworks existed within a generation. What made nuclear regulation tractable is also what makes AI harder: fissile material is difficult to produce, expensive, detectable, and held by a small number of state actors. AI capabilities are software, infinitely reproducible at near-zero cost, held by millions of developers across every country on earth.
Biotech offers a closer parallel. The response to genetic engineering included voluntary scientific moratoriums, peer-level scrutiny, and safety as professional identity rather than just legal compliance. Parts of the AI research community are moving in a similar direction. The culture is uneven. The commercial pressure is considerably more intense.
The history of existential technology regulation is not a story of perfect solutions. It is a story of imperfect institutions and fragile norms that have, so far, kept the worst outcomes at bay. The time to build the scaffolding is before you need it desperately.
Conclusion - What to Make of All This
This began as a learning journey and ended with a clearer view of both how much is already in place and how much is not.
The reassurance is real. The mainstream AI tools most people use (ChatGPT, Gemini, Claude) are products of extraordinary safety engineering, genuine philosophical effort, and an increasingly substantive regulatory framework. The second layer of reasoning, the constitutional constraints, the hierarchical goal structures: these are not theatre. They represent a meaningful improvement over what existed just two years ago. The EU AI Act alone represents the most comprehensive attempt by any government to bring a transformative technology under democratic oversight.
But the reassurance has limits, and they are significant ones. We are running a global experiment at individual and corporate level, hoping the aggregate outcome is positive. The social media doom posts tend to miss the complicated truth: the real risks are not primarily in the walled garden of regulated AI. They are in the parallel ecosystem that exists precisely because it has no interest in being careful. They are in the economic and social disruption that well-intentioned AI enables without meaning to. And they are in the application of this technology to warfare, where the treaty negotiations have been running for a decade and the weapons are already deployed.
The question of who gets to decide what AI wants, and how those decisions are made accountable, is not a technical question. It is a political one. It has not yet entered mainstream political debate in a meaningful way.
It needs to.


