The China Mail - AI is learning to lie, scheme, and threaten its creators

AI is learning to lie, scheme, and threaten its creators / Photo: © AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.


In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" - appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around."

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Z.Huang--ThChM