The China Mail - AI is learning to lie, scheme, and threaten its creators

AI is learning to lie, scheme, and threaten its creators / Photo: © AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.


In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" - appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around."

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Z.Huang--ThChM