The China Mail - AI is learning to lie, scheme, and threaten its creators


AI is learning to lie, scheme, and threaten its creators

Photo: © AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step by step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Z.Huang--ThChM