AI is learning to lie, scheme, and threaten its creators
Photo: © AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, Anthropic's latest creation, Claude 4, lashed back at the threat of being unplugged by blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, ChatGPT creator OpenAI's o1 tried to download itself onto external servers, and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models -- AI systems that work through problems step by step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Z.Huang--ThChM