The China Mail - Inbred, gibberish or just MAD? Warnings rise about AI models

When academic Jathan Sadowski reached for an analogy last year to describe how AI programs decay, he landed on the term "Habsburg AI".

The Habsburgs were one of Europe's most powerful royal houses, but entire sections of their family line collapsed after centuries of inbreeding.

Recent studies have shown how AI programs underpinning products like ChatGPT go through a similar collapse when they are repeatedly fed their own data.

"I think the term Habsburg AI has aged very well," Sadowski told AFP, saying his coinage had "only become more relevant for how we think about AI systems".

The ultimate concern is that AI-generated content could take over the web, which could in turn render chatbots and image generators useless and throw a trillion-dollar industry into a tailspin.

But other experts argue that the problem is overstated, or can be fixed.

And many companies are enthusiastic about using what they call synthetic data to train AI programs. This artificially generated data is used to augment or replace real-world data. It is cheaper than human-created content but more predictable.

"The open question for researchers and companies building AI systems is: how much synthetic data is too much," said Sadowski, lecturer in emerging technologies at Australia's Monash University.

- 'Mad cow disease' -

Training AI programs, known in the industry as large language models (LLMs), involves scraping vast quantities of text or images from the internet.

This information is broken into trillions of tiny machine-readable chunks, known as tokens.

When asked a question, a program like ChatGPT selects and assembles tokens in a way that its training data tells it is the most likely sequence to fit with the query.
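
As a rough illustration of "most likely sequence", the toy sketch below counts which word follows which in a tiny made-up text and then extends a prompt with the most frequent follower. Everything in it is an assumption for illustration: the corpus, the function names and the whitespace splitting are hypothetical, and real systems use sub-word tokens and neural networks rather than simple counts.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each token, which tokens follow it in the training text."""
    tokens = text.split()  # assumption: whitespace split stands in for a real tokenizer
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def most_likely_continuation(follows, prompt, length=3):
    """Extend the prompt by repeatedly appending the most frequent follower."""
    out = prompt.split()
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # the last token was never followed by anything in training
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

# Hypothetical miniature "training data"
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigrams(corpus)
print(most_likely_continuation(model, "the", length=1))
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so the sketch continues "the" with "cat".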

But even the best AI tools generate falsehoods and nonsense, and critics have long expressed concern about what would happen if a model were fed its own outputs.

In late July, a paper in the journal Nature titled "AI models collapse when trained on recursively generated data" proved a lightning rod for discussion.

The authors described how models quickly discarded rarer elements in their original dataset and, as Nature reported, outputs degenerated into "gibberish".
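
The loss of rarer elements can be sketched with a toy simulation. This is not the method used in the Nature paper: it assumes a "model" is nothing more than a table of token frequencies, and the vocabulary, sample size and function names are hypothetical. Each generation draws a finite sample from the previous table and refits the frequencies from that sample alone.

```python
import random
from collections import Counter

def next_generation(dist, sample_size, rng):
    """Sample from the current distribution, then refit it from the sample alone."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Tokens that were never drawn get no entry, so they cannot reappear later.
    return {t: c / sample_size for t, c in counts.items()}

def simulate(dist, generations, sample_size, seed=0):
    """Return the list of distributions, starting with the original one."""
    rng = random.Random(seed)
    history = [dist]
    for _ in range(generations):
        dist = next_generation(dist, sample_size, rng)
        history.append(dist)
    return history

# Hypothetical starting data: one common token and two rarer ones
original = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
history = simulate(original, generations=20, sample_size=50)
for generation, dist in enumerate(history):
    print(generation, sorted(dist))
```

Because each table is rebuilt only from samples of the one before it, the vocabulary can shrink but never grow. Once a rare token misses a single round of sampling, it is gone for good in this toy setup.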

A week later, researchers from Rice and Stanford universities published a paper titled "Self-consuming generative models go MAD" that reached a similar conclusion.

They tested image-generating AI programs and showed that outputs became more generic and riddled with undesirable elements as they added AI-generated data to the underlying model.

They labelled model collapse "Model Autophagy Disorder" (MAD) and compared it to mad cow disease, a fatal illness caused by feeding the remnants of dead cows to other cows.

- 'Doomsday scenario' -

These researchers worry that AI-generated text, images and video are clearing the web of usable human-made data.

"One doomsday scenario is that if left uncontrolled for many generations, MAD could poison the data quality and diversity of the entire internet," one of the Rice University authors, Richard Baraniuk, said in a statement.

However, industry figures are unfazed.

Anthropic and Hugging Face, two leaders in the field who pride themselves on taking an ethical approach to the technology, both told AFP they used AI-generated data to fine-tune or filter their datasets.

Anton Lozhkov, machine learning engineer at Hugging Face, said the Nature paper gave an interesting theoretical perspective but its disaster scenario was not realistic.

"Training on multiple rounds of synthetic data is simply not done in reality," he said.

However, he said researchers were just as frustrated as everyone else with the state of the internet.

"A large part of the internet is trash," he said, adding that Hugging Face already made huge efforts to clean data -- sometimes jettisoning as much as 90 percent.

He hoped that web users would help clear up the internet by simply not engaging with generated content.

"I strongly believe that humans will see the effects and catch generated data way before models will," he said.

C.Mak--ThChM