Inbred, gibberish or just MAD? Warnings rise about AI models

Berliner Boersenzeitung - Inbred, gibberish or just MAD? Warnings rise about AI models

Photo: Fabrice COFFRINI - AFP/File

TECHNOLOGY 05.08.2024

When academic Jathan Sadowski reached for an analogy last year to describe how AI programs decay, he landed on the term "Habsburg AI".

The Habsburgs were one of Europe's most powerful royal houses, but entire sections of their family line collapsed after centuries of inbreeding.

Recent studies have shown how AI programs underpinning products like ChatGPT go through a similar collapse when they are repeatedly fed their own data.

"I think the term Habsburg AI has aged very well," Sadowski told AFP, saying his coinage had "only become more relevant for how we think about AI systems".

The ultimate concern is that AI-generated content could take over the web, which could in turn render chatbots and image generators useless and throw a trillion-dollar industry into a tailspin.

But other experts argue that the problem is overstated, or can be fixed.

And many companies are enthusiastic about using what they call synthetic data to train AI programs. This artificially generated data is used to augment or replace real-world data. It is cheaper than human-created content, but it is also more predictable.

"The open question for researchers and companies building AI systems is: how much synthetic data is too much," said Sadowski, lecturer in emerging technologies at Australia's Monash University.

- 'Mad cow disease' -

Training AI programs, known in the industry as large language models (LLMs), involves scraping vast quantities of text or images from the internet.

This information is broken into trillions of tiny machine-readable chunks, known as tokens.

When asked a question, a program like ChatGPT selects and assembles tokens in a way that its training data tells it is the most likely sequence to fit with the query.
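As a rough illustration of "most likely sequence", the toy sketch below counts which word follows which in a tiny sample text and then picks the most frequent follower. It is not how ChatGPT works internally: real systems use neural networks and sub-word tokens, and the function names and the sample sentence here are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every token, how often each other token follows it."""
    tokens = text.split()  # assumption: whitespace-separated words stand in for tokens
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(most_likely_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

The point of the sketch is only that the choice of the next token is driven by frequencies in the training data, which is why the quality of that data matters so much.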

But even the best AI tools generate falsehoods and nonsense, and critics have long expressed concern about what would happen if a model were fed its own outputs.

In late July, a paper in the journal Nature titled "AI models collapse when trained on recursively generated data" proved a lightning rod for discussion.

The authors described how models quickly discarded rarer elements in their original dataset and, as Nature reported, outputs degenerated into "gibberish".
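The loss of rarer elements can be sketched with a toy simulation. This is not the method used in the Nature paper; it assumes a "model" that is nothing more than a table of token frequencies, retrained each generation on a finite sample of its own output. The token names, probabilities, sample size and number of generations are all made up for the example.

```python
import random
from collections import Counter


def next_generation(dist, sample_size, rng):
    """Sample from the current distribution, then re-estimate it from that sample alone."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Tokens that were not sampled get no entry, so they can never come back.
    return {t: c / sample_size for t, c in counts.items()}


def simulate(dist, generations, sample_size, seed=0):
    """Return the number of distinct tokens that survive after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(dist)]
    for _ in range(generations):
        dist = next_generation(dist, sample_size, rng)
        support_sizes.append(len(dist))
    return support_sizes


# Hypothetical starting distribution with a long tail of rare tokens.
initial = {"common": 0.90, "uncommon": 0.08, "rare": 0.015, "very_rare": 0.005}
sizes = simulate(initial, generations=20, sample_size=50)
print(sizes)  # count of surviving tokens per generation; it can only stay level or shrink
```

Because each generation sees only what the previous one produced, a token that misses a single round of sampling is gone for good. Rare tokens are the most likely to be missed, which mirrors the narrowing the authors describe.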

A week later, researchers from Rice and Stanford universities published a paper titled "Self-consuming generative models go MAD" that reached a similar conclusion.

They tested image-generating AI programs and showed that outputs became more generic and riddled with undesirable elements as they added AI-generated data to the underlying model.

They labelled model collapse "Model Autophagy Disorder" (MAD) and compared it to mad cow disease, a fatal illness caused by feeding the remnants of dead cows to other cows.

- 'Doomsday scenario' -

These researchers worry that AI-generated text, images and video are clearing the web of usable human-made data.

"One doomsday scenario is that if left uncontrolled for many generations, MAD could poison the data quality and diversity of the entire internet," one of the Rice University authors, Richard Baraniuk, said in a statement.

However, industry figures are unfazed.

Anthropic and Hugging Face, two leaders in the field who pride themselves on taking an ethical approach to the technology, both told AFP they used AI-generated data to fine-tune or filter their datasets.

Anton Lozhkov, machine learning engineer at Hugging Face, said the Nature paper gave an interesting theoretical perspective but its disaster scenario was not realistic.

"Training on multiple rounds of synthetic data is simply not done in reality," he said.

However, he said researchers were just as frustrated as everyone else with the state of the internet.

"A large part of the internet is trash," he said, adding that Hugging Face already made huge efforts to clean data -- sometimes jettisoning as much as 90 percent.

He hoped that web users would help clear up the internet by simply not engaging with generated content.

"I strongly believe that humans will see the effects and catch generated data way before models will," he said.

(H.Schneide--BBZ)
