After the Conference

A Short Story on the Ethics of Using AI Datasets with Unknown Provenance

This is fiction, but the ethical positions are loosely shaped by these thinkers’ public work on consent, bias, accountability, human character, sustainability, and long-term AI risk. Jump to TL;DR

We ended up at a bar because that’s where these conversations go when the conference version of them runs out of oxygen.

Not a cool bar. Not a bad bar either. Just one of those in-between places near a convention centre where the lighting is too dim for networking and too bright for lying to yourself. Sticky wood tables. Half-heard conversations. Somebody laughing too hard near the back. Hockey on a muted screen.

I was there with a friend of mine who is smart enough to follow almost anything, but only if you stop talking like a brochure. Curious guy. Fast mind. A little overloaded by default. The kind of person who asks good questions right when everyone else is pretending they understood the answer.

He had that look on his face I’ve seen a lot lately.

Not anti-AI. Not pro-AI. More like: I can tell something important is happening, and I don’t trust the people selling it.

Fair.

By some weird post-conference gravity, the table filled up with Alice Xiang, Yoshua Bengio, Shannon Vallor, Abeba Birhane, and Aimee van Wynsberghe. Which sounds improbable, and maybe it was, but it also felt like one of those moments where reality briefly gets its act together.

My friend looked around at all of them, took a sip of his beer, and just asked it straight.

“Okay. Simple version. Is it ethical to use AI datasets when we don’t really know where the data came from?”

Good question.

Actually, better than good. Clean. No fluff. Cuts right past the branding.

Alice was the first to answer, and she answered the way people do when they’ve spent enough time around bad systems that they no longer bother decorating the problem.

“If you don’t know where the data came from,” she said, “you probably don’t know who got pulled into the system without meaning to be. And if you don’t know that, then you don’t really know what you built.”

She didn’t say it dramatically. That was the thing. No big performance. Just matter-of-fact.

She talked about how even the basic act of checking for bias becomes slippery when the underlying data was collected in ways nobody can really account for. If the dataset itself is a mess, then all the fairness language stacked on top of it starts to feel a bit fake. Like quality control on stolen parts. Her public work has pushed hard on ethically sourced, consent-based benchmarking because otherwise even your “measurement” of fairness can be contaminated from the start. 

My friend nodded slowly.

“So that’s the rules argument?”

“Partly,” I said.

Alice shrugged. “Call it duty, call it legitimacy, call it basic respect. If people didn’t agree to be in the system, and you can’t explain how their data got there, you’ve got a moral problem before you even get to model performance.”

That landed.

Then Bengio leaned in, hands around his glass like he was trying to warm them.

“I agree,” he said. “But even if you put consent aside for a second, you still have to ask what the system is doing in the world. What scale of harm are we enabling? What future are we normalizing?”

That was his lane. Not in the cartoon way people sometimes frame him, but in the sober way of somebody who has spent years looking at capability curves and not liking where they point. His recent work has focused heavily on advanced AI risk, safety-by-design, and the need to take longer-term consequences seriously instead of treating them like science fiction until it is too late. 

He kept going.

“Bad data is not just an origin problem. It becomes an outcome problem. Healthcare systems. Hiring tools. Decision support. Misinformation. Surveillance. If the system scales, the harm scales.”

My friend looked at me. “So that’s more like consequences.”

“Yeah,” I said. “Not just ‘was the input clean?’ but ‘what happens when this thing leaves the lab and starts deciding who gets seen, hired, flagged, priced, diagnosed, trusted.’”

Abeba gave this tiny nod, like the word “trusted” had annoyed her a bit.

“I think people jump too fast to abstract ethics,” she said. “The issue is also power. Whose data gets taken. Whose labour gets hidden. Whose communities get misrepresented. Whose harm gets treated as acceptable collateral.”

There it was.

The room didn’t go quiet exactly, but the table did.

Because that’s the part a lot of tech conversations try to skip past. They want to talk about bias like it’s a technical imperfection. As if we are all standing the same distance from the blast radius.

We are not.

Birhane’s work has repeatedly exposed how large internet-scraped datasets can scale hateful, racist, misogynistic, and otherwise harmful content rather than wash it out, and how those costs often fall hardest on already marginalized people. She has also been blunt about how opacity and exploitative labour get buried inside the AI supply chain. 

“So no,” she said. “It’s not just ‘we don’t know where the data came from.’ It’s that we often do know enough to know it’s dirty, and we keep going anyway because the incentives reward scale.”

My friend exhaled through his nose.

“Okay. That one I understand.”

Shannon Vallor smiled a little, but not because anything was funny.

“What interests me,” she said, “is what habits we are building in ourselves while we make these systems. What kind of people do we become when convenience matters more than wisdom, when efficiency matters more than care, when imitation matters more than understanding?”

That could have gone pretentious in someone else’s mouth. It didn’t.

She made it feel practical.

Like this wasn’t some seminar detour into the soul. It was about culture. Practice. Repetition. The moral grooves we carve by doing the same thing over and over and calling it innovation.

Her public work asks whether AI strengthens virtues like wisdom, care, and creativity, or instead reinforces less humane patterns. That’s a different kind of ethics problem. Not just “what rule was broken” or “what harm occurred,” but “what are we training ourselves to value?” 

My friend laughed once.

“So basically: if we build systems by acting like creeps, maybe that does something to us.”

Shannon lifted her glass. “That is one way to put it.”

Then Aimee jumped in, and she widened the frame.

“You also have to ask what it costs to sustain these systems,” she said. “Not just socially, but environmentally, institutionally, economically. People keep acting like ethics ends at the user interface.”

She talked about infrastructure, energy, extraction, regulatory design, and how “responsible AI” can become too narrow if it only looks at immediate user harms while ignoring the wider system. Her work at Bonn’s Sustainable AI Lab has been focused exactly there: the environmental, social, and economic costs of designing, developing, and using AI, and the need for green, proportionate, sustainable governance rather than ethics theatre. 

“An AI system can be consent-based and still be irresponsible,” she said. “It can still concentrate power. Still burn resources. Still create dependency. Still reward bad institutional behavior.”

That was the turn.

Because until then, the question had been: is it ethical to use data when you don’t understand where it came from?

And the table, more or less, had said: no. Or at minimum: not without serious moral debt attached.

But then my friend asked the better follow-up.

“Fine. Let’s do the fantasy version. Clean slate. Everybody consented. Everybody got paid. Full transparency on data collection. Now what? Is it ethical then?”

Nobody answered right away.

Which I appreciated.

Because if someone answers too fast there, they’re usually trying to sell you a framework, not tell you the truth.

Alice went first again.

“Then you need provenance that stays visible,” she said. “Not just once, not in a press release. Ongoing traceability. What data was used, under what terms, with what limitations, for which purpose.”

“Like chain of custody,” I said.

“Exactly.”

Not because transparency magically fixes everything, but because without it accountability becomes theatre. If the model causes harm and nobody can tell what went into it, who approved it, who profited from it, or what testing was done, then “responsible AI” is just branding with nicer fonts.

Bengio picked it up from there.

“And then rigorous evaluation,” he said. “Not benchmark worship. Real testing in real contexts, with uncertainty taken seriously.”

Healthcare came up first.

Say you build an AI tool to help detect disease from scans. Great. On paper. But who trained it? On what patient populations? Does it perform equally well across age groups, skin tones, equipment standards, hospitals, regions? What happens when doctors trust it too much? What happens when administrators use it to cut staff because the spreadsheet says the machine is good enough?

That’s the thing with AI. The model is never the whole story. The deployment context is part of the morality.

Same with hiring.

A system that filters candidates can be built on clean, consented data and still end up reproducing class bias, educational bias, language bias, disability bias, all kinds of bias. Because fairness is not just about whether the data was obtained properly. It’s also about what success variable you optimize for, what proxies you use, and who gets treated as a statistical inconvenience.

Abeba was sharp on that.

“People love saying ‘bias testing’ like it’s a final exam,” she said. “It isn’t. Fairness is ongoing. Representation is ongoing. Audit is ongoing. Communities affected by the system need meaningful ways to challenge it.”

Not symbolic consultation. Actual leverage.

In creative work, things got even messier.

Photography. Writing. Illustration. Music.

My friend looked at me because he knew that one would hit home.

If creators are consented and compensated, good. Obviously better. Much better. But then new questions show up.

Who gets included and who gets left out?

What happens to the market when synthetic abundance drives down the value of human work anyway?

Who gets rewarded: the original creator, the platform, the model company, the distributor, the prompt engineer, or the client who now expects ten concepts for the price of one?

Consent matters. Payment matters. But economic displacement is not solved just because the intake form was legal.

Aimee made that point cleanly.

“You have to evaluate system effects, not just transaction ethics,” she said. “A tool can be fair at the point of data collection and still unfair in the world it helps create.”

That one stayed with me.

Because it cuts through a lot of fake moral certainty.

The bar had thinned out by then. Chairs upside down on some tables. Someone wiping down bottles. That end-of-night feeling where everybody’s voice gets a little more honest because there’s no point posturing anymore.

My friend looked around the table.

“So what does ethical AI actually look like?”

Shannon answered softly.

“Probably less like a product and more like a practice.”

Then, after a second:

“Transparency. Accountability. Human oversight. Contestability. Limits. Some uses should be slowed down. Some maybe should not exist. And all of it should be shaped by the kind of future we actually want to inhabit, not just the one we are technically capable of building.”

That felt right.

Not neat. Right.

Because the honest answer is that consent alone is not enough. Compensation alone is not enough. Clean data alone is not enough.

You can solve the origin problem and still build a system that centralizes power, erodes trust, automates discrimination, floods the public sphere with synthetic sludge, pressures human workers, and gives institutions one more excuse to avoid human judgment while pretending to be objective.

Ethical AI, if that phrase is going to mean anything at all, has to be ongoing.

Not a checkbox.

A living system of traceability, audit, restraint, governance, challenge, revision, and actual responsibility when things go wrong.

And even then, some tension remains.

Because there will still be tradeoffs.

Still grey zones.

Still people pretending uncertainty means permission.

Still companies treating ethics like a cost centre.

Still real benefits in medicine, accessibility, science, and education pulling in one direction, while concentration of power, labour disruption, environmental cost, and manipulation pull in the other.

No tidy ending.

Probably a good sign.

When we finally got up to leave, my friend put on his jacket and said, “So the answer is basically: if you don’t know where the data came from, you’re already in trouble. And if you do fix that, congratulations, now the real ethics work starts.”

Pretty much.

Outside, the street was wet and reflecting the city back at itself in broken pieces.

Which, now that I think about it, is maybe the closest I’ll get to a metaphor.

TL;DR



  • If you do not know where AI training data came from, you have a legitimacy problem before you even get to performance.

  • Consent and compensation matter, but they do not solve everything.

  • Ethical AI also needs transparency, provenance, traceability, and clear accountability.

  • Fairness is not a one-time benchmark. It needs ongoing testing, audit, and challenge.

  • Real-world deployment matters as much as model design, especially in healthcare, hiring, media, and law.

  • Creative industries are not “solved” just because creators were paid. Market effects and power concentration still matter.

  • Ethical AI is not a badge. It is a system of ongoing human judgment, oversight, and restraint.



If you would like to see the reasoning behind this opinion, please read the white paper.

Cedric Swaneck

Portrait and Commercial Photographer

https://www.cedricswaneck.com