Stop Calling Data the New Oil for AI - What Really Drives AI Success
00:00:00: Welcome to Designing AI Heroes, where AI and people align to drive productivity and innovation.
00:00:10: This is the podcast that empowers businesses and individuals to integrate AI into their
00:00:15: workflows and workplace, unlocking their full potential in the digital age.
00:00:21: We bring you insights, strategies, and real-world applications to help you be an AI hero and
00:00:27: stay ahead.
00:00:29: Let's dive in.
00:00:33: Welcome to another episode of the Designing AI Heroes podcast. We often hear that data
00:00:40: is the new oil when it comes to AI, but is this really true, or is it just a catchy new
00:00:46: slogan?
00:00:47: Today, we are unpacking the myths and realities of data in the age of AI, and I'm very happy
00:00:55: that I can speak for the second time with my guest, Sofia Raffa.
00:00:59: She's really an expert when it comes to data in AI, and Sofia, welcome back.
00:01:06: And please, again, introduce yourself and also explain briefly why today's topic
00:01:11: really matters.
00:01:13: So hi, Nadine.
00:01:14: Hi, everyone.
00:01:15: I'm Sofia Raffa, Digital Marketing and AI Consultant, helping startups, enterprises,
00:01:20: and mid-sized companies to grow with digitalization.
00:01:24: Today's topic really matters because nowadays people say that data is
00:01:31: the new oil. If you think about it, this quotation came from Clive Humby in
00:01:37: 2006, but it's still valid nowadays in 2025.
00:01:42: So why was oil valuable?
00:01:46: Because once you refine oil, it's great for powering cars, engines, and
00:01:54: infrastructure.
00:01:55: And this also applies to data.
00:01:58: Raw data is nice to have, but basically you can't do anything with it until you refine
00:02:03: it, and then it becomes powerful.
00:02:09: So data is everywhere, but without management and some structure, you cannot really use it.
00:02:15: You can't see the value and you can't utilize it.
00:02:19: That's why refining the data, building data pipelines and engines, setting up governance, and
00:02:25: then later on using it in AI is the real magic.
00:02:29: So if you compare data with oil, raw data is like crude oil: thick,
00:02:36: messy, and actually unusable. But once you refine it in order to use it, it fuels
00:02:44: innovation and the competitive edge.
00:02:47: So, unlike oil, data is renewable.
00:02:50: The more you use it, the more information you can get out of it.
00:02:54: So the analogy is useful, but incomplete. As a takeaway, data is more like
00:03:01: renewable energy than oil.
00:03:02: Exactly.
00:03:03: So you can use it again and again; it's not gone when you use it.
00:03:09: The energy doesn't go away or get transformed into something different.
00:03:14: Exactly.
00:03:15: So the better you can define the context in which you are using the data, the more value
00:03:21: you can also generate out of it.
00:03:25: From your experience, Nadine, because you are also a consultant when it comes to business
00:03:30: and AI, what distinguishes good data from bad data when it comes to AI applications?
00:03:38: What I see with my clients and companies is that the biggest mistake
00:03:44: in AI projects is choosing quantity over quality.
00:03:50: It's not about collecting as much data as you possibly can, but about identifying the
00:03:56: high-quality data you can use for the AI project, because when you don't have the right data,
00:04:02: it's garbage in, garbage out.
00:04:04: Yeah, it's like prompting: garbage in, garbage out.
00:04:07: It's the same with data: if you put garbage data in, you get garbage out of your AI project.
00:04:14: So that's the biggest mistake I see with companies. They say, "Oh, we need as much
00:04:18: data as we can get," and I say, "Okay, start small, identify one focus, and then build
00:04:25: from there, clean your data."
00:04:30: And many don't know where the data is coming from, from which system.
00:04:34: So first they have to identify, "Okay, this data is here, this data is there."
00:04:39: There are different systems, and then the challenge is how to get
00:04:45: the data out of those systems and clean it so that you can use it for an AI project.
00:04:50: So good data is clean, accurate, and consistent, relevant, and representative, you can say.
00:04:55: Absolutely.
00:04:56: And bad data is incomplete, biased, outdated, or mislabeled, and it really misleads when
00:05:02: it comes to AI.
00:05:04: Yeah, at this point, I just want to give a little overview of what labeled
00:05:09: data means, because many times people think, "Oh, I have tons of data: this is social
00:05:14: data, this is market research data, this is service data." But labeling means
00:05:21: that you are giving tags to the data so that AI can also understand it.
00:05:28: And the better it understands, the better the results it's going to generate.
00:05:31: That's what we call supervised learning.
00:05:34: So the less labeling the data gets, the less accurate the results can be.
00:05:40: There are also unsupervised machine learning methods, but at
00:05:46: the end of the day, the machine is going to figure out some patterns on its own instead of using
00:05:52: already available accurate labels, and then you are going to get the outcome based on
00:05:58: these assumptions.
00:06:00: Specifically in my area, marketing, we have tons of data everywhere:
00:06:05: CRM, social data, customer reviews, all types of data. Some of it is labeled,
00:06:14: some of it is not.
00:06:15: That's why there is a really good machine learning technique, pseudo-labeling, which
00:06:21: means that we use a small amount of labeled data, and then we generate labels for the
00:06:29: unlabeled data based on it.
00:06:31: So that's a really good way to use 90% of your data for something meaningful
00:06:38: and useful.
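The pseudo-labeling idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the speakers' own tooling: the one-dimensional "star rating" feature, the nearest-centroid classifier, the confidence margin, and the 0.5 threshold are all assumptions chosen to keep the example small.

```python
from statistics import mean


def fit_centroids(samples):
    """Train a tiny nearest-centroid classifier: one average value per label."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return (label, confidence); confidence is a distance margin in [0, 1]."""
    distances = sorted((abs(value - center), label) for label, center in centroids.items())
    (nearest, label), (runner_up, _) = distances[0], distances[1]
    total = nearest + runner_up
    confidence = (runner_up - nearest) / total if total else 0.0
    return label, confidence


def pseudo_label(labeled, unlabeled, threshold=0.5):
    """Label unlabeled values with the model, keeping only confident predictions."""
    centroids = fit_centroids(labeled)
    accepted, uncertain = [], []
    for value in unlabeled:
        label, confidence = predict(centroids, value)
        if confidence >= threshold:
            accepted.append((value, label))
        else:
            uncertain.append(value)
    # Retrain on the hand-made labels plus the confident pseudo-labels.
    return fit_centroids(labeled + accepted), accepted, uncertain


# Hypothetical review ratings (1-5 stars); only four of them are hand-labeled.
labeled = [(1.0, "negative"), (2.0, "negative"), (4.5, "positive"), (5.0, "positive")]
unlabeled = [1.2, 4.8, 3.1, 4.0]
model, accepted, uncertain = pseudo_label(labeled, unlabeled)
print(accepted)   # confident pseudo-labels that were added to the training set
print(uncertain)  # values the model was unsure about, left for manual labeling
```

The threshold is the main design choice here: a low value lets wrong pseudo-labels into the training set, while a high value leaves more data unused.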
00:06:41: And when it comes to data usage and handling data, what do you think are the biggest mistakes?
00:06:48: Do you have anything from your experience?
00:06:51: Oh, yeah.
00:06:53: If you think about it, we are here in the middle of Europe, in Germany, and the GDPR is
00:07:00: strictest here.
00:07:01: So we have to do double opt-in in order to define, okay, which user wants what type of
00:07:08: data.
00:07:09: Many times what I hear from marketing people is, "Please don't touch it."
00:07:13: And then I ask, "Okay, but then why do you collect it?"
00:07:17: So many companies are collecting data, and then they simply say, "Oh, please don't
00:07:21: touch it because it's super sensitive."
00:07:24: But if you don't use it, if you don't generate something meaningful, if you don't know where
00:07:30: the data comes from, and I'm referring back to your point about data mapping and inventory,
00:07:36: so where is my data, where does it come from, and where do I save it,
00:07:42: if you don't know this, then you are going to be in big trouble.
00:07:47: Also, many times, because the teams do not really want to
00:07:52: do too much with this so-called sensitive data, the data is dirty.
00:07:57: That means there are tons of duplicates, missing fields, and outdated records.
00:08:04: And then we also must talk about legacy data and legacy tools, because nowadays
00:08:12: people, especially at bigger companies and corporations, tend to forget that, "Oh, we have some
00:08:17: other database which no one really touches, but somehow our data still flows there."
00:08:25: That's also a really big issue because then you don't have a complete overview of
00:08:30: all your data.
00:08:32: AI cannot really fix broken data foundations. If your data house is messy, and if you
00:08:40: don't know where your data is, then AI just scales and exposes the mess faster.
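The three kinds of dirty data named above (duplicates, missing fields, outdated records) can be checked with a small audit before any AI project starts. The following is a rough sketch under assumed conditions: the field names `email`, `country`, and `updated`, the two-year age limit, and the sample records are hypothetical.

```python
from datetime import date


def audit_records(records, required_fields, key_field, today, max_age_days):
    """Split records into clean ones and a count of each data-quality problem."""
    seen = set()
    report = {"missing_fields": 0, "duplicates": 0, "outdated": 0}
    clean = []
    for record in records:
        # 1. Missing fields: any required field that is absent or empty.
        if any(not record.get(name) for name in required_fields):
            report["missing_fields"] += 1
            continue
        # 2. Duplicates: normalize the key so case and spaces do not hide them.
        key = str(record[key_field]).strip().lower()
        if key in seen:
            report["duplicates"] += 1
            continue
        seen.add(key)
        # 3. Outdated records: last update is older than the allowed age.
        if (today - record["updated"]).days > max_age_days:
            report["outdated"] += 1
            continue
        clean.append(record)
    return clean, report


# Hypothetical CRM export.
records = [
    {"email": "anna@example.com", "country": "DE", "updated": date(2025, 3, 1)},
    {"email": "Anna@Example.com ", "country": "DE", "updated": date(2025, 3, 2)},
    {"email": "ben@example.com", "country": "", "updated": date(2025, 1, 10)},
    {"email": "cara@example.com", "country": "FR", "updated": date(2019, 5, 5)},
]
clean, report = audit_records(
    records,
    required_fields=["email", "country", "updated"],
    key_field="email",
    today=date(2025, 6, 1),
    max_age_days=730,
)
print(report)
print([record["email"] for record in clean])
```

Running an audit like this per source system also helps with the data inventory question, because it forces you to name each system and its fields.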
00:08:47: Yeah.
00:08:48: I also see what you said, that there's a lot of fear of using data: "Oh, it's
00:08:53: about people, maybe it's private, we don't know if it's private, so we don't touch it."
00:09:00: And what I also see is data silos. There are different departments with a lot
00:09:08: of good data that you could use in one AI project, but they don't share it because they are
00:09:15: afraid: "I don't know what you're doing with my data."
00:09:19: But an AI project means breaking down the silos in companies. We really have to think
00:09:25: about it: okay, one department has this data, the other department has that data, and it's
00:09:30: only available for this project when we combine it and bring it together in one AI project.
00:09:37: That's also what I see in companies with this fear.
00:09:40: Yeah.
00:09:41: So this is a kind of change management on every level:
00:09:45: change management when it comes to
00:09:47: business ethics, change management when it
00:09:51: comes to business processes,
00:09:53: and change management when it comes to behaviors.
00:09:56: So we say, yeah,
00:09:59: this is our data, and we should do something with it
00:10:04: and put it to use.
00:10:09: But talking about data ethics, Nadine,
00:10:13: what do you think, how important is
00:10:15: data ethics in the development and deployment of AI systems?
00:10:19: It's absolutely crucial.
00:10:21: Without AI ethics, AI will backfire.
00:10:24: So it's okay to use data
00:10:30: when we have identified the right data,
00:10:33: but just don't use it without thinking.
00:10:36: First, the data should have a purpose.
00:10:39: You should know what decision it supports,
00:10:42: or what you would decide with the data.
00:10:45: That's one thing.
00:10:46: So start with one focus, with one area.
00:10:50: Then there is the risk that the data can be
00:10:53: biased and lead to unfair decisions.
00:10:58: It may prefer one decision over
00:11:02: another, but decisions should be neutral,
00:11:05: and AI can only decide based on its data.
00:11:07: So people say, AI is biased,
00:11:10: the AI system is not biased, the data is biased.
00:11:12: Exactly.
00:11:14: Yeah, the technology is not bad or biased,
00:11:16: but the data is biased.
00:11:18: So you really have to check whether the data is
00:11:20: neutral and can support the neutral decisions you would like to make.
00:11:24: Then it also can hurt privacy.
00:11:27: That means you lose a lot of your
00:11:29: customers' trust when they find out that you use
00:11:32: private data for another purpose.
00:11:34: Customers, and also employees, give you
00:11:38: their data for a given purpose.
00:11:41: Here in Europe, you have to tell people why
00:11:44: you collect the data and what purpose it has.
00:11:48: So this data is only for this purpose,
00:11:52: and if you use it for another purpose,
00:11:54: then you can lose a lot of trust.
00:11:56: Exactly.
00:11:57: So you have to have the right strategy for how to use the data.
00:12:01: Exactly. I also see a really big need for explainable AI.
00:12:07: That's also part of the EU AI Act,
00:12:10: which means that AI decisions must be
00:12:14: transparent and must be understandable for all of us,
00:12:19: for humans and not only for the robots.
00:12:22: Take credit scoring:
00:12:25: AI should explain why a loan was denied,
00:12:28: not just give a number saying that you are not good
00:12:31: enough to get a bank loan.
00:12:33: Without that explainability,
00:12:35: AI is really like a black box, and it can really destroy trust
00:12:40: if you don't check what the outcome is and why you got this outcome.
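To make the credit scoring example concrete, here is a minimal sketch of a score that comes with its reasons. It assumes a simple additive point model; the feature names, weights, baseline values, and threshold are invented for illustration and do not describe any real bank's scoring or a method prescribed by the EU AI Act.

```python
def score_with_reasons(applicant, weights, baseline, threshold, top_n=2):
    """Score an applicant and name the features that pulled the score down most."""
    # Each feature contributes weight * (difference from a reference applicant).
    contributions = {
        name: weights[name] * (applicant[name] - baseline[name]) for name in weights
    }
    score = sum(contributions.values())
    approved = score >= threshold
    # Sort negative contributions so the most harmful feature comes first.
    negatives = sorted((points, name) for name, points in contributions.items() if points < 0)
    reasons = [name for _, name in negatives[:top_n]]
    return approved, score, reasons


# Hypothetical point model: higher income helps, debt and late payments hurt.
weights = {"income_keur": 1, "debt_ratio_pct": -2, "late_payments": -20}
baseline = {"income_keur": 40, "debt_ratio_pct": 30, "late_payments": 0}
applicant = {"income_keur": 36, "debt_ratio_pct": 45, "late_payments": 2}

approved, score, reasons = score_with_reasons(applicant, weights, baseline, threshold=0)
print(approved, score, reasons)
```

Instead of only a number, the applicant can be told which factors weighed most against the loan, which is the kind of transparency described above.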
00:12:45: Yeah. When it comes to AI agents,
00:12:47: an AI agent can make decisions by itself.
00:12:49: So maybe it can also decide who to hire,
00:12:53: who to fire, or deploy a marketing campaign by itself,
00:13:01: and that's what you have to think about up front:
00:13:03: where should the human be in the loop when it comes to decisions?
00:13:08: Can the AI make the decision itself?
00:13:11: If that is too risky,
00:13:15: then you should keep oversight,
00:13:17: and when you let the AI make decisions,
00:13:19: then you need really strict data governance and compliance so that you can avoid these risks.
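One simple way to decide where the human stays in the loop is a routing rule in front of the agent's actions. The sketch below is an assumption-laden illustration: the action kinds, the set of high-risk actions, and the 0.9 confidence limit are hypothetical, and a real policy would come out of the company's own governance rules.

```python
from dataclasses import dataclass

# Assumption: decisions about people always need a human sign-off.
HIGH_RISK_ACTIONS = {"hire", "fire"}


@dataclass
class ProposedAction:
    kind: str          # e.g. "hire", "fire", "marketing_campaign"
    confidence: float  # the agent's own confidence between 0 and 1
    summary: str       # short description shown to the reviewer


def route(action, min_confidence=0.9):
    """Return "human_review" or "auto_execute" for an action proposed by an agent."""
    if action.kind in HIGH_RISK_ACTIONS:
        return "human_review"
    if action.confidence < min_confidence:
        return "human_review"
    return "auto_execute"


print(route(ProposedAction("fire", 0.99, "Terminate contract")))
print(route(ProposedAction("marketing_campaign", 0.95, "Send spring newsletter")))
print(route(ProposedAction("marketing_campaign", 0.60, "Launch new ad set")))
```

Note that the high-risk check comes first, so even a very confident agent cannot bypass review for those actions.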
00:13:28: Yeah. Nadine,
00:13:31: how do you see companies implementing data governance,
00:13:34: and how well does it work currently?
00:13:38: It's very different.
00:13:40: It differs from bigger companies to smaller companies.
00:13:47: I see it mostly in bigger companies, and it also depends on the industry.
00:13:52: I also work with industries
00:13:55: that, when it comes to the AI Act,
00:13:59: face real restrictions, like
00:14:03: the finance industry or the insurance industry,
00:14:07: where we really have personal data of
00:14:11: people for high-class insurance or high-risk insurance.
00:14:16: So data governance really means having accountability,
00:14:22: ownership, and rules for managing data,
00:14:25: and it really comes down to the industry.
00:14:28: In some industries they say, okay,
00:14:30: we are aware, we really have private data,
00:14:32: we need strict governance, and in other industries
00:14:35: they experiment a bit more.
00:14:37: But without governance,
00:14:39: you also get chaos, duplication, and a lack of trust in data.
00:14:43: So you need somebody who says,
00:14:45: okay, we have an AI project and these are
00:14:48: the steps to clean, structure, and refine
00:14:50: our data, because people don't know how to do that.
00:14:53: An example framework would start with: who owns the data?
00:14:58: That's one question.
00:14:59: The second question is how data is created,
00:15:02: cleaned, and shared. Then come the policies:
00:15:05: privacy, security, and compliance rules.
00:15:08: And then the platforms that use the data:
00:15:12: how secure the data is
00:15:13: and how the data is used in the platforms.
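The framework questions above (owner, origin, policies, platforms) can be captured as one catalog entry per dataset. This is only a sketch under assumptions: the `DatasetEntry` fields, the example values, and the purpose names are hypothetical, and the purpose check is a simplified stand-in for a real legal assessment.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    owner: str                    # who owns the data
    source_system: str            # where the data is created
    storage_location: str         # where the data is saved
    contains_personal_data: bool  # triggers privacy and compliance rules
    platforms: list = field(default_factory=list)        # tools that use the data
    allowed_purposes: set = field(default_factory=set)   # purposes people were told about


def may_use(entry, purpose):
    """Allow a use only if the purpose was declared when the data was collected."""
    return purpose in entry.allowed_purposes


# Hypothetical catalog entry for newsletter subscribers.
subscribers = DatasetEntry(
    name="newsletter_subscribers",
    owner="Marketing",
    source_system="CRM",
    storage_location="eu-central data warehouse",
    contains_personal_data=True,
    platforms=["email tool"],
    allowed_purposes={"newsletter"},
)

print(may_use(subscribers, "newsletter"))
print(may_use(subscribers, "ai_training"))
```

Keeping the allowed purposes next to the dataset makes it harder to reuse data for something people never agreed to, which is the trust risk discussed earlier in the episode.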
00:15:15: This is really an expert role,
00:15:19: and I think companies should have it.
00:15:22: It's a new role when it comes to AI:
00:15:24: a data governance role,
00:15:27: a person who is responsible for data governance and
00:15:30: explains to people what to do with their data.
00:15:33: So it's really a new role,
00:15:36: and when it comes to bigger companies,
00:15:38: I see they now implement these roles,
00:15:40: or there is an AI governance function
00:15:42: that takes over this role.
00:15:44: Smaller companies
00:15:46: struggle with this; maybe there it is the CAO.
00:15:48: But there should be one responsible person.
00:15:50: It can be the CAO, or it can be
00:15:53: an AI governance or a data governance role.
00:15:55: >> Yeah, I totally agree with you.
00:15:57: But on the other hand,
00:15:59: what I see many times is that
00:16:00: so many users, so many employees say,
00:16:03: "Oh, I'd rather not touch it,
00:16:06: so I don't have any issues."
00:16:08: But I think a data governance person or
00:16:10: a responsible person should also encourage
00:16:13: people to use it, but in a certain way,
00:16:16: a way in which you can generate value with it,
00:16:19: and also somehow promote a data-driven culture
00:16:23: where decisions are based on accurate data,
00:16:26: which also moves things forward.
00:16:30: I think that's also a really important point
00:16:33: in the whole data process.
00:16:35: >> Yeah, and I also hear,
00:16:37: "Oh, we don't do this AI project
00:16:39: because we are afraid of our data."
00:16:41: >> That shouldn't be the case.
00:16:44: Just take some time,
00:16:46: identify the sources, clean it,
00:16:48: check it, and then really start small with
00:16:52: one focus and build from there.
00:16:55: But don't shy away from deploying AI projects
00:16:59: because you are afraid of your data.
00:17:01: I mean, if you collected it,
00:17:02: it has a purpose.
00:17:04: When the AI serves the purpose
00:17:07: you collected the data for,
00:17:08: then I think not that much can go wrong.
00:17:12: >> Exactly, I totally agree with you.
00:17:15: Nadine, last but not least,
00:17:17: if you could give companies just one piece of advice
00:17:21: about how to work with data and AI,
00:17:25: what would you say?
00:17:27: >> Start with what you have,
00:17:29: learn from that, improve along the way.
00:17:33: Never wait for perfect data,
00:17:35: just experiment and learn fast.
00:17:38: Also treat data as a product,
00:17:41: not as a by-product of AI.
00:17:43: It's really the most important thing in AI.
00:17:46: It's not a by-product, as in "we need some data
00:17:49: to get the system running," because
00:17:50: the system runs on this data,
00:17:52: so it's a product.
00:17:54: >> Exactly.
00:17:55: >> There are more takeaways than one takeaway,
00:17:58: but it's really difficult to break it down to one sentence.
00:18:04: >> Yeah, I totally understand,
00:18:06: and we also have tons of experience with everything,
00:18:09: so picking only one piece of advice is challenging indeed.
00:18:15: I would also say that AI is a colleague,
00:18:18: so use it as a robot colleague,
00:18:22: and also share some data,
00:18:24: but keep in mind that it's only a colleague,
00:18:28: not your second body or something.
00:18:31: That's also a really important point here.
00:18:34: >> Sofia, many thanks for this very interesting episode,
00:18:40: and I hope it could give you some insight when it comes to data:
00:18:43: what you have to keep in mind,
00:18:45: what to do, and what risks it has.
00:18:48: But the message is really: identify one focus,
00:18:52: one area, clean your data,
00:18:54: build from there, learn and improve along the way.
00:18:57: I think that's the main message.
00:18:59: >> Exactly. Let's make things sophisticated together.
00:19:02: >> Sophisticated, I like this.
00:19:03: >> Okay. See you. Bye.
00:19:05: >> Thank you.
00:19:06: >> Ciao.
00:19:07: >> Ciao.