The Challenge of AI is Reliability

AI is everywhere now, from chatbots to self-driving cars; it's on everyone's lips. On LinkedIn, the number of AI experts has skyrocketed, even if their so-called expertise is often limited to knowing how to use ChatGPT, Midjourney, or any other AI tool. The hype is strong, which is good for the AI industry, but it also creates a lot of confusion. People think AI is magic—that it can solve any problem, that it can do anything. But the reality is that AI is not magic; it's just a tool, and like any tool, it has its limitations.

The challenge is to integrate AI without compromising the overall reliability of the system. Why? Think of AI models, especially those based on deep learning, as probabilistic black boxes. They are trained on data, and they make predictions based on that data. Every prediction comes with some uncertainty that is inherent to the model. That's why you will often hear terms like accuracy, precision, recall, and F1 score. These metrics tell us how well a model performs on a given task. They are usually computed on a test set, a subset of the data that the model never saw during training. Evaluating on held-out data shows whether the model has learned patterns that generalize or has merely memorized the training data. However, even with these precautions, there is always a risk that the model will make an unforeseen mistake in production. The real world is messy, and you cannot train a model on every possible scenario, because the number of scenarios is effectively infinite. There will always be some cases where the model fails.
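To make the metrics mentioned above concrete, here is a minimal sketch that computes them for a binary classifier from a list of true labels and a list of predictions. The function name and the toy labels are made up for illustration; in practice you would more likely use an established library.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy held-out test set (invented labels, for illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "when the model says positive, how often is it right?", while recall answers "how many of the real positives did it find?". Which one matters more depends on which kind of mistake is more expensive.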

McDonald's offers a recent example: it paused its AI-driven drive-thru ordering, temporarily, after bizarre order mistakes were reported on social media. The issue? Reliability. The AI was not reliable enough to handle the orders.

Reliability vs Cost Trade-off

When you use an AI model that is right 70% of the time, for example, you must evaluate the cost of the 30% of the time when it's wrong. In some cases, the cost of failure is low, and you can afford to use a less reliable model. For instance, if you're using AI to recommend movies on Netflix and the model makes a mistake, it's not a big deal—you can just watch another movie. However, in other cases, the cost of failure is high, and you need a more reliable model. For example, if you're using AI to diagnose cancer and the model makes a mistake, it becomes a matter of life and death. Therefore, you need a model that is as reliable as possible.
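The trade-off above is essentially an expected-cost calculation. The sketch below multiplies the error rate by a cost per error and a number of predictions. All figures are hypothetical and chosen only to show how the same 70% accuracy leads to very different totals; it also assumes every error costs the same, which is rarely true in practice.

```python
def expected_error_cost(accuracy, cost_per_error, n_predictions):
    """Expected total cost of wrong predictions, assuming a uniform cost per error."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    error_rate = 1.0 - accuracy
    return error_rate * cost_per_error * n_predictions


# Hypothetical numbers, for illustration only.
# A bad movie recommendation costs almost nothing...
recommendation_cost = expected_error_cost(accuracy=0.70, cost_per_error=0.01, n_predictions=10_000)
# ...while a wrong diagnosis is assigned a very large cost.
diagnosis_cost = expected_error_cost(accuracy=0.70, cost_per_error=50_000.0, n_predictions=10_000)

print(f"Recommendations: {recommendation_cost:,.2f}")
print(f"Diagnoses:       {diagnosis_cost:,.2f}")
```

The model is equally accurate in both cases. Only the cost of being wrong changes, and that is what decides whether 70% is acceptable.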

Let's consider large language models (LLMs), arguably the most popular AI models today. They are trained on massive amounts of data and can generate human-like text. However, it is difficult to evaluate their accuracy. The industry uses benchmarks to check common-sense reasoning, factuality, and so on. But these benchmarks are not perfect. They don't cover all possible scenarios and, more importantly, it is often difficult to be certain that the test sets are not contaminated, that is, that the benchmark questions did not leak into the training data. Even with that in mind, there is broad agreement that LLMs are unreliable because they can hallucinate: they generate statements that sound plausible but are not true. This is a minor issue if you are using an LLM to rephrase a sentence, but it becomes critical if you use it to generate news articles, for example. Giving the model additional context through retrieval augmentation can help mitigate the problem, but it is not a silver bullet. The reliability issue remains.

How to Deal with the Challenge of AI Reliability

So, how do we deal with this challenge? There are several ways to do it. One way is to use multiple models and combine their predictions. This is known as ensemble learning, a common technique in machine learning; one popular variant trains each model on a different subset of the data. Combining predictions reduces the risk of failure, because even if one model is wrong, the others might still be right. Another approach is to implement human oversight, especially in critical applications like healthcare or autonomous driving. Humans provide a sanity check: they can verify that the model is not making obvious mistakes and, if it does, intervene and correct it. This approach is called human-in-the-loop AI, and it's a growing field in AI research. Finally, we can use uncertainty estimation to quantify how confident a model is in each prediction. This helps us identify cases where the model is unsure and act accordingly. For example, if the model is uncertain about a diagnosis, we can ask a human expert for a second opinion.
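As a rough illustration of why combining models helps, the sketch below implements a simple majority vote and computes how often the majority is right. The calculation assumes a binary task, an odd number of models, and models whose errors are fully independent. That last assumption is optimistic: real models trained on similar data tend to make correlated mistakes, so the actual gain is usually smaller. The function names are made up for this example.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the most common label among the models' predictions.

    Ties are broken in favor of the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


def majority_accuracy(p, n_models):
    """Probability that a strict majority of n_models is correct.

    Assumes a binary task, an odd n_models, and independent models
    that are each correct with probability p.
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


print(majority_vote(["cat", "dog", "cat"]))
print(f"1 model:  {majority_accuracy(0.7, 1):.3f}")
print(f"3 models: {majority_accuracy(0.7, 3):.3f}")
print(f"5 models: {majority_accuracy(0.7, 5):.3f}")
```

Under these idealized assumptions, three models that are each right 70% of the time are right together about 78% of the time. The improvement is real, but it comes at several times the training and serving cost.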

Always consider the cost of failure. If the cost is high, you need a more reliable model. If the cost is low, you can afford to use a less reliable one. This is particularly important when you deploy an AI model as part of a larger system. Ask yourself: What happens to my workflow if the AI model fails 20-30% of the time? Can I afford it? What are the consequences for my business? For my customers? For my reputation? The answers should be clear before you deploy, and they will also shape the user interface you design. For example, if the cost of failure is high, you may want to design your system so that a human expert can intervene when the model fails. Or you might inform the user about the model's limitations, as ChatGPT does.
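One way to design for human intervention is to route low-confidence predictions to a reviewer. The sketch below assumes the model returns a probability for each label and that these probabilities are reasonably well calibrated, which is not guaranteed for deep learning models. The function name, the 0.9 threshold, and the labels are hypothetical; the threshold in particular should be derived from the cost of failure in your own application.

```python
def route_prediction(probabilities, threshold=0.9):
    """Accept the model's answer only when its top probability meets the threshold.

    `probabilities` maps each label to the model's estimated probability.
    Anything below the threshold is handed to a human reviewer, with the
    model's best guess attached as a suggestion.
    """
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return {"handled_by": "model", "decision": label, "confidence": confidence}
    return {
        "handled_by": "human_review",
        "decision": None,  # no automatic decision is made
        "model_suggestion": label,
        "confidence": confidence,
    }


# Hypothetical model outputs, for illustration only.
confident_case = route_prediction({"benign": 0.97, "malignant": 0.03})
uncertain_case = route_prediction({"benign": 0.55, "malignant": 0.45})

print(confident_case)
print(uncertain_case)
```

Raising the threshold sends more cases to humans and lowers the risk of an automated mistake; lowering it does the opposite. The right setting is again a question of what a failure costs.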

None of these approaches is free. Ensemble learning can be costly, as it requires multiple models to be trained and maintained. Human oversight can also be expensive, since experts must be available whenever the system is running, in some cases around the clock. Uncertainty estimation is promising, but getting large deep learning models to report trustworthy confidence scores is still an open problem, and it is not always clear how well current methods hold up in practice. Thus, there is no silver bullet here, no one-size-fits-all solution. The challenge of AI is reliability, and it's a challenge that we must face head-on if we want to build AI systems that are safe and trustworthy.