SPM Blog 43: Human Judgment Required – The Status of AI and Its Implications for Product Managers (and everybody else)

SPM + AI
AI tools based on Large Language Models (LLMs) have become ubiquitous over the last three years. While initial use tended to be naive and exploratory, use in professional environments is becoming standard and can help product managers a lot. In this post, we look at recent research results on AI, ideas on AI as a game changer for industries, severe limitations of LLM-based AI products, and what this means for the practice and education of product managers. Summary: Use them responsibly, but don’t trust them. Human judgment is needed more than ever.

Since OpenAI launched ChatGPT in November 2022, large language models (LLMs) have become almost synonymous with the term AI in the public perception. Since then, a number of additional LLM-based tools such as Gemini, Copilot, Claude, and Perplexity have been launched, with new versions released quite frequently; some of them are free, some come at a price. We have all experimented with them, and most people find them useful. The outputs often seem to be surprisingly good, but we are biased. When we ask a question via prompt, it is usually about a subject that we do not know much about. So if the resulting output looks plausible, we consider it useful and assume it is correct. Subject matter experts are frequently more critical. The quality of the outputs depends on the quality of the data that was used to train the LLM.

We are seeing significant productivity increases in a number of areas like software development, text creation and improvement, and many more. Vendors continue to be super enthusiastic and invest enormous amounts of money, in particular into infrastructure, without having a proven business model. Market valuations are also enormous. However, B2B customers are experiencing high failure rates of AI projects and unconvincing ROIs.

Gartner positioned Generative AI (LLMs are part of this category) as heading towards the "trough of disillusionment" in its latest AI Hype Cycle (Gartner, June 2025). In other words, in spite of all the hype, we have learned that LLMs do not solve all the problems of the world. That is reflected in quite a number of recent publications, both from researchers and practitioners.

ISPMA Fellow Sangeet Paul Choudary published a book in August 2025 entitled "Reshuffle". He claims that AI is not just a productivity tool: its real power comes from structural changes in companies, or even in whole industries, that build on AI's coordination capabilities. There are no implementations of this vision yet, but when you look at a tool that was launched last November under the name Clawedbot and is now called OpenClaw, you can already envision what this looks like. OpenClaw is intended as a personal assistant that runs on your PC or laptop. You give OpenClaw full access rights, specify which LLMs it should use, and then it acts on your behalf. It manages your messages across all channels, answers them, pays invoices if you give it access to your account or credit card, and installs software, all of that autonomously. It is quite amazing to experience this, and you can easily extrapolate from this tool to a corporate environment as Choudary describes it. Whether you really want to give so much control over your life to such a software tool is a question that you need to answer for yourself.

One thing we have learned is that LLMs hallucinate, i.e. produce outputs that are plausible but factually incorrect, inconsistent, or outright fabricated. In my software life, that is close to the definition of a defect, a bug that needs to get fixed. "Hallucination" sounds much better, more human, more intelligent. Why was a new term needed? That question is answered by Sourav Banerjee, Ayushi Agarwal & Saloni Singla (2025) and Michał P. Karpowicz (2025). They provide mathematical proof that hallucinations are inherent to LLM technology. It is not a bug, it is a feature: unwanted, but not fixable, even with the best training data.

You may say, well, this is just theory. As long as hallucination rates are lower than human error rates, there is no problem. So we need to look at the numbers. On the internet you can find benchmark results for hallucination rates. One is published by Vectara on GitHub. Their benchmark is fairly simple: it is just about summarizing a short document. Even for this simple case, they come up with hallucination rates between 4 and 8% for the top 25 LLMs. These numbers have not improved with new versions of the LLMs. For more complex benchmarks, you can find much higher hallucination rates, in the range of 40 to 70 percent, clearly higher than human error rates.
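To make the arithmetic behind such benchmark numbers concrete, here is a minimal Python sketch of how a hallucination rate for a summarization benchmark can be computed from labeled examples. The data and the faithfulness labels are hypothetical placeholders; real leaderboards such as Vectara's use a trained factual-consistency model to judge each summary, which is not reproduced here.

```python
# Minimal sketch: computing a hallucination rate for a summarization benchmark.
# The labels below are hypothetical; real benchmarks derive them from a
# factual-consistency judge, not from hand-coded booleans.

from dataclasses import dataclass

@dataclass
class Example:
    source: str      # the document given to the LLM
    summary: str     # the summary the LLM produced
    faithful: bool   # verdict: is the summary fully supported by the source?

def hallucination_rate(examples: list[Example]) -> float:
    """Fraction of summaries that contain unsupported (hallucinated) content."""
    if not examples:
        return 0.0
    hallucinated = sum(1 for e in examples if not e.faithful)
    return hallucinated / len(examples)

# Illustrative data only: 2 hallucinated summaries out of 50 -> 4%,
# the lower end of the range quoted above for simple summarization tasks.
examples = [Example("doc", "summary", True)] * 48 + [Example("doc", "summary", False)] * 2
print(f"Hallucination rate: {hallucination_rate(examples):.1%}")
```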

Recommendation: given the enormous productivity boost that you can get from these tools, use them, but use them responsibly. That means make sure that the tool and its underlying data are a good fit for the problem or question that you are facing. And don’t trust the outputs. This technology is inherently error-prone. Human judgment is required, in particular when you apply these tools in critical application areas like healthcare, finance, or law. Nevertheless, LLM-based AI tools can help product managers with most of their tasks.

If you are a product manager of an LLM-based product, you need to be aware of the limitations of the technology and the risks. And you need to take the customers' perspective into account. B2B customers are increasingly aware of the risks that come with AI technology. The large global insurance group Allianz has just published its "Risk Barometer 2026", which is based on a global survey. In B2B customers' view, AI has gone up from #10 to #2 in the list of the most critical risk areas (#1 being cybersecurity risks). For AI, customers see operational risks like business interruption, and in particular errors cascading through automated workflows. They see legal and compliance risks, in particular liability for harmful outcomes. And they see reputational risks, in particular through unethical AI use. If you want your product to be successful, you need to address these customer concerns proactively.

The risk aspects become even more important when the product is intended to be an autonomous AI agent. For more critical application areas, it is better to reduce the level of autonomy and keep the human in the loop for decision-making.
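What "keeping the human in the loop" can mean in practice is sketched below: the agent proposes an action, low-risk actions are executed automatically, and anything above a risk threshold is routed to a person for explicit approval. The structure, the risk score, and the threshold value are illustrative assumptions, not a reference design for any particular product.

```python
# Illustrative sketch of a human-in-the-loop gate for an AI agent.
# Proposed actions are executed automatically only when their estimated
# risk is low; everything else requires explicit human approval.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str   # e.g. "pay invoice #4711 of EUR 250"
    risk_score: float  # 0.0 (harmless) .. 1.0 (critical); how this is estimated is product-specific

RISK_THRESHOLD = 0.3   # illustrative value; critical domains would set this very low

def execute(action: ProposedAction) -> None:
    print(f"Executing: {action.description}")

def run_with_human_in_the_loop(action: ProposedAction,
                               ask_human: Callable[[ProposedAction], bool]) -> None:
    if action.risk_score <= RISK_THRESHOLD:
        execute(action)                      # low risk: act autonomously
    elif ask_human(action):                  # higher risk: a human decides
        execute(action)
    else:
        print(f"Rejected by human: {action.description}")

# Usage example with a console prompt as the approval channel.
if __name__ == "__main__":
    payment = ProposedAction("pay invoice #4711 of EUR 250", risk_score=0.7)
    run_with_human_in_the_loop(payment,
                               lambda a: input(f"Approve '{a.description}'? [y/N] ") == "y")
```

The design choice here is simple: autonomy is a dial, not a switch. The lower you set the threshold, the more decisions stay with humans, which is exactly the trade-off a product manager has to make consciously for critical application areas.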

As a product manager, you also need to influence and monitor what software developers are doing. LLMs can give software development a significant productivity increase. One approach is called Vibe Coding: software code is generated from natural language descriptions. ISPMA Fellow Frédéric Pattyn has just published the book "The Vibe Coding Trap". In it he states that product managers can succeed when they treat Vibe Coding as a learning amplifier, but not as a delivery guarantee. With the increased speed of development, decision-making becomes the bottleneck. Teams can gain remarkable improvements in time to value, but they pay with significant issues regarding correctness, completeness, maintainability, and understandability. The price for the amazing speed can easily be significant technical debt that backfires later. The book gives practical advice on how to deal with these challenges.

In November 2025, Andrew Ng, professor of computer science at Stanford and former head of AI at Google, gave career advice in AI to his computer science students. He stated that the ratio of product managers to software engineers is changing. Historically, it was 1 to 8. It is currently moving towards 1 to 4, with the prospect of 1 to 1. So product management becomes the bottleneck. If we manage to teach software engineers product management, maybe only one person will be needed in the future. But Andrew Ng himself reported that he failed at Google when he tried to turn software engineers into product managers.

Product management education needs to focus on the use of LLM-based AI, in particular on assessing whether an AI tool and its underlying data are applicable to a question or problem, and on judgment, i.e. the ability of the product manager to decide whether the output of the tool is reasonable, realistic, and useful. The second focus area is the management of LLM-based products, in particular the business perspective, LLM technology, data science, and risk assessment and management.

Globally, companies have significantly reduced their hiring of juniors over the last two years. That has happened under the assumption that junior tasks will be done by AI as a team member in the future, with the required judgment coming from senior experts. This is obviously not a sustainable approach: where should future senior experts come from if companies do not hire juniors today? My recommendation is: start hiring juniors again and develop them into knowledgeable and responsible users of AI tools. Make them work with senior experts on the judgment part so that they can learn. And establish governance rules for the company that help reduce AI-induced risks.

To summarize: LLM-based AI products can generate significant productivity increases, but they come with severe limitations and risks. Human judgment is required. Human beings are needed more than ever. That is actually good news, for product managers and everybody else.

References:

Gartner AI Hype Cycle 2025: https://www.gartner.com/en/newsroom/press-releases/2025-08-05-gartner-hype-cycle-identifies-top-ai-innovations-in-2025

Sangeet Paul Choudary: Reshuffle – Who Wins When AI Restacks the Knowledge Economy, 2025: https://platforms.substack.com/p/reshuffle-my-next-book-is-now-available

Sourav Banerjee, Ayushi Agarwal & Saloni Singla (United We Care, Los Angeles): LLMs Will Always Hallucinate, and We Need to Live with This, in: Arai, K. (eds): Intelligent Systems and Applications. IntelliSys 2025. Lecture Notes in Networks and Systems, vol 1554. Springer August 2025; preprint https://arxiv.org/pdf/2409.05746

Michał P. Karpowicz (Samsung AI Center, Poland): On the Fundamental Impossibility of Hallucination Control in Large Language Models, June 2025; https://arxiv.org/pdf/2506.06382

Vectara Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard/

Allianz Risk Barometer 2026: https://commercial.allianz.com/content/dam/onemarketing/commercial/commercial/reports/allianz-risk-barometer-2026.pdf

Frédéric Pattyn: The Vibe Coding Trap: How AI Accelerates Delivery – and Quietly Breaks Responsibility, 2026: https://www.amazon.com/dp/B0GGC668P5/

Andrew Ng: Career Advice in AI, Nov. 2025: https://www.youtube.com/watch?v=AuZoDsNmG_s (starting at 5:30)

Special thanks to Prof. Rahul De’ and ISPMA Fellows Frédéric Pattyn, Vibhuti R. Singh, Gerald Heller, and Andrey Saltan for helpful feedback and discussions.

The content of this article is the subject of my presentation at ISPMA’s SPM Summit India 2026 at IIM Bangalore on Feb. 21, 2026: https://spmsummit.org/india

For my training courses, consulting and my books on software product management, see www.innotivum.com.