AI in Pharma

Why Real-World Evidence Needs More Than a Chatbot


Nikhil Tiwari

6 min read

Many pharma teams have tried ChatGPT, Claude, and AI copilots for real-world evidence (RWE). The experience is usually the same: the model is impressive, the draft looks good, and then the trust breaks.

That is the real problem.

Life sciences teams do not need another demo where an AI system sounds fluent. They need evidence they can defend. They need to know where a statement came from, what assumptions were made, which data was used, what changed during review, and whether the final answer can survive scientific, regulatory, and payer scrutiny.

General-purpose LLMs can help with parts of that work. They can read dense documents, summarize protocols, draft code, explain methods, clean up writing, and turn complicated analysis into clearer language. But real-world evidence is not just a writing task. It is a trust task.

That is why ChatGPT and Claude are not enough on their own.

The issue is not whether LLMs are impressive. The issue is whether a pharma team can trust, review, and defend what they produce.

Pharma teams have already felt the limits

Most people working in RWE, HEOR (health economics and outcomes research), medical affairs, safety, or clinical development have had the same experience by now.

You ask an LLM to help with a study question. It gives you a confident answer. You ask for citations. Some are useful, some are weak, and some need to be checked line by line. You ask it to refine the analysis. It gives you something plausible, but you are not fully sure whether it preserved the original assumptions. You ask it to draft the next step. It may produce a good starting point, but the subtle logic still needs human review.

Then the conversation gets long. The model starts forgetting earlier details. It mixes versions of the study question. It sounds equally confident when it is right and when it is guessing. The output looks finished before the work is actually defensible.

That is context rot. In a normal business workflow, it is annoying. In real-world evidence, it is dangerous.

RWE studies depend on small details: population, time window, comparator, outcome, follow-up, missingness, assumptions, and interpretation. If any of those drift, the answer can change. A model that loses track of the study can create work that looks polished but is scientifically fragile.
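To make that concrete, here is one way to pin those details so drift becomes detectable instead of silent. This is a minimal Python sketch with illustrative field names and values, not Frekil's actual data model:

```python
# A minimal sketch of the idea, not a prescription: pin the study
# definition in one immutable structure so any drift is detectable.
# All field names and example values here are illustrative.
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StudySpec:
    population: str   # e.g. "adults newly initiating drug class A"
    time_window: str  # e.g. "2018-01-01 to 2023-12-31"
    comparator: str   # e.g. "new users of drug class B"
    outcome: str      # e.g. "hospitalization within 12 months"
    follow_up: str    # e.g. "12 months, censored at disenrollment"
    missingness: str  # e.g. "complete-case for baseline covariates"

    def fingerprint(self) -> str:
        """Stable hash of the spec; if this changes, the study changed."""
        raw = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()[:12]


spec = StudySpec(
    population="adults newly initiating drug class A",
    time_window="2018-01-01 to 2023-12-31",
    comparator="new users of drug class B",
    outcome="hospitalization within 12 months",
    follow_up="12 months, censored at disenrollment",
    missingness="complete-case for baseline covariates",
)
print(spec.fingerprint())  # stamp this onto every downstream output
```

Once the specification is frozen and fingerprinted, every downstream output can carry the fingerprint it was produced against, and a mismatch is a signal to stop and review rather than something a long conversation quietly absorbs.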

Hallucination is the other obvious failure. LLMs are trained to produce likely language, not verified truth. In life sciences, a plausible-sounding claim is not enough. A wrong endpoint, a loose citation, an invented statistic, or an overconfident clinical statement can damage trust quickly.

Pharma teams do not reject AI because they are conservative. They reject AI when it asks them to trust a black box in a workflow where accountability matters.

Why many AI products have disappointed

The first generation of AI products in this space often made the same mistake: they treated the LLM as the product.

The pitch was simple. Upload documents. Ask questions. Generate outputs. Move faster.

That can be useful for search, summaries, first drafts, and internal productivity. But it does not solve the harder problem of evidence generation. The hard part is not making text appear on a page. The hard part is proving that the text is grounded, the analysis is reproducible, and the assumptions are visible.

Pharma teams have seen tools that can summarize papers but cannot explain whether the underlying evidence applies to a specific decision. They have seen copilots that can generate code but cannot guarantee the analysis logic matches the protocol. They have seen chat interfaces that answer confidently but do not leave behind the audit trail a reviewer needs.

The result is predictable. The tool gets used for low-risk work, but the critical RWE workflow stays manual. Teams still fall back on long documents, spreadsheets, meetings, outsourced analysis, and repeated review cycles because those methods, slow as they are, are familiar and leave a trail people can inspect.

That is the bar AI has to meet.

What LLMs should do

We believe LLMs are useful when they are grounded, constrained, and reviewed.

They should help teams reason through a study question. They should surface ambiguity early. They should turn a messy starting point into a clearer protocol draft. They should translate dense scientific inputs into a narrative that different stakeholders can understand. They should help write and test code when the expected behavior is explicit. They should make review easier by showing what changed and why.

In other words, LLMs should be accelerators for expert work.
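Take the point about code and explicit expected behavior. A minimal sketch of what that can look like in practice: the team writes the test first, and the model's draft only counts once it passes. The follow_up_days helper below is hypothetical, named here purely for illustration:

```python
# Expected behavior made explicit: the tests define what "correct"
# means before any model writes a line of analysis code.
# follow_up_days is a hypothetical helper, not an existing library call.
from datetime import date


def follow_up_days(index: date, event: date | None, censor: date) -> int:
    """Days from the index date to the earlier of event and censoring."""
    end = min(event, censor) if event is not None else censor
    return (end - index).days


def test_event_before_censoring():
    # an event on day 60 ends follow-up even though censoring is later
    assert follow_up_days(date(2020, 1, 1), date(2020, 3, 1), date(2020, 12, 31)) == 60


def test_no_event_uses_censor_date():
    # no event: follow-up runs to the censoring date
    assert follow_up_days(date(2020, 1, 1), None, date(2020, 1, 31)) == 30
```

With the behavior written down like this, a reviewer is checking the model's draft against an agreed definition instead of against their memory of a conversation.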

They are especially valuable when they help humans see the structure of a problem. What decision is this evidence meant to support? What needs to be true for the answer to be trusted? Which assumptions are still unresolved? Which claims need a source? Which parts of the workflow require clinical or statistical judgment?

That is where language models are strong. They can organize complexity. They can make implicit decisions explicit. They can turn a dense evidence package into something easier to inspect.

What LLMs should not do

LLMs should not invent medical facts. They should not rely on memory for citations. They should not quietly change the research question halfway through a workflow. They should not decide what the analysis should have been after seeing the result. They should not turn an uncertain finding into a confident conclusion.

They also should not replace the people responsible for the evidence.

Real-world evidence requires judgment. Clinical context matters. Data quality matters. Statistical assumptions matter. Regulatory expectations matter. Payer interpretation matters. A system that hides those decisions behind a friendly chat interface is not making the work safer. It is making the risk harder to see.

The right role for AI is not "autonomous analyst." The right role is a governed copilot: fast, useful, grounded, reviewable, and honest about uncertainty.

What we are building at Frekil

This is what we do at Frekil.

We are building AI infrastructure for real-world evidence that starts from a simple belief: pharma teams will only trust AI when the work is inspectable.

That means the system has to understand the difference between drafting and deciding. It has to know when to help write, when to retrieve evidence, when to ask for clarification, when to preserve a decision, and when to route something back to a human expert.

It also means the workflow cannot depend on one long chat that tries to remember everything. Evidence work needs structure. A study question should be carried forward carefully. Assumptions should remain visible. Reviewers should be able to see how an output was produced. Changes should not disappear into a conversation history.
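As an illustration of that last principle, here is a minimal sketch of an append-only decision log. It is not Frekil's implementation, just the shape of the idea: changes are recorded with their rationale and owner, and the history itself is never edited.

```python
# Illustrative only: decisions are appended, never overwritten, so a
# reviewer can replay how an output was produced.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Decision:
    field: str        # which part of the study changed, e.g. "comparator"
    old: str
    new: str
    rationale: str    # why it changed
    approved_by: str  # the human accountable for the change
    at: datetime


log: list[Decision] = []


def record(field: str, old: str, new: str, rationale: str, approved_by: str) -> None:
    """Append a decision; existing entries are never modified."""
    log.append(Decision(field, old, new, rationale, approved_by,
                        datetime.now(timezone.utc)))


record("comparator", "drug class A", "drug class B",
       "protocol amendment after feasibility review", "lead epidemiologist")

for d in log:  # a reviewer can read the trail end to end
    print(f"{d.at:%Y-%m-%d} {d.field}: {d.old!r} -> {d.new!r} "
          f"({d.rationale}; approved by {d.approved_by})")
```

The point is not the data structure. The point is that every material change has a visible owner and a visible reason, which is exactly what a chat transcript fails to provide.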

Frekil is designed around those principles. We use AI to speed up the repetitive and coordination-heavy parts of RWE while keeping expert review at the center. The goal is not to make evidence look automated. The goal is to make evidence work faster, cleaner, and easier to trust.

We are not trying to replace the RWE expert. We are building the system that lets experts move faster without losing control of the evidence.

The trust layer matters

In pharma, trust is not a brand claim. It is an operating requirement.

A clinical team needs to trust that the study question did not drift. A biostatistician needs to trust that the analysis matches the plan. A regulatory lead needs to trust that claims are supported. A payer team needs to trust that real-world results are not overextended. A reviewer needs to understand what changed between versions.

This is why generic AI feels insufficient. It can be brilliant in a moment and unreliable across a workflow. It can produce a beautiful paragraph while hiding weak assumptions. It can help a single person move faster but still fail to create the shared trail a cross-functional team needs.

RWE is collaborative by nature. The work moves across clinical, epidemiology, statistics, data, regulatory, medical, HEOR, and commercial teams. Each group needs a different view of the same evidence. AI has to support that collaboration, not flatten it into a chat transcript.

That is the opportunity.

The future is not one magic model

The future of AI in RWE will not be one model that knows everything and does everything.

It will be systems that know their limits. Systems that use language models where language models are strong. Systems that connect to verified sources instead of guessing. Systems that preserve decisions. Systems that keep humans in the loop for the parts that require judgment. Systems that make the evidence trail easier to review, not harder.

This is the direction we believe the field is moving. More teams will use LLMs, but the winning products will not be the ones with the flashiest chat interface. They will be the ones that pharma teams can actually trust inside high-stakes evidence workflows.

That is what we are building at Frekil: AI for RWE that respects the work, respects the experts, and understands that in evidence generation, speed only matters if the answer can be defended.