Everyone’s talking about agents, but most of the conversation fixates on the model: which frontier LLM is newest, smartest, or cheapest. Matt Neligeorge, Bobsled’s Chief Scientist, argues the real leverage is in the scaffolding around the model: the harness, the context layer, and the systems that let an agent learn from its own usage.
We sat down with Matt to talk through why agentic analytics is uniquely hard, why a good harness beats a better model, and what Bobsled is learning as it builds agents that get better every time someone uses them. The conversation has been edited for length and clarity.
Let’s start with the hard part. Why is agentic analytics so difficult?
Agentic analytics is hard because analytics is hard. Analytics writ large involves tackling really open-ended, challenging problems. A question like “how many sales did we have last month?” is pretty straightforward — it doesn’t take much analysis to collect that data and put it in a table or a chart. But analytics doesn’t stop there. There’s a much broader spectrum of complexity, nuance, and ambiguity in the questions people actually have, where it’s often hard to even verify: was this correct?
It also involves a lot of different probes for information. Where do you get the context to answer a question? It could come from the data itself, by running one query or many. It could come from documentation or metadata: what does this data mean? How does this table’s list of users differ from that other table’s list of customers? There’s often a complex variety of data and metadata sources that need to be brought in, and it can be a mix of structured and unstructured data. Keeping all of that working smoothly is a lot of ongoing complexity: you have to build search that retrieves the right information at the right time, and verify that every step in the chain is working.
There’s a lot of noise about each new model release. But you’ve said the harness matters more than the model. Why?
An agent today running a model from 15 months ago, like Sonnet 3.7, would outperform a simple linear pipeline or one-shot LLM call using the latest models like Opus 4.7. That’s because an agent, even in a simple graph structure, is so much more capable than a one-off call or a prescriptive workflow that says: LLM call, branch on a condition, another LLM call, print the result.
Good analysis usually requires iteration — proposing counterfactuals, challenging your assumptions, bringing in caveats. All of that is hard to capture in a single plan up front that lays out every step you’ll take and then declares you’re done. Complex problem-solving is iterative by nature: you try something, see if it works, try something different, see if it works better. That’s core to the scientific method, or any agile process. It’s fundamentally different from a static, plan-everything-ahead workflow, where you don’t see nearly as much benefit from the loop.
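To make the contrast concrete, here is a minimal sketch of a one-shot call next to an agent loop. Everything in it is an assumption made for illustration: `TABLES` is a toy stand-in for a database, `execute` is a pretend tool call, and `propose` is a stub where a real system would call a model. It is not Bobsled's harness, only the shape of the loop described above.

```python
import ast
from dataclasses import dataclass
from typing import List

# Hypothetical toy "database": table name -> row count.
TABLES = {"orders": 42}


@dataclass
class Step:
    action: str       # what was tried (here, a table name to query)
    observation: str  # what came back


def execute(table: str) -> str:
    """Toy tool call: return a row count, or an error listing the valid tables."""
    if table in TABLES:
        return f"ok: {TABLES[table]} rows"
    return f"error: no table {table!r}; known tables: {sorted(TABLES)}"


def propose(question: str, history: List[Step]) -> str:
    """Stub for a model call: guess first, then read the last error message."""
    if not history:
        return "sales"  # plausible but wrong first guess
    last = history[-1].observation
    marker = "known tables: "
    if marker in last:
        # Parse the list of table names out of the error text and pick the first.
        return ast.literal_eval(last.split(marker, 1)[1])[0]
    return "sales"


def run_one_shot(question: str) -> str:
    """Linear pipeline: one proposal, one execution, no chance to react."""
    return execute(propose(question, []))


def run_agent(question: str, max_steps: int = 5) -> List[Step]:
    """Agent loop: each proposal sees every earlier action and observation."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = propose(question, history)
        history.append(Step(action, execute(action)))
        if history[-1].observation.startswith("ok"):
            break
    return history


if __name__ == "__main__":
    print(run_one_shot("How many sales did we have last month?"))
    for step in run_agent("How many sales did we have last month?"):
        print(step)
```

The same stub "model" fails in the one-shot path and succeeds in the loop, because only the loop gets to see the error and try again.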
Once the initial “wow” wears off, what do people actually complain about?
As people acclimate to working with agents, there’s usually an initial wow moment — “this can do more than I thought it could.” But the first couple of complaints tend to go in two directions. One is: “I told you yesterday this was wrong, and you’re making the same mistake again today — why can’t you remember that?” It’s frustrating to work with someone who makes the same mistake twice, especially after you’ve corrected them. The second is the opposite: “You did this thing yesterday that I really liked — why aren’t you doing it the same way today?” Sometimes you want the agent to learn and change, and sometimes you want it to be repeatable.
Those are two avenues we’ve been pursuing. The first is learnings and background agents — addressing that case of not making the same mistake twice, and improving with usage. Every interaction between an agent and a human is an opportunity for the agent to learn. Right now we’re focused on using background agents and learnings to improve the context layer, so that subsequent conversations have better context about the data and the problem.
The other side is repeatability. You co-develop a workflow, or you have a business process you just want the agent to run the same way next week — the same usage report, done exactly the same way. That’s what we’re building into workflows: a stricter recipe, so when the agent is asked to do something, it follows those steps the same way every time, instead of starting fresh and re-exploring how it solved it last time.
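One way to picture a "stricter recipe" is a stored list of parameterized steps that gets replayed as-is. The sketch below is an assumption for illustration, not Bobsled's workflow format: the `Workflow` and `WorkflowStep` classes, the `events` table, and the sample rows are all hypothetical, and it uses Python's built-in `sqlite3` module so it runs on its own.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WorkflowStep:
    name: str
    sql: str  # fixed SQL with named placeholders such as :week


@dataclass(frozen=True)
class Workflow:
    name: str
    steps: Tuple[WorkflowStep, ...]

    def run(self, conn: sqlite3.Connection, params: Dict[str, str]) -> Dict[str, List[tuple]]:
        """Run every stored step in order; nothing is re-planned or re-explored."""
        return {step.name: conn.execute(step.sql, params).fetchall() for step in self.steps}


# Hypothetical usage report, frozen once it has been co-developed with a user.
usage_report = Workflow(
    name="weekly_usage_report",
    steps=(
        WorkflowStep(
            "active_users",
            "SELECT COUNT(DISTINCT user_id) FROM events WHERE week = :week",
        ),
        WorkflowStep(
            "feature_usage",
            "SELECT feature, COUNT(*) AS uses FROM events WHERE week = :week "
            "GROUP BY feature ORDER BY uses DESC, feature",
        ),
    ),
)

# Toy in-memory data so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, feature TEXT, week TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "export", "2024-W01"),
        (2, "export", "2024-W01"),
        (2, "chart", "2024-W01"),
        (3, "chart", "2024-W02"),
    ],
)

if __name__ == "__main__":
    print(usage_report.run(conn, {"week": "2024-W01"}))
```

Because the steps are data rather than something the agent rediscovers, running the report for next week only changes the `week` parameter, not the method.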
You ran a benchmark to measure whether that learning system actually works. What did you find?
We wanted to assess what impact the learning system and background agents were having on the end result — the work the analysis agent actually does. So we set up a test: a benchmark set of analytics questions on a dataset, scored across a few different cases.
In the first case, the agent just has access to the raw data. It does really well on simple questions, but as you get to more complex, nuanced, ambiguous questions, the scores drop. Hard questions are still hard for agents; simple ones they handle really well right now. So the interesting space is those harder questions — how do we improve them?
In the second case, we gave the agent the raw data plus our generated context layer — what you get when you onboard a dataset and talk to a context agent: a semantic model, data dictionaries, relationships, metrics, descriptions of what the data means. That’s usually done up front, as a one-off, at onboarding.
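As a rough picture of what such a context layer might hold, here is a small sketch. The field names, the `to_prompt` method, and the example tables are assumptions chosen for illustration; the interview does not describe the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextLayer:
    tables: Dict[str, str] = field(default_factory=dict)    # table -> description
    columns: Dict[str, str] = field(default_factory=dict)   # "table.column" -> meaning
    relationships: List[str] = field(default_factory=list)  # join paths, as text
    metrics: Dict[str, str] = field(default_factory=dict)   # metric -> SQL definition

    def to_prompt(self) -> str:
        """Render the context as text that could be prepended to an agent prompt."""
        sections = [
            ("Tables", [f"{k}: {v}" for k, v in self.tables.items()]),
            ("Columns", [f"{k}: {v}" for k, v in self.columns.items()]),
            ("Relationships", list(self.relationships)),
            ("Metrics", [f"{k} = {v}" for k, v in self.metrics.items()]),
        ]
        lines: List[str] = []
        for title, entries in sections:
            if entries:  # skip sections with nothing in them
                lines.append(f"{title}:")
                lines.extend(f"  - {entry}" for entry in entries)
        return "\n".join(lines)


# Hypothetical example: the users-versus-customers ambiguity spelled out.
context = ContextLayer(
    tables={
        "users": "One row per login account",
        "customers": "One row per paying organization",
    },
    columns={"users.customer_id": "The organization this account belongs to"},
    relationships=["users.customer_id = customers.id (many users per customer)"],
    metrics={"active_customers": "COUNT(DISTINCT users.customer_id)"},
)

if __name__ == "__main__":
    print(context.to_prompt())
```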
In the third case, we simulated usage. We asked a bunch of questions and had our background agents review what happened — how SQL queries failed, how long things took to run — and use those signals to improve the context layer. There we started to see significant improvements in overall accuracy, especially on the harder questions. The background agents can pick up things like, “this question got tripped up by a mistaken join,” or “there were duplicates in this dataset that weren’t caught earlier” — and store a learning in the context layer so we don’t make that mistake next time. We’re seeing that reflected in quantitative accuracy scores, and in some cases reduced latency when the data modeling is done well. It’s a really promising sign for the value of what we’re building in this learning system.
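The review step can be sketched as a pass over logged query runs that turns failures and slow queries into stored notes. This is a deliberately simple, rule-based stand-in: in the system described above the reviewer is a background agent, and the `QueryRun`, `LearningStore`, and `review` names, the 30-second threshold, and the sample runs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryRun:
    question: str
    sql: str
    seconds: float
    error: Optional[str] = None  # None means the query succeeded


class LearningStore:
    """Hypothetical slice of a context layer that holds notes for future runs."""

    def __init__(self) -> None:
        self._notes: List[str] = []

    def add(self, note: str) -> bool:
        if note in self._notes:  # do not store the same learning twice
            return False
        self._notes.append(note)
        return True

    def notes(self) -> List[str]:
        return list(self._notes)


def review(runs: List[QueryRun], store: LearningStore, slow_seconds: float = 30.0) -> int:
    """Scan past runs for failures and slow queries; return how many notes were added."""
    added = 0
    for run in runs:
        if run.error is not None:
            note = f"Avoid: {run.sql!r} failed with {run.error!r}"
        elif run.seconds > slow_seconds:
            note = f"Slow: {run.sql!r} took {run.seconds:.0f}s"
        else:
            continue  # nothing to learn from a fast, successful run
        if store.add(note):
            added += 1
    return added


# Hypothetical log of simulated usage; the same failing query appears twice.
bad_join = "SELECT region, SUM(amount) FROM orders JOIN regions USING (id)"
runs = [
    QueryRun("Revenue by region?", bad_join, 1.2, "no such column: id"),
    QueryRun("Revenue by region?", bad_join, 1.1, "no such column: id"),
    QueryRun("How many events?", "SELECT COUNT(*) FROM events", 45.0),
    QueryRun("How many orders?", "SELECT COUNT(*) FROM orders", 0.3),
]

if __name__ == "__main__":
    store = LearningStore()
    print(review(runs, store))
    for note in store.notes():
        print(note)
```

The notes would then be included in the context for later conversations, which is the mechanism by which a repeated mistake can be avoided the next time.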
Where does this go next?
We started with data sharing: making sure people have access to the right data. Analytics agents are the next step. They lower the barrier to asking questions and getting answers, which helps more people get value out of that data.
As we think about learning as a system, we’re looking at the full stack, from data production to where it gets used — and that feedback cycle can go pretty far into the internals of the data pipeline and lineage. A lot of what we’ve been thinking about recently is the wider set of tasks upstream of the analysis, in the data engineering space, that can be automated and optimized. As we see real usage and get feedback from the end use case, can we improve the speed, quality, and reliability of all the upstream pipelines, from the source to the consumer?
So far we’ve focused mostly on the context layer. We’re starting to look closer to the data layer — data models, data quality monitoring. Where we haven’t focused much attention yet is the agent layer itself: improving prompts, and improving the agent’s behavior. The goal is to have the full stack — from data, to context, to the reasoning layer — all being evaluated and enhanced through usage.
Want to see what agentic analytics looks like on your own data? Talk to our team about putting Bobsled to work.

