The MQL Math Problem: How Point Values Get Invented and Why the Thresholds Are Arbitrary


Picture the meeting. Someone from marketing pulled up the slide. "We generated 847 MQLs this quarter." Sales nodded politely, the way people do when they've already decided not to believe something. "How many converted?" Pause. "We may need to revisit the model."
The model doesn't get revisited.
Here is what that number actually represents: 87% of MQLs never become sales opportunities. The model has been running for years. The conversion rate has been roughly the same. What changes, quarter to quarter, is how confidently the deck gets presented.
There are two structural reasons the MQL model is broken, and neither of them is "you configured it wrong." The weights going in are guesswork. The threshold that produces the number coming out is a headcount limit. This piece breaks down both, and explains why no amount of tuning fixes a model built on those foundations.
Lead scoring isn't complicated in theory. Marketing assigns point values to actions: a pricing page visit might be worth 20 points, a webinar attendance worth 15, an email click worth 5. When a lead crosses a set threshold (say, 70 points) they get routed to sales as an MQL.
Clean. Logical. Completely reasonable-looking on a slide.
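To make the mechanics concrete, here is a minimal sketch in Python. The point values and the 70-point threshold are the examples above. The event names, the function names, and the choice to treat "crossing" the threshold as greater-than-or-equal are assumptions for illustration, not any vendor's implementation.

```python
# Point values from the examples in the text; event names are hypothetical.
POINTS = {
    "pricing_page_visit": 20,
    "webinar_attendance": 15,
    "email_click": 5,
}
MQL_THRESHOLD = 70  # the example threshold from the text


def score_lead(events):
    """Sum the point value of every tracked event; unknown events score 0."""
    return sum(POINTS.get(event, 0) for event in events)


def is_mql(events, threshold=MQL_THRESHOLD):
    """Assumption: a lead is an MQL once its total meets or exceeds the threshold."""
    return score_lead(events) >= threshold


# One pricing visit, two webinars, four email clicks: 20 + 30 + 20 = 70.
lead_events = (
    ["pricing_page_visit"]
    + ["webinar_attendance"] * 2
    + ["email_click"] * 4
)
print(score_lead(lead_events), is_mql(lead_events))
```

Note what the function takes as input: a list of event names. Nothing else about the lead reaches the score.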
The problem is what happens in the room where those numbers get decided.
Here's what that scoring meeting actually looks like.
Someone opens a blank spreadsheet. Someone else suggests that a pricing page visit should be worth more than a blog post. A third person mentions they read that webinar attendance is a strong intent signal. Agreement happens. Numbers get typed in. The model goes live.
As one RevOps practitioner described it: somebody built the model eighteen months ago. They assigned points based on a mix of intuition, sales feedback, and whatever the marketing team believed about their ICP at the time. The scores went live, MQL thresholds were set, and then everyone moved on to the next fire.
That's not a cautionary tale about one bad team. That's the modal experience.
No closed-won data was analyzed. No correlation between these actions and actual revenue was validated. The weights are the team's collective opinion about what should matter, dressed up in a spreadsheet that looks like math.
Industry benchmarks make this worse, not better. Borrowing scoring weights from vendor blog posts means applying values calibrated to someone else's buyer population, product category, and sales cycle. A webinar might be a genuine buying signal for one company and a content-consumption habit for another. The model doesn't know the difference. It just adds points.
Actions alone do not reveal intent. That's the structural flaw most scoring models never address.
Point values get assigned to actions, not to intent states. A whitepaper download is an action. It can represent a researcher gathering competitive intelligence, a champion building an internal business case, a student writing a paper, a competitor mapping your positioning, or a genuine buyer evaluating your product. The model assigns the same point value to all five, because they all produced the same trackable event: a file download.
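A small sketch of that collapse, assuming a hypothetical 10-point value for a whitepaper download. The five intents are the ones listed above; the field names and the `score_event` helper are made up for illustration.

```python
# Hypothetical point value for a whitepaper download (not from any real model).
WHITEPAPER_POINTS = 10

# The five intents named in the text. The scoring model never sees "intent".
INTENTS = [
    "researcher gathering competitive intelligence",
    "champion building an internal business case",
    "student writing a paper",
    "competitor mapping your positioning",
    "genuine buyer evaluating your product",
]
visitors = [{"intent": intent, "event": "whitepaper_download"} for intent in INTENTS]


def score_event(event):
    """Only the event name is an input, so intent cannot change the score."""
    return WHITEPAPER_POINTS if event == "whitepaper_download" else 0


scores = {v["intent"]: score_event(v["event"]) for v in visitors}
distinct_scores = set(scores.values())
print(distinct_scores)  # one value: every intent received the same points
```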
The same intent-blindness applies to every signal in a standard model. When every signal suffers from it, the score that results is not a measure of purchase readiness. It's a measure of interaction frequency with your marketing content. Those are different things.
The practical consequence shows up in conversion data. Webinar-sourced leads, which most models score generously, convert to opportunities at just 17.8% on average. Event leads convert at 4.2%. Email campaign leads, which accumulate points for every click, convert at 0.9%.
The model is adding points to actions that don't predict buying.
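For scale, here is the arithmetic on those three rates, as a rough sketch. The rates are the ones quoted above; the 1,000-lead batch size and the names are assumptions chosen only to make the numbers easy to read.

```python
# Lead-to-opportunity conversion rates quoted in the text.
CONVERSION_RATES = {
    "webinar": 0.178,
    "event": 0.042,
    "email_campaign": 0.009,
}

BATCH_SIZE = 1000  # hypothetical batch, chosen for readability


def expected_opportunities(leads, rate):
    """Expected number of opportunities, rounded to whole leads."""
    return round(leads * rate)


per_batch = {
    source: expected_opportunities(BATCH_SIZE, rate)
    for source, rate in CONVERSION_RATES.items()
}
print(per_batch)  # {'webinar': 178, 'event': 42, 'email_campaign': 9}
```

Out of 1,000 email campaign leads, each of which may have accumulated points for every click, about 9 become opportunities.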
One more pattern worth knowing: prospects who don't convert actually ask more questions than those who do. Curiosity and commitment aren't the same thing. A scoring model can't tell the difference. It just sees activity and adds points.
The standard fix for a low-accuracy scoring model is to improve the scoring. Weight the pricing page higher than the blog post. Add recency decay so an old whitepaper download stops accumulating score. Layer in firmographic fit to build a composite.
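A minimal sketch of two of those fixes, recency decay and a firmographic composite. Everything specific here is an assumption: the 30-day half-life, the exponential decay form, the 0-to-1 fit value, and combining the two by multiplication are illustrative choices, not a recommended calibration.

```python
HALF_LIFE_DAYS = 30.0  # hypothetical: an event loses half its points every 30 days


def decayed_points(points, age_days, half_life_days=HALF_LIFE_DAYS):
    """Exponential recency decay: full points today, half after one half-life."""
    return points * 0.5 ** (age_days / half_life_days)


def composite_score(events, fit):
    """Behavioral score scaled by firmographic fit.

    events: list of (points, age_days) pairs
    fit:    hypothetical ICP fit between 0.0 (no fit) and 1.0 (perfect fit)
    """
    behavior = sum(decayed_points(points, age) for points, age in events)
    return behavior * fit


# A pricing page visit today (20 points) and a whitepaper download
# 60 days ago (hypothetical 10 points, now worth 10 * 0.25 = 2.5).
events = [(20, 0), (10, 60)]
print(round(composite_score(events, fit=0.8), 2))  # (20 + 2.5) * 0.8 = 18.0
```

The inputs are still events and account attributes. The sketch gets more precise about clicks; it does not gain any input for what the buyer said.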
These improvements are real. B2B SaaS companies using well-calibrated behavioral scoring models achieve MQL-to-SQL conversion rates in the 39-40% range, meaningfully better than basic demographic scoring.
But the ceiling is structural, not calibrational.
Here's what scoring improvements can and cannot do:
You can weight signals more precisely. You can decay stale engagement. You can filter by ICP fit. What you cannot do inside a behavioral scoring model is replace an inference with a statement. No matter how sophisticated the model, the buyer still never said anything. The model is predicting, with increasing accuracy, what a buyer might be thinking based on what they clicked. It is not capturing what the buyer actually said about their intent, timeline, budget, or use case.
The best-calibrated behavioral model in the world produces a sharper guess. It does not produce a documented qualification record.
Closed-won reverse-engineering is the most sophisticated version of MQL calibration. It's still incomplete, and in a specific way that matters.
The logic: analyze the behavioral signals that appeared in your closed-won leads, build a model that weights those signals more heavily, and produce a scoring model trained on actual purchase outcomes. More data-mature teams do this. Most don't.
The problem: the analysis only includes buyers who filled out a form. The buyers who arrived with high intent, didn't convert, and left are invisible in the closed-won data. You cannot reverse-engineer a signal that was never captured.
The model trains on a biased sample: buyers who tolerated forms. It systematically under-weights the intent signals of buyers who don't tolerate forms, who in B2B in 2026 make up an increasingly large share of high-quality buyers. Factors.ai found that 77% of their highest-value conversations happened outside business hours. Those buyers weren't waiting around for a form follow-up sequence to catch up with them.
The result is a scoring model that is well-calibrated for the buyers you've historically captured, and completely blind to the buyers you're currently missing.
Here's where the model's second structural problem lives.
Let's say the scoring weights are sorted. The harder question is: where does 70 come from?
When a company decides "a lead needs to score 70+ points to be an MQL," that number almost never comes from studying which buyers actually close. It comes from a simpler question: how many leads can our SDR team realistically call this month?
Say the team can handle 350 leads a month. At a threshold of 60, the model produces 500 MQLs. At a threshold of 70, it produces 350. So the threshold gets set at 70. Not because 70-point leads are meaningfully more ready to buy than 65-point leads. Because 350 is the number the team can physically handle.
The threshold is doing workload management, not quality filtering.
This is the tell: the threshold moves when your team changes, not when your buyers change.
Hire more SDRs and the threshold drops. Downsize and it goes up. The buyers behaved exactly the same way throughout. Nothing about their readiness changed. Only the team's bandwidth did.
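Here is that logic as a short sketch. The score distribution is invented so that it reproduces the numbers above (500 MQLs at a threshold of 60, 350 at 70); the function names and the 10-point candidate steps are also assumptions.

```python
def mql_count(scores, threshold):
    """Number of leads at or above the threshold."""
    return sum(1 for s in scores if s >= threshold)


def capacity_threshold(scores, monthly_capacity, candidates=range(0, 101, 10)):
    """Lowest candidate threshold whose MQL count fits the team's capacity."""
    for threshold in candidates:
        if mql_count(scores, threshold) <= monthly_capacity:
            return threshold
    return None  # even the highest candidate lets through too many leads


# Invented distribution: 350 leads at 75, 150 at 65, 100 at 55, 200 at 40.
scores = [75] * 350 + [65] * 150 + [55] * 100 + [40] * 200

print(mql_count(scores, 60), mql_count(scores, 70))  # 500 350
print(capacity_threshold(scores, 350))  # 70: the team can call 350 leads
print(capacity_threshold(scores, 500))  # 60: more SDRs, lower threshold
print(capacity_threshold(scores, 200))  # 80: smaller team, higher threshold
```

The `scores` list is identical in all three calls. Only `monthly_capacity` changes, and the threshold moves with it.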
One practitioner on r/b2bmarketing described their threshold plainly: "The MQL threshold you have set essentially indicates that a person has an email address and hasn't unsubscribed right away." That's not snark. It's an accurate description of what a low-calibration scoring model actually selects for.
Only 27% of marketing-generated leads ever get contacted at all. That's not a quality problem. It's a volume management problem. The threshold was set to protect the team's calendar, and even then, nearly three-quarters of the leads never get a call.
The most useful reframe isn't about scoring weights. It's about what kind of signal you're even capable of capturing inside a behavioral model.
Rule-based scoring reflects buyer behavior at the moment it was built, and then it stops learning. ICP shifts. Products expand. Competitive dynamics change. The personas buying today may not look like the ones buying 18 months ago. None of this updates the model automatically. Someone has to go back in and rebuild the weights. Most teams don't. Best practice guidance recommends reviewing scoring models every three to six months; the fact that this has to be recommended implies that it's perpetually on the roadmap and never quite on the calendar.
What actually separates a buyer from a browser isn't the volume of activity. It's what they said inside it.
Across 4,736 real buyer conversations, the widest behavioral gap in the dataset was here: in conversations that ended with a qualified contact capture, 91% included a concrete next step. In conversations that didn't convert, that number dropped to 13%.
The difference wasn't how much activity happened before the conversation. It was whether the conversation produced forward motion. That's something a scoring model cannot measure by design. It counts events. It doesn't understand what happened inside them.
The same dataset found that the deepest 12% of conversations, those lasting five minutes or longer, generated 30% of all captured pipeline. Volume and depth don't move together.
For a scoring model to approach this kind of signal fidelity, it would need to capture not just what a buyer clicked but what they said. A behavioral model can't do that. That's the ceiling, and it's structural.
Qualification is not a score. It's a conclusion drawn from a conversation.
The structural difference matters. A scoring model measures events and infers intent from their accumulation. A conversation-based qualification process asks and adapts. It surfaces what the buyer is actually trying to solve, identifies the constraints that will determine whether a deal closes, and handles the objection that would otherwise send them quietly to a competitor. Those are fundamentally different operations. Tuning the scoring model doesn't close the gap between them.
For a threshold to reflect actual purchase intent rather than engagement volume, it needs to be anchored to something the buyer said or confirmed, not just something they clicked.
The practical version of this, short of any tooling change: run a closed-won analysis on a narrow dataset. Pull the last 30 closed deals. For each one, identify the first moment in the record where the buyer's stated intent was captured. Not a page visit. A response to an SDR email, a question they asked on a call, a note from the first conversation.
Then ask: what did that buyer say or ask that the average MQL did not?
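If the deal records can be exported, the first pass of that exercise can be sketched in a few lines. The record layout, the `kind` labels, and the sample deals below are all hypothetical; a real CRM export will look different and will need its own mapping.

```python
from datetime import date

# Hypothetical labels for records where the buyer said something,
# as opposed to records where they only clicked something.
STATED_INTENT_KINDS = {"sdr_email_reply", "call_question", "conversation_note"}


def last_closed_deals(deals, n=30):
    """Most recently closed deals first, capped at n."""
    return sorted(deals, key=lambda d: d["closed_on"], reverse=True)[:n]


def first_stated_intent(records):
    """Earliest record where the buyer stated something; None if there is none."""
    stated = [r for r in records if r["kind"] in STATED_INTENT_KINDS]
    return min(stated, key=lambda r: r["on"]) if stated else None


# Made-up sample data, for illustration only.
deals = [
    {"name": "Deal A", "closed_on": date(2025, 3, 1), "records": [
        {"on": date(2025, 1, 5), "kind": "page_visit", "text": "pricing page"},
        {"on": date(2025, 1, 9), "kind": "sdr_email_reply",
         "text": "We need this live before Q3."},
        {"on": date(2025, 1, 20), "kind": "call_question",
         "text": "Does it integrate with our CRM?"},
    ]},
    {"name": "Deal B", "closed_on": date(2025, 2, 1), "records": [
        {"on": date(2025, 1, 2), "kind": "page_visit", "text": "blog post"},
    ]},
]

findings = {d["name"]: first_stated_intent(d["records"]) for d in last_closed_deals(deals)}
for name, record in findings.items():
    print(name, "->", record["text"] if record else "no stated intent on record")
```

The code only finds the moment. Reading what was said there, and comparing it with the average MQL, is still a manual step.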
For most teams, the answer is concrete. The buyers who closed had already articulated a specific use case, a timeline, or named evaluation criteria before the SDR even called. The buyers who didn't close arrived as a name and a score, and the qualification had to start from zero.
That's the threshold problem. And it's one the point model cannot solve, because the point model doesn't capture the conversation.
The MQL scoring model isn't broken because someone set the weights wrong. It's broken because the math was always approximate, and the threshold was always about capacity. Both are structural, not configurational.
Most demand gen leaders already know this. The model persists not because it works, but because it's the thing everyone agreed on, and changing it requires a harder conversation than optimizing it. The CFO built budget models around MQL volume. The CRM is wired to it. The QBR slide is ready to go.
But the cost of not changing shows up in the same place, every quarter: a pipeline that looks like it's working until sales opens the queue.
Docket is the Agentic Marketing platform for B2B revenue teams. Its AI Marketing Agent opens a real conversation, answers from your approved product knowledge, qualifies intent in real time, and delivers an AQL to your rep. See how at docket.io.