
Confidence Scoring: Extending the MuleSoft AI Chain Connector

June 23, 2025
— a deep-dive, with formulas, intuition, and a real-world “Field Extraction Connector” example

Here's the challenge: the current MuleSoft AI Chain connector doesn't return confidence scores. When you call OpenAI's API through MuleSoft, you get back the generated text, but zero insight into how "sure" the model was about that response. Even more fundamentally, OpenAI's API itself doesn't expose confidence scores by default. The underlying models compute log-probabilities for every token, but the standard Chat Completions response omits that uncertainty information unless you explicitly request it. This creates a blind spot in enterprise integrations:
  • You can't distinguish between a model that's 95% confident vs. 45% confident
  • No way to implement smart fallbacks based on response quality
  • Manual review processes become all-or-nothing instead of risk-stratified
  • Debugging model behavior becomes nearly impossible without probability data
That's exactly what this project tackles—bridging the gap between what LLMs know about their own uncertainty and what integration platforms can actually use.
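For context, the raw material is there if you ask for it: OpenAI's Chat Completions API will return per-token log-probabilities when you opt in. A minimal Python sketch, assuming the openai SDK v1.x; the model name and prompt are placeholders:

```python
# Minimal sketch: requesting per-token log-probabilities from OpenAI's
# Chat Completions API (model name and prompt are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the invoice number: INV-7429 due 2025-07-01"}],
    logprobs=True,       # opt in: per-token log-probs are omitted by default
    top_logprobs=3,      # also return the 3 most likely alternatives per token
)

# Each element has .token and .logprob (natural log of the token probability).
for item in response.choices[0].logprobs.content:
    print(f"{item.token!r:>12}  log p = {item.logprob:.4f}")
```

Nothing in the connector uses these numbers yet; turning them into a usable confidence score is what the rest of this post is about.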
Large language models do not speak in certainties; they speak in log-probabilities—a compressed, negative-logarithmic view of how plausible each token is in context.
If you surface those numbers directly, nobody (except maybe a handful of ML engineers) can interpret them.
What business and platform teams want is a single scalar that answers a simpler question:
“How sure is the model about this particular answer?”
That scalar is the confidence score.
Treat it right, and you unlock:
  • dynamic fallback (e.g., “ask a human in the loop if confidence < 0.75”)
  • smarter routing between multiple LLMs
  • risk-aware caching
  • explainability dashboards for auditors and customers

The raw output of most decoder-style LLMs is a vector $\boldsymbol{\ell}$ of logits. After the softmax we obtain per-token probabilities $p_i$. Libraries often return
logprobs = log(p_i) for numerical stability.
Given a token $t$ with log-probability $\log p_t$:

$$p_t = e^{\log p_t} \tag{1}$$

For a multi-token answer $\mathbf{y} = (t_1,\ldots,t_n)$:

$$\log P(\mathbf{y}) = \sum_{k=1}^{n} \log p_{t_k} \quad\Rightarrow\quad P(\mathbf{y}) = \exp\left(\sum_{k=1}^{n} \log p_{t_k}\right) \tag{2}$$

Sequence probabilities shrink exponentially with length; we therefore need a normalization trick before calling this a “confidence.”
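Equations (1) and (2) in a few lines of Python; the per-token log-probs are made-up illustrative values:

```python
import math

# Hypothetical per-token log-probabilities for a short answer.
token_logprobs = [-0.12, -0.07, -0.05]

# Eq. (1): recover each token's probability.
token_probs = [math.exp(lp) for lp in token_logprobs]

# Eq. (2): the joint probability of the whole sequence is the product of
# token probabilities, i.e. exp of the summed log-probs.
joint_logprob = sum(token_logprobs)
joint_prob = math.exp(joint_logprob)

print(token_probs)   # ≈ [0.8869, 0.9324, 0.9512]
print(joint_prob)    # ≈ 0.787 — already shrinking after only three tokens
```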
Divide by length $n$:

$$\hat{p}(\mathbf{y}) = \left(P(\mathbf{y})\right)^{\frac{1}{n}} = \exp\left(\frac{1}{n}\sum_{k=1}^{n} \log p_{t_k}\right) \tag{3}$$

Intuition: we’re computing the geometric mean probability per token. Empirically you may want to squish probabilities so that they occupy $[0,1]$ in a human-meaningful way. A simple temperature-scaled logistic function works:

$$c = \sigma\left(\frac{1}{T} \cdot (\mu - \beta)\right) \quad\text{with}\quad \mu = \frac{1}{n}\sum_{k=1}^{n} \log p_{t_k} \tag{4}$$
  • $T$ – temperature hyper-parameter
  • $\beta$ – bias term to shift the curve
  • $\sigma(z)=\frac{1}{1+e^{-z}}$ – logistic function
Now $c \in (0,1)$ grows smoothly with average token log-prob. Another angle: measure uncertainty in the full distribution instead of just the chosen token:

$$H_t = - \sum_{i} p_{i,t} \log p_{i,t}$$

Lower entropy ⇒ higher confidence. One can invert and rescale:

$$c = 1 - \frac{H_t}{\log |V|} \tag{5}$$

where $|V|$ is vocabulary size (for perfect certainty $H_t=0 \Rightarrow c=1$).
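Here is a sketch of equations (3)–(5) in Python. The default temperature and bias match the worked example later in this post; the log-probs and the toy four-token vocabulary are illustrative, not values from the connector:

```python
import math

def length_normalized_confidence(token_logprobs):
    """Eq. (3): geometric-mean probability per token."""
    mu = sum(token_logprobs) / len(token_logprobs)   # average log-prob
    return math.exp(mu)

def calibrated_confidence(token_logprobs, temperature=0.5, beta=-0.2):
    """Eq. (4): squash the average log-prob through a temperature-scaled logistic."""
    mu = sum(token_logprobs) / len(token_logprobs)
    z = (mu - beta) / temperature
    return 1.0 / (1.0 + math.exp(-z))                # sigma(z)

def entropy_confidence(token_distribution, vocab_size):
    """Eq. (5): 1 minus the normalized entropy of one token's full distribution."""
    h = -sum(p * math.log(p) for p in token_distribution if p > 0)
    return 1.0 - h / math.log(vocab_size)

logprobs = [-0.2, -0.9, -0.4]
print(length_normalized_confidence(logprobs))   # ≈ 0.607
print(calibrated_confidence(logprobs))          # ≈ 0.354

# Toy 4-token "vocabulary": nearly all mass on one token => high confidence.
print(entropy_confidence([0.97, 0.01, 0.01, 0.01], vocab_size=4))  # ≈ 0.88
```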
Picture an integration platform that needs to fetch invoices from an S3 bucket, parse them through OCR, and extract structured fields (invoice_number, vendor_name, due_date, etc.) via an LLM.
  1. Downstream contracts
    A finance API is allergic to garbage. If the connector ships due_date = "" but claims success, you may silently kill an entire payment run.
  2. Human-in-the-loop escalation
    If confidence(invoice_number) < 0.6, route that single field to a clerk in the Accounts Payable UI; automate the rest (see the sketch after this list).
  3. Adaptive retry logic
    Low confidence? Maybe ask the model again with a few-shot prompt or a domain-specific reranker.
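A minimal sketch of that per-field gating, assuming the connector attaches one calibrated confidence to each extracted field; the field names, values, and threshold here are illustrative:

```python
# Hypothetical extraction result: each field carries its own calibrated confidence.
extracted = {
    "invoice_number": {"value": "INV-7429",   "confidence": 0.56},
    "vendor_name":    {"value": "Acme Corp",  "confidence": 0.91},
    "due_date":       {"value": "2025-07-01", "confidence": 0.88},
}

FIELD_THRESHOLD = 0.6  # fields below this go to a human; the rest flow through

auto_accepted, needs_review = {}, {}
for field, result in extracted.items():
    bucket = auto_accepted if result["confidence"] >= FIELD_THRESHOLD else needs_review
    bucket[field] = result

print("auto:", list(auto_accepted))      # ['vendor_name', 'due_date']
print("review:", list(needs_review))     # ['invoice_number']
```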
Say the LLM proposes
invoice_number = "INV-7429"
with per-token log-probs:
| token | log p |
| ----- | ----- |
| INV   | −0.12 |
| -     | −0.07 |
| 7429  | −0.05 |
Apply eq. (3):

$$\mu = \frac{-0.12 - 0.07 - 0.05}{3} = -0.08 \qquad \hat{p} = e^{-0.08} \approx 0.923$$

Insert into eq. (4) with $T=0.5$, $\beta=-0.2$:

$$c = \sigma\left(\frac{1}{0.5}(-0.08 + 0.2)\right) = \sigma(0.24) \approx 0.560 \tag{6}$$

Threshold is 0.7 → route to human.
Five seconds later a clerk corrects it to INV-7428 and the pipeline proceeds.
Without that scalar, you’d either over-trust (ship junk) or under-trust (human in every loop).
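The same arithmetic end to end, reusing the calibrated_confidence helper from the earlier sketch; the 0.7 threshold is the one from this example:

```python
import math

def calibrated_confidence(token_logprobs, temperature=0.5, beta=-0.2):
    """Eq. (4) applied to the geometric-mean log-prob of eq. (3)."""
    mu = sum(token_logprobs) / len(token_logprobs)
    return 1.0 / (1.0 + math.exp(-(mu - beta) / temperature))

# Per-token log-probs for "INV", "-", "7429" from the table above.
invoice_number_logprobs = [-0.12, -0.07, -0.05]

c = calibrated_confidence(invoice_number_logprobs)
print(round(c, 3))                 # 0.56

THRESHOLD = 0.7
action = "auto-accept" if c >= THRESHOLD else "route to human review"
print(action)                      # route to human review
```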
  • Temperature search: run a validation set with ground-truth flags, pick TT, β\beta that minimise Brier score.
  • Drift alerts: if the empirical confidence distribution skews downward over time, maybe your prompt or OCR is degrading.
  • Explainability: surface token-level contributions (heat-maps) next to the final confidence so end-users aren’t blindsided.
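One way to run that temperature search: grid-search $T$ and $\beta$ against a labeled validation set and keep the pair with the lowest Brier score. A sketch, assuming you already have per-answer average log-probs and correct/incorrect flags; the grid values and the sample data are illustrative:

```python
import math
from itertools import product

def brier_score(confidences, labels):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)

def calibrate(mus, labels, temps=(0.25, 0.5, 1.0, 2.0), betas=(-0.5, -0.2, 0.0, 0.2)):
    """Pick (T, beta) for eq. (4) that minimizes the Brier score on validation data.

    mus    -- average token log-prob per answer (eq. 3, before exponentiation)
    labels -- 1 if the answer matched ground truth, else 0
    """
    best = None
    for t, b in product(temps, betas):
        confs = [1.0 / (1.0 + math.exp(-(mu - b) / t)) for mu in mus]
        score = brier_score(confs, labels)
        if best is None or score < best[0]:
            best = (score, t, b)
    return best  # (brier, T, beta)

# Hypothetical validation data: three answers and whether each was correct.
print(calibrate([-0.08, -0.9, -0.3], [1, 0, 1]))
```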

  1. Log-probs are powerful but raw—transform them.
  2. Sequence length distorts; use geometric means or entropy.
  3. Calibrate into $[0,1]$ for business consumption.
  4. In connectors like Field Extraction, confidence controls quality gates, cost, and trust.
Treat confidence as a first-class citizen in your LLM stack, and you move from “cool demo” to reliable system.