
Confidence Scoring: Extending the MuleSoft AI Chain Connector

June 23, 2025
— a deep-dive, with formulas, intuition, and a real-world “Field Extraction Connector” example

Here's the challenge: the current MuleSoft AI Chain connector doesn't return confidence scores. When you call OpenAI's API through MuleSoft, you get back the generated text, but zero insight into how "sure" the model was about that response. Even more fundamentally, OpenAI's API itself doesn't expose confidence scores by default. The underlying models compute log-probabilities for every token, but the standard Chat Completions response omits that uncertainty information unless you explicitly request it. This creates a blind spot in enterprise integrations:
  • You can't distinguish between a model that's 95% confident vs. 45% confident
  • No way to implement smart fallbacks based on response quality
  • Manual review processes become all-or-nothing instead of risk-stratified
  • Debugging model behavior becomes nearly impossible without probability data
That's exactly what this project tackles—bridging the gap between what LLMs know about their own uncertainty and what integration platforms can actually use.
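For context, the raw material is there if you ask for it: OpenAI's Chat Completions API will return per-token log-probabilities when you opt in. A minimal Python sketch, assuming the openai SDK v1.x; the model name and prompt are placeholders:

```python
# Minimal sketch: requesting per-token log-probabilities from OpenAI's
# Chat Completions API (model name and prompt are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the invoice number: INV-7429 due 2025-07-01"}],
    logprobs=True,       # opt in: per-token log-probs are omitted by default
    top_logprobs=3,      # also return the 3 most likely alternatives per token
)

# Each element has .token and .logprob (natural log of the token probability).
for item in response.choices[0].logprobs.content:
    print(f"{item.token!r:>12}  log p = {item.logprob:.4f}")
```

Nothing in the connector uses these numbers yet; turning them into a usable confidence score is what the rest of this post is about.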
Large language models do not speak in certainties; they speak in log-probabilities—a compressed, negative-logarithmic view of how plausible each token is in context.
If you surface those numbers directly, nobody (except maybe a handful of ML engineers) can interpret them.
What business and platform teams want is a single scalar that answers a simpler question:
“How sure is the model about this particular answer?”
That scalar is the confidence score.
Treat it right, and you unlock:
  • dynamic fallback (e.g., “ask a human in the loop if confidence < 0.75”)
  • smarter routing between multiple LLMs
  • risk-aware caching
  • explainability dashboards for auditors and customers

The raw output of most decoder-style LLMs is a vector $\boldsymbol{\ell}$ of logits. After the softmax we obtain per-token probabilities $p_i$. Libraries often return
logprobs = log(p_i) for numerical stability.
Given a token $t$ with log-probability $\log p_t$:

$$p_t = e^{\log p_t} \tag{1}$$

For a multi-token answer $\mathbf{y} = (t_1,\ldots,t_n)$:

$$\log P(\mathbf{y}) = \sum_{k=1}^{n} \log p_{t_k} \quad\Rightarrow\quad P(\mathbf{y}) = \exp\left(\sum_{k=1}^{n} \log p_{t_k}\right) \tag{2}$$

Sequence probabilities shrink exponentially with length; we therefore need a normalization trick before calling this a “confidence.”
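Equations (1) and (2) in a few lines of Python; the per-token log-probs are made-up illustrative values:

```python
import math

# Hypothetical per-token log-probabilities for a short answer.
token_logprobs = [-0.12, -0.07, -0.05]

# Eq. (1): recover each token's probability.
token_probs = [math.exp(lp) for lp in token_logprobs]

# Eq. (2): the joint probability of the whole sequence is the product of
# token probabilities, i.e. exp of the summed log-probs.
joint_logprob = sum(token_logprobs)
joint_prob = math.exp(joint_logprob)

print(token_probs)   # ≈ [0.8869, 0.9324, 0.9512]
print(joint_prob)    # ≈ 0.787 — already shrinking after only three tokens
```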
Divide by length $n$:

$$\hat{p}(\mathbf{y}) = \left(P(\mathbf{y})\right)^{\frac{1}{n}} = \exp\left(\frac{1}{n}\sum_{k=1}^{n} \log p_{t_k}\right) \tag{3}$$

Intuition: we’re computing the geometric mean probability per token. Empirically you may want to squish probabilities so that they occupy $[0,1]$ in a human-meaningful way. A simple temperature-scaled logistic function works:

$$c = \sigma\left(\frac{1}{T} \cdot (\mu - \beta)\right) \quad\text{with}\quad \mu = \frac{1}{n}\sum_{k=1}^{n} \log p_{t_k} \tag{4}$$
  • $T$ – temperature hyper-parameter
  • $\beta$ – bias term to shift the curve
  • $\sigma(z)=\frac{1}{1+e^{-z}}$ – logistic function
Now $c \in (0,1)$ grows smoothly with average token log-prob. Another angle: measure uncertainty in the full distribution instead of just the chosen token:

$$H_t = - \sum_{i} p_{i,t} \log p_{i,t}$$

Lower entropy ⇒ higher confidence. One can invert and rescale:

$$c = 1 - \frac{H_t}{\log |V|} \tag{5}$$

where $|V|$ is vocabulary size (for perfect certainty $H_t=0 \Rightarrow c=1$).
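Here is a sketch of equations (3)–(5) in Python. The default temperature and bias match the worked example later in this post; the log-probs and the toy four-token vocabulary are illustrative, not values from the connector:

```python
import math

def length_normalized_confidence(token_logprobs):
    """Eq. (3): geometric-mean probability per token."""
    mu = sum(token_logprobs) / len(token_logprobs)   # average log-prob
    return math.exp(mu)

def calibrated_confidence(token_logprobs, temperature=0.5, beta=-0.2):
    """Eq. (4): squash the average log-prob through a temperature-scaled logistic."""
    mu = sum(token_logprobs) / len(token_logprobs)
    z = (mu - beta) / temperature
    return 1.0 / (1.0 + math.exp(-z))                # sigma(z)

def entropy_confidence(token_distribution, vocab_size):
    """Eq. (5): 1 minus the normalized entropy of one token's full distribution."""
    h = -sum(p * math.log(p) for p in token_distribution if p > 0)
    return 1.0 - h / math.log(vocab_size)

logprobs = [-0.2, -0.9, -0.4]
print(length_normalized_confidence(logprobs))   # ≈ 0.607
print(calibrated_confidence(logprobs))          # ≈ 0.354

# Toy 4-token "vocabulary": nearly all mass on one token => high confidence.
print(entropy_confidence([0.97, 0.01, 0.01, 0.01], vocab_size=4))  # ≈ 0.88
```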
Picture an integration platform that needs to fetch invoices from an S3 bucket, parse them through OCR, and extract structured fields (invoice_number, vendor_name, due_date, etc.) via an LLM.
  1. Downstream contracts
    A finance API is allergic to garbage. If the connector ships due_date = "" but claims success, you may silently kill an entire payment run.
  2. Human-in-the-loop escalation
    If confidence(invoice_number) < 0.6, route that single field to a clerk in the Accounts Payable UI; automate the rest (see the sketch after this list).
  3. Adaptive retry logic
    Low confidence? Maybe ask the model again with a few-shot prompt or a domain-specific reranker.
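A minimal sketch of that per-field gating, assuming the connector attaches one calibrated confidence to each extracted field; the field names, values, and threshold here are illustrative:

```python
# Hypothetical extraction result: each field carries its own calibrated confidence.
extracted = {
    "invoice_number": {"value": "INV-7429",   "confidence": 0.56},
    "vendor_name":    {"value": "Acme Corp",  "confidence": 0.91},
    "due_date":       {"value": "2025-07-01", "confidence": 0.88},
}

FIELD_THRESHOLD = 0.6  # fields below this go to a human; the rest flow through

auto_accepted, needs_review = {}, {}
for field, result in extracted.items():
    bucket = auto_accepted if result["confidence"] >= FIELD_THRESHOLD else needs_review
    bucket[field] = result

print("auto:", list(auto_accepted))      # ['vendor_name', 'due_date']
print("review:", list(needs_review))     # ['invoice_number']
```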
Say the LLM proposes
invoice_number = "INV-7429"
with per-token log-probs:
| token | log p |
| ----- | ----- |
| INV   | −0.12 |
| -     | −0.07 |
| 7429  | −0.05 |
Apply eq. (3):

$$\mu = \frac{-0.12 - 0.07 - 0.05}{3} = -0.08 \qquad \hat{p} = e^{-0.08} \approx 0.923$$

Insert into eq. (4) with $T=0.5$, $\beta=-0.2$:

$$c = \sigma\left(\frac{1}{0.5}(-0.08 + 0.2)\right) = \sigma(0.24) \approx 0.560 \tag{6}$$

Threshold is 0.7 → route to human.
Five seconds later a clerk corrects it to INV-7428 and the pipeline proceeds.
Without that scalar, you’d either over-trust (ship junk) or under-trust (human in every loop).
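The same arithmetic end to end, reusing the calibrated_confidence helper from the earlier sketch; the 0.7 threshold is the one from this example:

```python
import math

def calibrated_confidence(token_logprobs, temperature=0.5, beta=-0.2):
    """Eq. (4) applied to the geometric-mean log-prob of eq. (3)."""
    mu = sum(token_logprobs) / len(token_logprobs)
    return 1.0 / (1.0 + math.exp(-(mu - beta) / temperature))

# Per-token log-probs for "INV", "-", "7429" from the table above.
invoice_number_logprobs = [-0.12, -0.07, -0.05]

c = calibrated_confidence(invoice_number_logprobs)
print(round(c, 3))                 # 0.56

THRESHOLD = 0.7
action = "auto-accept" if c >= THRESHOLD else "route to human review"
print(action)                      # route to human review
```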
  • Temperature search: run a validation set with ground-truth flags, pick TT, β\beta that minimise Brier score.
  • Drift alerts: if the empirical confidence distribution skews downward over time, maybe your prompt or OCR is degrading.
  • Explainability: surface token-level contributions (heat-maps) next to the final confidence so end-users aren’t blindsided.
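One way to run that temperature search: grid-search $T$ and $\beta$ against a labeled validation set and keep the pair with the lowest Brier score. A sketch, assuming you already have per-answer average log-probs and correct/incorrect flags; the grid values and the sample data are illustrative:

```python
import math
from itertools import product

def brier_score(confidences, labels):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)

def calibrate(mus, labels, temps=(0.25, 0.5, 1.0, 2.0), betas=(-0.5, -0.2, 0.0, 0.2)):
    """Pick (T, beta) for eq. (4) that minimizes the Brier score on validation data.

    mus    -- average token log-prob per answer (eq. 3, before exponentiation)
    labels -- 1 if the answer matched ground truth, else 0
    """
    best = None
    for t, b in product(temps, betas):
        confs = [1.0 / (1.0 + math.exp(-(mu - b) / t)) for mu in mus]
        score = brier_score(confs, labels)
        if best is None or score < best[0]:
            best = (score, t, b)
    return best  # (brier, T, beta)

# Hypothetical validation data: three answers and whether each was correct.
print(calibrate([-0.08, -0.9, -0.3], [1, 0, 1]))
```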

  1. Log-probs are powerful but raw—transform them.
  2. Sequence length distorts; use geometric means or entropy.
  3. Calibrate into $[0,1]$ for business consumption.
  4. In connectors like Field Extraction, confidence controls quality gates, cost, and trust.
Treat confidence as a first-class citizen in your LLM stack, and you move from “cool demo” to reliable system.