Morning all. Wanted to flag that Pennington's quarter-end is approaching — David's team will be running larger-than-usual BACS batches on March 30 and 31. I've confirmed with Tom that our settlement window can handle the volume, but wanted everyone in the loop ahead of time.
Noted. I'll keep a close eye on the settlement engine during those windows. We handled similar volumes in December without issue but worth monitoring given the scale. Will flag if anything looks off.
Had a good catch-up with David this afternoon. He's generally happy with how things are running. He did bring up the Stripe outage from February — asked whether we've made any progress on the Meridian failover configuration. I told him the team was working on it. @ryan-kelly — is there anything I can share with him? Even a rough timeline would help. He's not pushing hard but it's clearly on his radar.
still on the backlog. hasn't been prioritised yet.
Understood. I'll manage David's expectations for now — but if that changes, let me know so I can update him.
Confirming the quarter-end batch schedule with Pennington. David's team will submit their large BACS batch on the 30th by 2pm and on the 31st by 1:30pm. They've agreed to send us a heads-up when each file is queued so we can monitor the processing window in real time.
David's also asked whether we can provide a quarterly settlement summary report broken down by payment rail — BACS, FPS, and CHAPS separately. He wants it for their payments committee board pack. @nina-volkov — is that something we can pull together?
We can do it manually from the ledger. Give me a couple of days. What format does David need?
PDF if possible — he's presenting to the committee. Thank you Nina.
Quick heads-up — I'm doing some routine maintenance on the batch processing indexes this week. Shouldn't affect anything in production but if anyone notices slightly longer query times on the settlement reports, that's me. Will be wrapped up by Thursday.
Thanks for the heads-up Lena. Good to know ahead of the quarter-end batches.
Settlement summary report for Q1 is ready — BACS, FPS, and CHAPS broken down by month. Exported as PDF. @chris-dawson I'll send it over to you directly.
Perfect timing, thank you. I'll get this across to David today.
Just a flag for today — Pennington's end-of-day batch is going to be larger than usual. David's team is clearing some backlog ahead of quarter-end. Settlement window starts at the normal time but expect higher volume than a typical Monday.
Incident open — Pennington's daily settlement batch has not completed as expected. Scheduled completion was 15:00, it's now 15:30 and the batch is still processing. Investigating now. Marking as SEV2 given the SLA implications for a Tier 1 customer.
Just received a call from David. His team can see the settlement hasn't landed and they're asking what's happening. I've told him we're actively investigating and will have an update within 15 minutes. Tom — what can I tell him?
@lena-park can you look at the settlement engine logs? I'm seeing the batch job stuck but I can't tell yet whether it's a deadlock or just processing slowly under the higher volume.
On it. Pulling up the logs now.
Found it. There's lock contention on the batch_transactions table. The end-of-day aggregation query is competing with the ongoing transaction writes. The larger-than-usual batch volume today is making it significantly worse — the lock wait timeout is being exceeded and the query keeps retrying. That's what's causing the stall.
So it's a resource contention issue, not a failure. Do we have a fix path or do we need to consider halting writes?
I can optimise the aggregation query to run against the read replica instead of hitting the primary. That would eliminate the contention entirely. Give me 30 minutes to write, test, and validate it in staging.
Go. @ryan-kelly can you pull up monitoring for the settlement engine? I want visibility on whether the contention is worsening.
on it. batch job is currently 40% complete. at current processing rate it'll take another 2-3 hours to finish.
David is calling back in 15 minutes. I need a clear status I can give him. What's the message?
Tell him we've identified the root cause: database contention during the batch processing cycle. A fix will be deployed within the next hour. Settlement is expected to complete no later than 19:00. No data loss, no financial impact, just a processing delay. A written incident summary will follow.
Clear — thank you. Calling him back now. Flagging ahead of time that David will want a formal written incident report. He mentioned SLA implications.
Updated David. He's not happy about the delay but he appreciates the transparency. He's agreed to wait for the 19:00 settlement window. He's asked for the formal incident report by end of week and said he'll want to discuss SLA implications once things have settled. Keeping him posted.
Fix is written and tested in staging. Deploying to production now.
Deployed. The aggregation query is now running against the read replica. Lock contention has cleared — the primary is no longer blocked. Batch processing has resumed at normal throughput.
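For anyone reading back later, the shape of the change is roughly the following. This is a minimal sketch, not the deployed code: `Endpoints`, `choose_dsn`, and both connection strings are hypothetical names, and it assumes the aggregation can tolerate reading from a replica that may lag the primary slightly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoints:
    """Hypothetical connection strings; not the real settlement engine config."""
    primary: str
    replica: str


def choose_dsn(endpoints: Endpoints, *, read_only: bool,
               replica_healthy: bool = True) -> str:
    """Send read-only work to the replica; everything else stays on the primary."""
    if read_only and replica_healthy:
        return endpoints.replica
    # Writes, and reads when the replica is unavailable, fall back to the primary.
    return endpoints.primary


ENDPOINTS = Endpoints(
    primary="db://primary.example.internal/settlement",
    replica="db://replica.example.internal/settlement",
)

# The end-of-day aggregation only reads, so it no longer competes with writes.
aggregation_dsn = choose_dsn(ENDPOINTS, read_only=True)
# Live transaction writes keep going to the primary.
write_dsn = choose_dsn(ENDPOINTS, read_only=False)
```

The `replica_healthy` fallback is a design choice in this sketch only: it trades a possible return of contention for the batch still completing if the replica is down.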
confirmed — batch processing rate is back to normal. estimating full completion by 18:45.
Good work Lena. Keeping the incident at SEV2 until settlement fully completes and is reconciled.
I've sent David a brief update — fix deployed, batch processing normally, expected completion by 18:45. He acknowledged and said he'll have someone monitoring on their end.
Pennington daily settlement completed at 18:52. All amounts reconciled correctly — no missing or duplicate transactions. Closing the active incident. We'll run the post-mortem this week and I'll have a write-up ready by Thursday.
Just called David to confirm. Settlement has landed on their side — all amounts as expected. He's relieved but he was clear that a 4-hour delay is outside what Pennington expects from a Tier 1 provider. He wants the formal incident report and he wants to schedule a meeting to discuss our SLA. I've told him we'll have both.
Post-mortem for the March 23 settlement delay. Summary: database lock contention during end-of-day batch processing. Root cause: the end-of-day aggregation query was running against the primary DB and competing with live transaction writes. Higher-than-usual batch volume on the day made the contention significantly worse. Fix deployed same day. Three action items: (1) Query optimisation: deployed to production, complete. (2) Add batch processing monitoring alerts so we catch this earlier. @ryan-kelly, can you handle that? (3) SLA review meeting with Pennington. @chris-dawson, can you get that scheduled? David is expecting it.
alerting configured. we'll get paged if batch processing latency exceeds 30 minutes. tested and active.
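For the write-up, the paging rule can be sketched as below. It assumes "latency" means time elapsed past the batch's scheduled completion, which is one reading of the alert, not a confirmed one; `should_page` and `PAGE_THRESHOLD` are hypothetical names, and the real alert lives in the monitoring tool rather than in code like this.

```python
from datetime import datetime, timedelta
from typing import Optional

PAGE_THRESHOLD = timedelta(minutes=30)  # threshold quoted in the message above


def should_page(scheduled_completion: datetime, now: datetime,
                completed_at: Optional[datetime] = None,
                threshold: timedelta = PAGE_THRESHOLD) -> bool:
    """Page when an unfinished batch is more than `threshold` past its schedule."""
    if completed_at is not None:
        return False  # batch already finished, nothing to page on
    return now - scheduled_completion > threshold


# Illustrative times only; the date is arbitrary.
scheduled = datetime(2000, 1, 3, 15, 0)
at_threshold = scheduled + timedelta(minutes=30)
past_threshold = scheduled + timedelta(minutes=31)
```

With a strict "greater than" comparison, a batch exactly 30 minutes late does not page yet; one minute later it does.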
Thanks Tom. I'll send David the incident report today and get the SLA review scheduled for next week.
One more thing to add to the post-mortem — I've also added an index on the batch_transactions table that should prevent lock escalation from occurring in the first place. Tested under load simulation and it's solid. The read replica change plus the new index means this shouldn't recur even at significantly higher volumes.
Nice. Thanks Lena — I'll add that to the post-mortem write-up.
Sent the formal incident report to David this afternoon. He's acknowledged receipt and will review with his team. He reiterated that he wants the SLA review to cover the broader service commitments — not just this incident — given Pennington's position as our largest customer. I've told him we'll have a date confirmed shortly.
Pennington's quarter-end BACS batch ran smoothly on the 30th. Completed well within the processing window. David's team confirmed all settlements received on their end. Good to have a clean run after last week.
March 31 quarter-end batch also completed without issue. All BACS settlements processed and confirmed. Volumes were as expected — nothing that stressed the engine. Lena's changes clearly helped.
Quick query from David's team — they'd like to adjust their standard batch submission deadline from 2pm to 2:30pm. Their ops team sometimes runs close on file preparation. It shouldn't affect our processing window. Any objections from the engineering side?
Fine from our side. The settlement engine has plenty of headroom now with Lena's changes. 2:30 works.
Great — I'll confirm with David.
Still need to get the SLA review into the diary. David's been tied up with internal audits this week — his team flagged that when I followed up. I'll get a date pinned down next week.
David's team has a question about the format of our daily settlement confirmation files. They want to start auto-ingesting them into their reconciliation system. @lena-park — is the file format documented anywhere?
Yes — it's standard ISO 20022 camt.054 format. I can put together a schema doc with field mappings if their technical team needs it.
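As a starting point for the schema doc, here is a rough sketch of reading entries out of a camt.054 file with the Python standard library. The namespace version (`camt.054.001.08`), the sample values, and the `parse_entries` helper are assumptions for illustration; the sample is trimmed and not schema-valid, and the real confirmation files may use a different version and carry many more fields.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Assumed version; check the xmlns on a real confirmation file.
NS = {"c": "urn:iso:std:iso:20022:tech:xsd:camt.054.001.08"}

# Trimmed, made-up sample: only the elements this sketch reads.
SAMPLE = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.054.001.08">
  <BkToCstmrDbtCdtNtfctn>
    <Ntfctn>
      <Id>EXAMPLE-1</Id>
      <Ntry>
        <Amt Ccy="GBP">100.00</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
      </Ntry>
      <Ntry>
        <Amt Ccy="GBP">25.50</Amt>
        <CdtDbtInd>DBIT</CdtDbtInd>
      </Ntry>
    </Ntfctn>
  </BkToCstmrDbtCdtNtfctn>
</Document>"""


def parse_entries(xml_text: str):
    """Return (amount, currency, credit/debit indicator) for each Ntry element."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(".//c:Ntry", NS):
        amount = entry.find("c:Amt", NS)
        indicator = entry.findtext("c:CdtDbtInd", namespaces=NS)
        # Decimal keeps monetary amounts exact, unlike float.
        entries.append((Decimal(amount.text), amount.get("Ccy"), indicator))
    return entries


entries = parse_entries(SAMPLE)
```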
That would be very helpful — thank you Lena. I'll pass it along to David's ops team.
Still working on getting the SLA review scheduled. David's team has been occupied with their internal audit cycle. I'll get that in the diary next week — just need to find a slot that works for his team.
March settlement reconciliation for Pennington is complete. All amounts match across BACS, FPS, and CHAPS. No discrepancies. Filed.
Thank you Nina. I'll include that in my monthly update to David.
Heads-up on Easter bank holidays — Good Friday (April 18) and Easter Monday (April 21). BACS won't process on those days. I've already communicated this to David's team and they're adjusting their submission schedule accordingly. No action needed from our side.
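If it helps anyone sanity-check submission dates, the rule can be sketched as follows. The year 2025 is an assumption (it is one year in which Good Friday falls on April 18), the helper names are hypothetical, and the holiday set only contains the two dates mentioned above rather than a full bank holiday calendar.

```python
from datetime import date, timedelta

# Dates from the message above; the year is assumed.
NON_PROCESSING_DAYS = {date(2025, 4, 18), date(2025, 4, 21)}


def is_processing_day(day: date, holidays=NON_PROCESSING_DAYS) -> bool:
    """True for Monday to Friday, excluding the listed bank holidays."""
    return day.weekday() < 5 and day not in holidays


def next_processing_day(day: date, holidays=NON_PROCESSING_DAYS) -> date:
    """First processing day strictly after `day`."""
    candidate = day + timedelta(days=1)
    while not is_processing_day(candidate, holidays):
        candidate += timedelta(days=1)
    return candidate


# The Thursday before Easter: the next processing day skips the holidays and the weekend.
after_easter = next_processing_day(date(2025, 4, 17))
```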
Caught up with David today — routine monthly call. Business as usual on their side, volumes are stable and he's happy with the processing reliability since March. He did ask about our API v2 timeline. I said I'd check with the product team. @alex-reed — quick one, David Hargreaves at Pennington is asking about the API v2 roadmap. Is there a timeline I can share?
SLA review is on my list for this week — I'll get that sorted. David has been patient but I want to get it done.
Early May bank holiday on Monday the 5th — same BACS adjustment as Easter. Pennington's team is already aware and has adjusted their batch schedule for that week.
April reconciliation for Pennington is done. Clean again — no issues across any payment rail.
Thanks Nina.
Starting to think about the Pennington annual review — it's due in June. The contract anniversary is technically March but the review has always been scheduled for June to give us a full year's data. I need to start pulling together the review pack over the next few weeks.
This is going to be a significant review — £680k contract, our largest customer. David will want to go through volumes, SLA performance, and roadmap in detail. @nina-volkov — can you start pulling together the financial summary for Pennington? Monthly volumes, transaction counts, and settlement performance across all rails.
On it. When do you need it by?
End of May ideally. The review will likely be mid-June.
I also need to get the SLA conversation done before the annual review. We had the settlement incident in March and I promised David an SLA review meeting at the time. That hasn't happened yet — I should get that sorted before we walk into the June review.
Had my monthly call with David today. He was fine across most topics but he specifically raised the SLA review — quoted back to me that we had committed to doing it in early April. I apologised and told him I would get it scheduled this month. He was polite about it but it was clear he was noting that we hadn't followed through.
I need to actually get this SLA review done. With the annual review coming up in June, we can't walk into that meeting without having addressed the March incident properly. Going to send David some date options today.
Happy to join the call if it would help — I can walk through the technical root cause and the changes we made. The settlement engine has been solid since March.
That would go a long way Tom, thank you. Let me get a date confirmed with David first.
I've sent David three options for the SLA review — week commencing May 19, May 26, or June 2. Hoping to get it done before the end of May so there's time to address anything before the annual review.
David has come back — he can do May 28th at 2pm. I'll send a formal calendar invite. @tom-brennan — can you keep that time free? I'd like you to present the technical post-mortem and walk through the monitoring improvements we've put in place.
May 28 at 2pm works for me. I'll prepare a clean summary of the root cause, fix, and the alerting we've added. Should be straightforward.
Morning. Just getting organised ahead of the Pennington SLA review on the 28th. @tom-brennan — can we find 30 minutes today or tomorrow to align on what we're presenting? I want to make sure we cover the root cause, the fix, and the monitoring improvements clearly. David will also want to discuss the broader SLA terms — the framework hasn't been reviewed since August and there are some areas I want to make sure we're aligned on before that conversation.