Weekly standup notes are up in the channel topic — sprint 47 kickoff today at 10am
Good morning team! Welcome to week 1. Let's use this channel for async updates, questions, and coordination. Please keep standup notes brief and use threads for detailed discussions. Looking forward to shipping some great stuff this quarter.
Sprint planning is at 10am UTC today. Folks, please have your capacity numbers ready. We're still juggling the Atlas timeline and some tech debt.
Good morning! Quick question — should I be looking at the settlement service refactor or is that blocked on something? I'm ready to jump in.
hey emma, settlement service is good to go. i've got some notes in the wiki — let me know if you hit any snags
Kubernetes migration plan is ready to go. Waiting on finance to approve the cloud budget so we can actually provision the infrastructure. Should be ~120k for the first year.
Started on the webhook signature validation PR — going to need some review on the crypto logic when it's ready
Happy to review that — webhook security is critical. Just ping me once you push the draft.
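For anyone reviewing the crypto logic, here's a minimal sketch of what HMAC-based webhook signature validation usually looks like. It is not the code in the PR. The function names, the choice of HMAC-SHA256, and the hex encoding are all assumptions for illustration.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body (hypothetical scheme)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Compare the received signature to the expected one in constant time."""
    expected = sign_payload(secret, payload)
    # compare_digest avoids leaking how many leading characters matched.
    # Both sides are encoded to bytes so non-ASCII input cannot raise TypeError.
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )
```

The main things to check in review are that the signature is computed over the raw body bytes (not a re-serialized copy) and that the comparison is constant time rather than `==`.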
Just wrapped the investigation on the Onfido rate limiting issue. The problem was we were batching requests too aggressively. I'll have a fix in the backlog by EOD.
Quick check-in: can we get API v2 into staging by end of week? The product team wants to start integration testing.
api v2 is close but not quite ready for staging. we need at least 2 more days of testing on the payment flow. posted a PR yesterday if you want to look
Morning folks. Just checking in on the Atlas team status for this sprint.
Atlas is blocked on data pipeline work from Lena's team. Folks, we need to get this unblocked ASAP or we're losing the whole sprint.
We're on it. The settlement data schema needs to be updated but we're waiting for product to confirm the new field requirements. I pinged Alex again.
Got the ping! Sending you the spec now. Sorry for the delay.
Question on the settlement service — where should validation happen, in the handler or in the service layer? Saw a few different patterns in the codebase.
good catch. we've been inconsistent. for new code, do it in the service layer so handlers stay thin. i can share an example if you want
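Rough sketch of the "thin handler, validation in the service layer" pattern, in case it helps. This is not taken from the settlement codebase. The class names, fields, supported currencies, and status codes are made up for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


class ValidationError(ValueError):
    """Raised by the service layer when input fails a check."""


@dataclass(frozen=True)
class Settlement:
    account_id: str
    amount: Decimal
    currency: str


class SettlementService:
    SUPPORTED_CURRENCIES = frozenset({"EUR", "GBP", "USD"})  # hypothetical list

    def create_settlement(self, account_id: str, amount: str, currency: str) -> Settlement:
        # All validation lives here, so every caller gets the same checks.
        if not account_id:
            raise ValidationError("account_id is required")
        try:
            parsed = Decimal(amount)
        except InvalidOperation as exc:
            raise ValidationError("amount must be a decimal number") from exc
        if not parsed.is_finite() or parsed <= 0:
            raise ValidationError("amount must be positive")
        if currency not in self.SUPPORTED_CURRENCIES:
            raise ValidationError("unsupported currency")
        return Settlement(account_id, parsed, currency)


def handle_create_settlement(service: SettlementService, body: dict) -> tuple[int, dict]:
    # The handler only translates between the transport and the service.
    try:
        settlement = service.create_settlement(
            account_id=body.get("account_id", ""),
            amount=str(body.get("amount", "")),
            currency=body.get("currency", ""),
        )
    except ValidationError as exc:
        return 400, {"error": str(exc)}
    return 201, {
        "account_id": settlement.account_id,
        "amount": str(settlement.amount),
        "currency": settlement.currency,
    }
```

Because the checks sit in `SettlementService`, a batch job or a second endpoint calling the same method gets identical validation without copying handler code.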
Data pipeline metrics updated. Churn on the transaction reconciliation job is down to 0.3%. Ready to scale the batch size.
Good news. What's the expected throughput bump?
Roughly 18-20% increase in daily reconciliation volume. Should help us avoid backlog during peak hours.
Code review queue is getting long again. If anyone has bandwidth, we've got 7 open PRs waiting.
Fixed the circuit breaker timeout issue in the payment gateway connector. Tests are all green.
Quick reminder: on-call rotation is rolling to a new group starting tomorrow. Please check the schedule and make sure your laptop is charged.
deploy to production is scheduled for 2pm UTC. API v2 beta endpoints only. Folks, if you're on call, please keep an eye on the dashboards.
Webhook delivery latency is still creeping up on weekends. I think we need to look at queue depth again.
Can you check if this is load-shedding kicking in? I can pull the metrics from the last 3 weekends.
Yeah, that's it — the workers are hitting the max queue depth threshold around 3pm UTC. We need more workers or better batching.
deploy went smoothly. no errors on the beta endpoints. thanks to everyone who helped test
Reviewing the API v2 docs. Good work, Tom. One question about the rate limiting behavior on the new endpoints.
thanks sarah. rate limiting is the same as v1 for now — 100 req/min per client. we can tune it later if needed
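To make the 100 req/min per client number concrete, here's a small fixed-window limiter sketch. The thread doesn't say which algorithm the API actually uses, so the fixed-window approach, the class name, and the injectable clock are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow up to `limit` requests per client in each fixed time window."""

    def __init__(
        self,
        limit: int = 100,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        # client_id -> (window index, requests counted in that window)
        self._counters: Dict[str, Tuple[int, int]] = {}

    def allow(self, client_id: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        seen_window, count = self._counters.get(client_id, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            return False
        self._counters[client_id] = (window, count + 1)
        return True
```

One known trade-off of fixed windows: a client can burst close to twice the limit across a window boundary. A sliding window or token bucket smooths that out if it turns out to matter when tuning.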
All-hands on Friday at 11am UTC. Priya's got some updates on the roadmap and budget. Please try to attend synchronously.
Morning team. Do we have any known issues with the Broadgate onboarding flow right now?
not that i've seen. health checks are green. what's the issue?
Customer reported slow responses this morning. Might just be transient, but flagging it.
Frontend dashboard updates are live on staging. Team, can we get a final review before going to prod tomorrow?
I'll check the logs. Probably just a spike in traffic.
Quick update: finished the settlement validation work and posted a PR. Would appreciate a second set of eyes before merge.
Just saw your PR Emma. Looks solid. Minor comment on the error handling but overall good work. Will approve after that tweak.
Contract testing framework is up and running. Initial test suite has 240 tests covering the core API endpoints.
Engineering all-hands is tomorrow at 11am UTC. Looking forward to sharing some exciting updates on the product direction and addressing some of the infrastructure challenges we've had. See you there.
Alright, I'm starting the Kubernetes migration work today. Will need access to the staging environment and some time with Marcus on the network architecture.
I can block time this week. Let's align on the network topology and security requirements.
Can someone explain the new logging strategy we're moving to? Saw the RFC but want to make sure I understand the cost implications.
Question: are we still planning to deprecate the old payment endpoint by Q3?
Morning. Can someone review the payment reconciliation service? I think it's ready for staging.
i'll take a look. posting feedback in the PR
we're moving to structured logging with sampling at debug level. costs should be 30-40% lower than current setup. lena has the breakdown
that's the plan. we're tracking migrations on the dashboard. still have a few stragglers but most customers are on v2 now
Load test for the new payment processor integration is scheduled for tomorrow at 2pm UTC. If you want to watch the dashboard, link is in the pinned messages.
Yeah, the RFC has the full analysis. Main win is we sample debug logs and only keep full logs for warnings/errors. Payoff should be immediate.
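A minimal sketch of the idea using Python's standard `logging` module: sample debug records, always keep warnings and errors, and emit JSON lines. This is not from the RFC. The class names, the sample rate parameter, and letting INFO through unsampled are assumptions.

```python
import json
import logging
import random
from typing import Optional


class DebugSamplingFilter(logging.Filter):
    """Keep all WARNING+ records and only a random fraction of DEBUG records."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno <= logging.DEBUG:
            # random() is in [0.0, 1.0), so 0.0 drops all and 1.0 keeps all.
            return self._rng.random() < self.sample_rate
        # Assumption: INFO passes through unsampled, like WARNING and above.
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_handler(sample_rate: float) -> logging.Handler:
    handler = logging.StreamHandler()
    handler.addFilter(DebugSamplingFilter(sample_rate))
    handler.setFormatter(JsonFormatter())
    return handler
```

Attaching the filter to the handler rather than the logger means sampling applies to everything routed to that output, including records propagated from child loggers.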
This is super helpful context. We should probably document this in the onboarding guide so new engineers understand the logging patterns.
I'd like to take the logging migration work. Who should I pair with to get up to speed?
Happy to pair. Let's sync up Monday morning and I'll walk you through the migration plan.
Morning team. Just deployed the KYC improvements to staging. Would be good to have someone from the front-end team test the UX flow.
Has anyone had issues with the settlement service timeout recently? I've been seeing occasional 504s in staging.
Webhook signature validation PR is ready for review. Thanks tom-brennan for the early feedback.
Same. I think there's a database query that's slow under load. Lena mentioned it in standup but I can't find the ticket.
Yeah, it's TICKET-2847. We're adding an index. Should be deployed by Monday.
Quick update: we're adding more capacity to the on-call rotation to reduce burnout. New rotation starts next sprint. Thanks for your patience while we've had stretched teams.
All-hands went well. Thanks everyone for the engagement. A few follow-ups: (1) Cloud migration is now officially approved — waiting on finance to move the budget. (2) Atlas timeline is still tight but doable. (3) Incident response process updates coming next week.
budget approved? finally. i'll get started on the terraform configs and infrastructure planning. we should be able to start provisioning in 2 weeks.
quick question on the incident response process — will there be more documentation or training?
Yes, we'll have training next Thursday. It's mostly clarifications on escalation paths and communication. More details coming Monday.
Kubernetes upgrade testing is going well. No blockers so far, but I want to do one more full chaos test before we push to prod.
Just merged the database migration for the new customer fields. It's a zero-downtime deploy, so it should be smooth.
heads up — provisioning sandbox access for a new EU integration this week. if you see unfamiliar traffic patterns in the staging environment that's us. should be clean but flagging just in case.
Posted a question in #data-engineering about the ETL pipeline. Not sure if that's the right place but I'm trying to understand how the data flows for the new Atlas features.
Saw your question. I'll answer it there. Good initiative diving into the data side!
How are we doing on the search optimization work? Last I heard it was blocked.
Still blocked on the index redesign from product. They're reviewing the trade-offs. Folks, this is why we need clearer requirements upfront.
Starting work on the historical data migration tooling. This is going to be a bigger lift than initially scoped.
What's the new estimate looking like?
Probably 3-4 weeks instead of 2. The data inconsistencies in the legacy system are worse than we thought.
We should have feedback by Monday. Sorry for the delay on this one. The search team was pulled onto a customer issue.
no worries. we'll pick it up next sprint
Payment gateway timeout handling is fixed. Also added better logging so we can debug these issues faster in the future.
Anyone have bandwidth to pair on the KYC frontend? The backend is ready but the UX flow needs some love.
I can help with that! I've been working on validation logic so I'm familiar with the flow.
Great to see people jumping on pairing work. This is how we move fast as a team.
Kubernetes planning session with Marcus went well. We're looking at a phased migration: 1) dev environment, 2) staging, 3) production. Timeline is 6 weeks if we stay focused.
Excellent. That timeline works well with our Q2 planning. Keep the team posted on blockers.
Security patch for the HTTP client library — everyone needs to bump the dependency version by end of day.
Just noticed some inconsistency in the error responses between endpoints. Should we standardize?
good catch. yeah we should. can you file a ticket and we'll add it to the next sprint?
I've wrapped up the database index work for the settlement service. Merged to main. Should see performance improvements in production by end of day.
Weekly standup notes are up in the wiki. Please review and flag anything missing.
Pair programming session with marcus on the idempotency work — good progress on the event sourcing pattern.
Sprint 2 planning is Monday at 9am. Folks, we're going to have some hard conversations about priorities. Cloud migration, Atlas, and incident response training all competing for cycles. Let's be realistic about what we can do.
Quick question: should I start on the new API endpoint or focus on tech debt first? Just want to make sure I'm working on what the team needs most.
good question. let's talk about it in sprint planning on Monday. we'll have more clarity on priorities then
Payment processor integration load test results: passed with flying colors. Peak throughput hit 15k TPS with 99.2% success rate.
Weekend reading: the Kubernetes migration spec is finalized. Anyone interested in reviewing should check the wiki before Monday.
Team update: Pennington settlement delay issue from last week was a cascading failure in the notification service. Root cause was a timeout that we've now fixed. Good catch by Lena on the investigation.
Wait, did that affect other customers or just Pennington?
Mostly Pennington. They have a higher volume of settlements, so they were the ones who exposed the timeout issue. Other customers hit it too but less frequently.
This is exactly why we need better alerting. Folks, this is a perfect example of what we're trying to improve with the incident response process. We need to catch these things faster.
Kubernetes upgrade is live in staging. Running final sanity checks. Plan is to go to prod early next week.
Historical data migration is 40% done. Found some edge cases with time zone handling that needed fixing.
Good catch. Let's make sure those edge cases are covered in the final test suite.
sprint 2 planning went well. we're focusing on: 1) kubernetes migration phase 1, 2) atlas team support, 3) incident response training. api v1 deprecation gets bumped to sprint 3.
good. i can start the k8s setup this week. need about 10 hours from marcus on network planning.
I've got time blocked. Let's schedule for tomorrow morning.
I'm jumping on the Atlas support work. Tom, are there any gotchas I should know about before I dive in?
yeah, the data pipeline is still a bit chaotic. talk to lena first — she knows all the edge cases. after that you'll be golden
Incident response training is scheduled for Thursday at 2pm UTC. All engineers should attend. It's an hour.
Webhook queue depth is back down after the scaling work. Weekend latency is now stable at <150ms p99.
Quick question about the KYC flow — when should the UX validation happen vs. the backend validation?
Emma and I discussed this while pairing. We're doing client-side validation for UX feedback, server-side for security. It's in the PR if you want to check.
Perfect, that makes sense. I'll review the PR today.
Payment gateway resilience testing complete. Added 3 new failure scenarios we weren't handling before.
Code review process is being optimized. New SLA: all reviews within 24 hours. Let's see if we can maintain that.