In March, the Treasury Committee delivered worrying news: since January 2023, the top nine banks had faced 33 days of unplanned outages. That figure excludes the payday system crash in February, which makes matters worse. These IT failures dominated headlines just as the Digital Operational Resilience Act (DORA) came into force earlier this year.
What’s causing this fragility?
Banks blamed hardware changes, software bugs, human error, DDoS attacks, and third-party failures. Digging deeper reveals some telling root causes: expired certificates, outdated documentation, software incompatibilities, and missed maintenance. UK banks have skilled tech teams and invest heavily in their systems, yet outages continue, pointing to deeper, systemic issues. What's really behind the cracks? Let's dig in.
As customers shift to digital interactions, banking operations have become far more complex and reliant on technology. Customers demand instant access to banking services, with no more waiting for branches to open, and they reach those services through many digital channels: mobile apps, call centres, aggregation platforms, websites, open banking, mobile wallets, and third-party apps. Fraud has grown more advanced, forcing banks to step up efforts to protect customers. Legal obligations add to the challenge, with banks monitoring for a multitude of illegal transactions such as money laundering, drug financing, and dealings with sanctioned entities. Keeping systems secure, seamless, and always-on in this new digital era has become massively challenging.
To meet these demands, banks have invested in new technologies, architectures, and software and hardware products. Monolithic applications are being replaced by microservices. Overnight batch jobs are giving way to near-real-time, event-driven processing. Security controls have shifted to sophisticated identity and access management services, with advanced features like multi-factor authentication and biometrics. API usage has gone up, linking internal systems and connecting banks to the wider world. Hosting has become more diverse, with public clouds, private clouds, containers, and specialised hardware joining the mix. Yet older technologies still linger, making the IT landscape heterogeneous and complex.
The result?
What was once a relatively simple IT setup has morphed into a tangled web of connections, dependencies, and varied technologies. It behaves like a complex system, vulnerable to the "butterfly effect," where small issues can spiral into major disruptions. Addressing this systemic challenge requires a multi-pronged approach: architecture, application design, delivery methods, IT operations, funding models, and risk management all need a rethink. While many of these areas are evolving to tackle resilience issues, IT operations remain stuck in outdated practices, rooted in traditional methods that are no longer suited to the demands of always-on systems. The remaining sections highlight some of the key areas that need to change for banks to become more resilient.
Banks structure their IT operations in rigid layers: L1, L2, monitoring, infrastructure, applications and so on. The layers are further divided by departments specialising in specific lines of business, such as mortgages or credit cards, or in specific IT resources, such as servers, storage, and networks. This worked when the landscape was simpler, but today that is not the case. Even a simple request, like checking an account balance, triggers a web of software and hardware components: the front-end mobile app, channel microservices, authentication services, in-app biometric verification, API calls, queries to caches, and messages moving back and forth between integration layers.
So, when issues arise, the many teams owning each of these "components" and "components within components" need to come together. In the current organisational model, however, each team is incentivised to prove that the fault lies elsewhere, which slows resolution. It's like trying to steer a ship while every crew member argues about who's to blame for the leak.
This structure also works against the stated intent of "resilience by design." Teams focus on their own KPIs, such as staying within budget or stabilising specific systems. These local priorities rarely align, and when stitched together they fail to deliver the seamless resilience customers expect or regulators demand.
Changing these organisational structures is not a quick fix; it takes time and must be approached with care. Any new model should preserve the specialist expertise that the current structure nurtures, while creating alignment across people, processes, KPIs, and governance around the end-to-end services that truly matter.
At the start of this blog, we noted that many outages stem from issues like expired certificates, incompatible versions, and outdated security components. A quick search shows this is not unique to UK banks; similar avoidable issues occur worldwide. Why do seemingly simple tasks, such as updating certificates, so often get overlooked? It reflects a deeper issue: where automation lags, fragility follows.
Let me explain. Banks now deploy a vast number of software components across on-premise and cloud platforms. This surge has dramatically increased the number of digital certificates: thousands require constant tracking and frequent renewal. Stricter security standards shorten certificate lifespans, adding pressure. Yet many banks still manage certificates manually, often using spreadsheets. Central oversight is lacking, making lapses inevitable, and certificates expire "unexpectedly," causing disruptions. Centralised, automated certificate management is no longer optional; it is essential as volumes continue to grow.
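To make this concrete, here is a minimal sketch of what automated certificate oversight can look like. It assumes a small list of hypothetical internal endpoints and uses only the Python standard library; a real estate would feed this from a central certificate inventory and trigger renewal through a lifecycle-management tool, but even a simple scheduled check like this removes the "expired unexpectedly" failure mode.

```python
# Minimal sketch: scan a list of TLS endpoints and flag certificates nearing expiry.
# The endpoints are hypothetical placeholders for illustration.
import socket
import ssl
import time

ENDPOINTS = [
    ("api.example-bank.internal", 443),       # hypothetical API gateway
    ("payments.example-bank.internal", 443),  # hypothetical payments service
]
WARN_WITHIN_DAYS = 30

def days_until_expiry(host: str, port: int) -> int:
    """Fetch the server certificate and return the days left before it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires_at - time.time()) // 86400)

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        try:
            remaining = days_until_expiry(host, port)
        except (OSError, ssl.SSLError) as exc:
            print(f"ALERT  {host}:{port} unreachable or TLS error: {exc}")
            continue
        status = "RENEW" if remaining <= WARN_WITHIN_DAYS else "OK"
        print(f"{status:6} {host}:{port} certificate expires in {remaining} days")
```

Run on a schedule and wired into alerting, a check like this turns certificate expiry from a surprise into a routine, plannable task.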
But meaningful automation is easier said than done. It faces hurdles like integrating legacy systems, meeting strict compliance requirements, and resolving the complex engineering challenges of banking IT. To succeed, banks need a clear automation strategy. They must start small, scale thoughtfully, and adopt the right tools and expertise to streamline the process.
Over a fifth of the outages noted by the Treasury Committee came from third-party service failures. The UK is not alone. American banks have faced similar issues. Earlier this year, a major US bank suffered a multi-day outage caused by a vendor failure. Thousands of customers were unable to access accounts or receive payments.
Banks now depend more heavily on third-party products embedded in critical processes. But their resilience strategies haven’t kept up. A decade ago, banks installed third-party software on-site, tested it rigorously, and only then went live. Updates followed the same careful checks. Today, third-party tools are woven into operations—APIs, SaaS platforms, cloud infrastructure, and open-source components underpin modern banking systems. This shift brings risks. Problems at a provider ripple through banks, disrupting services as we saw in the CrowdStrike event. Denial-of-service attacks, compatibility issues from updates, or vulnerabilities in open-source elements can all cause chaos. Few banks are equipped to contain these risks or stop them spreading.
Fixing this requires a broad rethink. Third-party contracts must enforce stricter standards. Engineering teams need to focus on managing third-party risks, such as isolating faulty services and switching to backups. Procedures for handling changes to third-party products must cover a range of scenarios. Banks should also upgrade the staging and pre-production environments where third-party products are tested. Above all, banks must recognise that they bear ultimate accountability for delivering services, even when relying on third-party products, and their IT operations methods, tools, and approaches must be designed so they can fulfil that responsibility.
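One of the engineering patterns implied above, isolating a faulty third-party service and switching to a backup, is often implemented as a circuit breaker with a fallback. The sketch below is an illustrative simplification, not a production design, and the primary and fallback calls it mentions are hypothetical placeholders; real deployments typically lean on a resilience library or a service mesh.

```python
# Simplified circuit-breaker sketch: after repeated failures the primary
# third-party call is skipped for a cool-off period and a fallback is used.
import time
from typing import Callable, Optional

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], dict], fallback: Callable[[], dict]) -> dict:
        # While the breaker is open, route straight to the fallback so the
        # failing vendor is not hammered and customer requests still succeed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None   # cool-off elapsed: try the primary again
            self.failures = 0
        try:
            result = primary()
            self.failures = 0       # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Hypothetical usage: wrap a call to an external vendor with an internal fallback.
#   breaker = CircuitBreaker()
#   quote = breaker.call(primary=fetch_vendor_fx_quote, fallback=cached_fx_quote)
```

The point is the design choice, not the code: the bank, not the vendor, decides how a vendor failure degrades the customer experience.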
Observability—the ability to monitor IT systems in real time—is essential for resilience. But the growing complexity of modern banking systems makes this a daunting task. Monitoring thousands of applications across varied technologies, multiple cloud platforms, APIs, microservices, databases, and networks is like trying to map every ripple in a stormy sea. It's overwhelming and error-prone.
Tooling often adds to the confusion rather than resolving it. Most tools are designed for specific tasks, such as application performance monitoring or security anomaly detection. Few banks connect these tools to create a full picture of operations and an end-to-end view of critical services like payments and cash withdrawals. IT operations teams are left to interpret the flood of alerts and reports and make sense of their implications, which leads to cognitive overload and limits action. Fragmented organisational structures make it worse.
Take one recent incident. A client's API gateway ran out of disk space, triggering a P1 outage. The signs were obvious: disk space had been shrinking for months as transaction volumes grew. The appliance support team made repeated efforts to create extra space but couldn't prevent the outage. The root cause? Poor collaboration. The appliance team, API owners, gateway management, and custodians of the important business services (IBS) involved never came together to assess the impact of shrinking disk space and plan preventive measures. The result? A foreseeable problem left unresolved.
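The warning sign in that incident was a trend, not a single threshold breach. As a rough illustration (the figures below are invented), a simple linear projection over recent disk-usage readings is enough to turn "disk space has been shrinking for months" into a concrete run-out date that all the involved teams can plan around.

```python
# Rough sketch: project when a volume will fill up from recent usage samples.
# The readings are invented monthly snapshots of used space in GB.
samples = [(0, 410), (30, 452), (60, 489), (90, 533)]  # (day, GB used)
capacity_gb = 600

# Least-squares slope gives the average daily growth in GB.
n = len(samples)
mean_x = sum(day for day, _ in samples) / n
mean_y = sum(used for _, used in samples) / n
slope = sum((day - mean_x) * (used - mean_y) for day, used in samples) / \
        sum((day - mean_x) ** 2 for day, _ in samples)

_, latest_used = samples[-1]
days_left = (capacity_gb - latest_used) / slope
print(f"Growing ~{slope:.1f} GB/day; volume projected to fill in ~{days_left:.0f} days")
```

A projection like this, surfaced as a shared forecast rather than a siloed metric, gives every team a common deadline to act against.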
To match the complexity of IT systems, observation and monitoring tools must evolve. They should integrate inputs from diverse systems to provide a complete view. AI capabilities can help—spotting risks early, reducing false alarms, and guiding support teams during critical moments. Tools must move beyond basic log analysis, connecting data across sources and timeframes to recommend actions. Achieving this demands investment—not just in tools but also in operational procedures. Without such changes, resilience challenges will only worsen.
Today's IT operations follow a familiar playbook. Change and Run teams work separately. Application and infrastructure boundaries are rigid. Features for specific business needs often take priority over resilience or performance. Incident resolution is favoured over prevention, with supplier contracts designed to reward speed of resolution rather than incident reduction or automation.
These practices hinder resilience. As systems grow more complex and transaction volumes rise, this model struggles to adapt. Site Reliability Engineering (SRE) offers a solution. It unites support and development teams under a shared backlog and balances speed with stability. Operations become a software problem, with resilience responsibilities owned by everyone. SRE reduces toil by automating manual tasks, and it aims to prevent incidents instead of reacting to them.
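A concrete SRE mechanism for balancing speed with stability is the error budget: agree an availability target for a service and treat the permitted shortfall as a budget that risky change activity draws down. The numbers in the sketch below are illustrative only; in practice targets would be set per important business service.

```python
# Illustrative error-budget calculation for a single service; figures are invented.
slo_target = 0.999            # 99.9% of requests must succeed over the period
total_requests = 120_000_000  # observed request volume in the period
failed_requests = 95_000      # observed failures in the period

error_budget = (1 - slo_target) * total_requests  # failures the SLO tolerates
budget_used = failed_requests / error_budget

print(f"Error budget: {error_budget:,.0f} failed requests allowed")
print(f"Budget consumed: {budget_used:.0%}")
if budget_used >= 1.0:
    print("Budget exhausted: pause risky changes and prioritise reliability work")
```

The value is less in the arithmetic than in the agreement it forces: when the budget is spent, reliability work takes priority over new features.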
But adopting SRE in banks is not easy. Legacy systems and scattered tools pose obstacles. Cultural resistance adds friction, with teams reluctant to move beyond silos. Defining and measuring reliability in interdependent systems requires serious effort and often takes a few iterations to get right.
Still, banks must rethink IT operations. Complexity demands change. SRE is challenging to implement, but it is vital for achieving stable, resilient systems. The road may be tough, but the rewards—a modern, efficient IT environment—are worth the effort.