An Asian publishing company embarked on a major transformation program to deploy AI-based analytics and business intelligence solutions across all its strategic business units. The program witnessed initial success during the proof-of-concept and early validation stages. This encouraged the company to initiate a phase-wise enterprise roll-out. However, the roll-out turned out to be a major failure, primarily because the working prototypes and alpha releases could not be scaled up to the level of enterprise solutions or operationalized in production environments. Remediation efforts could not get the program back on track. The company eventually shelved this expensive program, and recognized the investment as a loss in its books.
This is not an isolated example. As more and more companies begin to invest in AI-driven digital transformation, similar stories of great starts but eventual setbacks (or even failures) are being talked about in AI industry circles. While the reasons differ from company to company and from program to program, a major factor that often emerges from the root-cause analyses of such derailments is a phenomenon known as Technical Debt.
What is Technical Debt?
Technical Debt (TD) can be described as the hidden cost that organizations accrue due to sub-optimal software design and/or development that, while addressing short-term goals in an expedient manner, necessitates significant future work. It is the price that companies pay for (i) prioritizing the ‘cost or speed of execution’ over the ‘quality of the solution’, or (ii) poor technical decision-making. The negative effects of TD include low developer productivity, complex code that makes future development difficult, frequent application outages, an increase in defects, and performance and scalability problems, among others.
The concept of TD has been applied to the field of software engineering since the 1990s. However, there are multiple viewpoints on its definition and scope. One of the most widely accepted explanations is Martin Fowler’s Technical Debt Quadrant, which classifies debt into four types along two axes: deliberate versus inadvertent, and reckless versus prudent.
Is Technical Debt always bad? Not necessarily.
In certain scenarios, it may be an optimal strategy to incur TD first, and address it later (if the need arises, and to the extent needed). This is often true of limited/low-end applications such as prototypes, proofs-of-concept, and early-stage releases. However, if the goal is to build sustainable solutions with long-term benefits (such as those in digital transformation programs), then the prevention, identification and repayment of TD in a structured manner become extremely important.
What Drives Technical Debt in AI-driven Transformation?
Technical Debt (TD) is a regular phenomenon in all software-based programs. Additionally, the accumulation and impact of TD increase significantly when transformation programs are driven by exponential and evolving technologies like Artificial Intelligence. Several factors serve as major TD drivers. More importantly, the effects of these drivers increase manifold with the addition of new features, advanced functionalities and complex integrations.
The first major driver is the practice of treating AI-based transformations in the same way as other digital or software-based transformations. This, in turn, leads to the practice of developing and releasing AI applications within short cycle times (rather than focusing on building AI applications that are inherently resilient). AI scientists and software developers are often aware of the deficiencies of their solutions but still release them due to business pressures, tight schedules, cost control, and other factors.
Many companies focus too much on ‘visible’ transformation that they can quickly or easily showcase. This strategy, when deployed in AI programs, may enable small gains but often ends up preventing the bigger ones.
Another major driver is the widely prevalent practice of leveraging open-source artifacts (such as libraries, frameworks and pre-built models) in developing AI and software solutions. These artifacts generally suffer from two critical limitations. Firstly, they contain multiple technical dependencies, some of which may be unnecessary or obsolete for the solutions being developed. The injection of these non-relevant codebases in the repositories is a major source of TD. Secondly, not every AI developer, data scientist or software engineer in the open-source community follows best-in-class practices and produces high-quality code. As a result, the usage of these artifacts sometimes injects bad code into the overall repositories, and leads to TD accumulation.
The huge global supply-demand gap in AI skills is the third important driver. As a result of the limited talent pool, companies often end up using amateur AI developers, low-skilled data scientists and junior engineers to deliver their transformation programs. Consequently, technically low-grade and inefficient solutions may be designed and developed that keep accumulating significant TD as time progresses.
Finally, an important driver of TD is the fact that large-scale AI transformation programs generally involve multiple technical teams, often from different vendor companies. It is often observed that, under the hood, these teams follow their own technical standards and practices. Consequently, programming inconsistencies and integration failure points may get injected into the repositories, thereby leading to significant TD accumulation.
The Accrual of Technical Debt in AI Projects
AI projects have certain specific TD challenges vis-à-vis other software projects.
- AI systems are inherently complex, particularly because of the interaction of the AI design/code with the rest of the software design/code. The higher the complexity of a system, the greater is the risk (and impact) of TD accumulation.
- The front-end and back-end applications that deploy/integrate/execute the AI components always contribute to TD accrual. The prioritization and repayment of such debt often need unique strategies that vary greatly from project to project.
- The AI models may become obsolete or inefficient as time progresses, and may need to be periodically modified, upgraded, re-trained or re-optimized. Moreover, the use of large amounts of data in AI projects adds a whole new dimension of TD.
- Finally, AI technologies are at the early stages of their respective maturity cycles. The use of evolving technologies in complex projects always leads to TD accrual.
As a result of the above challenges, the accumulation of TD in AI projects takes place in multiple ways, and in varying degrees.
Examples of TD accrual in AI projects
- Bundled functions, correlated features, tight class & method dependencies, and legacy models generally lead to significant TD accrual as time progresses.
- Data inconsistencies (often the result of poor data collection or inadequate pre-processing) may cause unstable input signals, and lead to TD accumulation.
- Deterministic (or fixed) thresholds for prediction or classification may become invalid over time, and may lead to TD accumulation.
- Extreme applications of Agile principles (such as YAGNI, ‘You Aren’t Gonna Need It’), when applied to AI projects, almost always end up adding TD.
- Pipeline jungles, often the result of glue codes or when AI applications are developed by multiple teams, may lead to serious TD accrual.
- Spaghetti code, especially when it results from indiscriminately combining multiple open-source codebases, always ends up adding significant TD.
- Unknown or hidden feedback loops in AI systems may create situations where inference exercises adversely impact model training, thus creating new TD.
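The fixed-threshold example in the list above can be made concrete with a minimal Python sketch. It assumes a hypothetical anomaly detector whose scores shift upward over time; the score values, the `FIXED_THRESHOLD` constant and the helper names are illustrative assumptions, not taken from any real system. Recalibrating a quantile-based threshold on recent data is shown as one possible mitigation, not the only one.

```python
def quantile_threshold(scores, percent=95):
    """Return the score at the given percentile of a reference sample."""
    ordered = sorted(scores)
    index = min(percent * len(ordered) // 100, len(ordered) - 1)
    return ordered[index]


def flag_rate(scores, threshold):
    """Fraction of scores flagged, i.e. strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical anomaly scores at launch, and later after the scores
# have shifted upward by 30 points (illustrative data only).
launch_scores = list(range(100))               # 0 .. 99
later_scores = [s + 30 for s in launch_scores]  # 30 .. 129

FIXED_THRESHOLD = 95  # hard-coded when the system went live

fixed_rate_launch = flag_rate(launch_scores, FIXED_THRESHOLD)
fixed_rate_later = flag_rate(later_scores, FIXED_THRESHOLD)

# Recalibrating on recent data restores the intended flag rate.
recalibrated = quantile_threshold(later_scores, percent=95)
recal_rate_later = flag_rate(later_scores, recalibrated)

print(fixed_rate_launch, fixed_rate_later, recalibrated, recal_rate_later)
```

In this sketch the hard-coded threshold flags 4% of cases at launch but 34% after the shift, while the recalibrated threshold brings the rate back to 4%. The debt lies in every downstream process that silently assumed the original rate.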
In addition, there are certain AI-linked phenomena that may not directly create TD, but significantly increase the impact of TD by combining with other factors.
A good example is Model Decay. This is a phenomenon that occurs when the performance of AI models deteriorates over time, and the models may eventually become unsuitable for operationalization. This phenomenon, when combined with other TD factors, has the potential to seriously derail entire AI programs.
Another example is Concept Drift. This is a phenomenon that occurs when the statistical properties of the target variable change unpredictably, thereby changing the relationship between the input data and the outcome being predicted, even though there may be no change in the distribution of the input data itself. This often leads to unintentional shifts in decision boundaries, thus augmenting the impact of TD.
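A small simulation may help illustrate how Concept Drift shows up as Model Decay. The sketch below is written under stated assumptions: the `model` and `true_label` functions, the drift point at step 100 and the `RollingAccuracyMonitor` class are all hypothetical, and it assumes that true labels eventually become available for comparison, which is often delayed in practice. A rolling accuracy check is only one simple way to detect such degradation.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window=50, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        # Only raise the flag once the window is full, to avoid early noise.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.min_accuracy


def model(x):
    """Stand-in for a trained model that learned 'label is 1 when x >= 50'."""
    return int(x >= 50)


def true_label(x, drifted):
    """Hypothetical ground truth; after drift the concept moves to x >= 90."""
    return int(x >= (90 if drifted else 50))


# The same 100 input values recur before and after the drift point,
# so the input distribution itself does not change.
inputs = [(i * 37) % 100 for i in range(200)]

monitor = RollingAccuracyMonitor(window=50, min_accuracy=0.8)
alarm_at = None
for step, x in enumerate(inputs):
    drifted = step >= 100
    monitor.record(model(x), true_label(x, drifted))
    if alarm_at is None and monitor.degraded():
        alarm_at = step

print("alarm raised at step:", alarm_at)
```

The inputs look identical before and after step 100, so a check on the input data alone would notice nothing. Only the comparison of predictions against actual outcomes reveals that the model no longer matches the concept it was trained on.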
How Should Technical Debt be Managed?
First and foremost, teams involved in AI-driven transformation must understand three key characteristics of Technical Debt (TD).
- Software-based transformations will never be free from TD. As such, TD cannot be fully prevented but only managed.
- TD is a continuous phenomenon, and gets accumulated at different stages of the transformation life cycle, in different ways, and with varying impact.
- The identification, impact measurement and repayment of TD are complex tasks. The level of complexity increases further when AI technologies come into play.
As companies transition from ‘Experimenting with AI’ to ‘Operationalizing AI’, business leaders, AI developers & data scientists need a deeper understanding of the causes, impact and management of TD.
Next, transformation teams should implement a two-fold Technical Debt Management (TDM) strategy.
- Abstracting and decoupling various business and data systems, as much as possible, while designing, architecting and developing the transformation solutions.
- Frequently validating the form and structure of the technical artifacts that are developed and deployed during the course of the transformation exercise.
The first strategy involves measures such as the adoption of standard design patterns and modern architectures (particularly Cloud-native and Microservices), efficient API life-cycle management, de-coupling different software components, de-coupling legacy applications from other applications, de-coupling data pipelines, and other such steps.
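As a minimal illustration of the de-coupling idea, the Python sketch below hides the scoring component behind a small interface, so that business logic does not depend on any particular model. All names (`Scorer`, `RuleBasedScorer`, `LinearModelScorer`, `should_flag`) and the weights are hypothetical; a real program would more likely place this boundary behind a service API, but the principle is the same.

```python
from typing import Protocol, Sequence


class Scorer(Protocol):
    """The only contract that the business code depends on."""

    def score(self, features: Sequence[float]) -> float: ...


class RuleBasedScorer:
    """Hypothetical legacy implementation."""

    def score(self, features: Sequence[float]) -> float:
        return 1.0 if sum(features) > 10 else 0.0


class LinearModelScorer:
    """Hypothetical replacement; weights would come from model training."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def score(self, features: Sequence[float]) -> float:
        return sum(w * f for w, f in zip(self.weights, features)) + self.bias


def should_flag(scorer: Scorer, features: Sequence[float], cutoff: float = 0.5) -> bool:
    """Business logic: knows the Scorer contract, not the model behind it."""
    return scorer.score(features) >= cutoff


features = [4.0, 7.0]
legacy_decision = should_flag(RuleBasedScorer(), features)
new_decision = should_flag(LinearModelScorer([0.05, 0.02]), features)
print(legacy_decision, new_decision)
```

Because `should_flag` only relies on the `score` method, the legacy rules can be replaced, re-trained or rolled back without touching the calling code, which limits how far any debt in the model layer can spread.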
The second strategy involves multiple measures such as Agile development, Refactoring, DevOps practices, Test-Driven Development (TDD), maintaining high-quality data pipelines, improving documentation, building mechanisms for detecting model decays and concept drifts, continuous model training, testing and optimization, and others.
It is important to include TD checks and repayment tasks in the iteration backlogs or product backlogs, or even track them as full-fledged TD backlogs. Repayment goals (such as Refactoring Targets) can be set up as percentages of the overall development efforts. TD prioritization is equally important to ensure that critical and high-impact debt is addressed before the rest. Certain preventive measures (such as efficient AI/software code reviews and high-quality data pipelines) must be made mandatory. These measures ensure that less TD gets injected into the system, and also slow down the pace of debt accumulation. Static code analysis tools like SonarQube can be leveraged to surface code-level debt. Furthermore, anomalies and sudden changes in the predictive behavior of AI systems need to be analyzed closely as they may provide early warning signals. Data dependencies, particularly under-utilized dependencies, should be carefully assessed.
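The backlog and repayment-goal ideas above can be sketched in a few lines of Python. Everything here is an illustrative assumption: the 1-to-5 impact scale, the 20% refactoring target, the effort estimates and the debt items themselves. The greedy ordering (highest impact first, cheaper items first on ties) is one simple prioritization rule among many.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    name: str
    impact: int       # 1 (low) to 5 (critical); hypothetical scale
    effort_days: int  # estimated repayment effort


def refactoring_budget(capacity_days, target_percent):
    """Days reserved for TD repayment, as a percentage of total capacity."""
    return capacity_days * target_percent // 100


def plan_repayment(backlog, budget_days):
    """Greedy plan: highest impact first, cheaper items first on ties."""
    planned, remaining = [], budget_days
    for item in sorted(backlog, key=lambda d: (-d.impact, d.effort_days)):
        if item.effort_days <= remaining:
            planned.append(item.name)
            remaining -= item.effort_days
    return planned


td_backlog = [
    DebtItem("hard-coded prediction threshold", impact=4, effort_days=2),
    DebtItem("undocumented glue code in data pipeline", impact=5, effort_days=5),
    DebtItem("unused open-source dependency", impact=2, effort_days=1),
    DebtItem("legacy model without retraining job", impact=5, effort_days=8),
]

# Assume 50 person-days per iteration and a 20% refactoring target.
budget = refactoring_budget(capacity_days=50, target_percent=20)
plan = plan_repayment(td_backlog, budget)
print(budget, plan)
```

With these numbers the budget is 10 days. The plan picks up the glue code, the hard-coded threshold and the unused dependency, while the legacy model item waits for a later iteration because it does not fit in the remaining budget.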
Closing Comments
Managing Technical Debt is a complex exercise, and needs significant technical and domain expertise. It is alarming to see that many companies investing in expensive AI and digital transformation programs are often unaware of the concept of Technical Debt, while some understand it only at a superficial level. This is especially true of companies where the AI/Data Science or digital transformation teams are not supported by strong software engineering talent. Moreover, a key challenge for practitioners is to determine TD-related metrics for use in AI programs. While software TD metrics have evolved over the years, AI & Machine Learning TD metrics are still in their formative stages.
Organizations must understand that the impact of TD is not just limited to derailing and disrupting transformation programs, but often extends to creating long-term barriers to innovation. This, in turn, may create significant risks to sustainable competitive advantage. Hence, the strategic importance of efficient TD management in high-end transformation programs is immense, and this will only increase as the stakes go higher.
Winning the AI revolution is not just about understanding ‘what drives (or enables) AI, and how’. It is also about understanding ‘what inhibits (or limits) AI, and how’.