It is common knowledge that efficient architecture design is a key aspect of building any product or solution, including AI applications. In reality, however, companies often pay limited attention to developing robust end-to-end architectures before initiating AI application development. This leads to problems such as schedule and cost overruns, automation job failures, interoperability and scalability issues, low predictive power of AI systems, poor application performance, and unstable production deployments.
Background
This paper stems from a discussion that I had with the technical team of a stealth-mode AI startup that I mentor. Here is a snippet of that discussion.
Founder/Tech Lead – Don’t you think that we should just focus on the functional requirements till we get the alpha out? The non-functional elements can be addressed later.
Tamal – Not really. While the functional features will obviously take precedence, you cannot completely ignore non-functional attributes like security, performance & interoperability. Firstly, once you release the alpha, you’ll have very little time due to pressure from your anchor investors and early (or potential) clients to release the beta. Secondly, if you are not provisioning for key non-functional requirements at this stage, it will not be easy to incorporate them later on. There will be a lot of technical challenges to deal with, and those can be overwhelming, especially for a small team like yours.
Lead Engineer – But that will slow us down, at least by 3 weeks.
Tamal – Yes, but a three-week delay now, at a time of low pressure and minimal existing development, is better than the pain you may face once you are deep into the development game and expectations increase. Also, these three weeks of work may easily grow to four to six weeks (or even more) by the time you move close to the beta.
Data Scientist – But this is an AI application, and our emphasis is on NLP and Machine Learning. Should we still focus too much on security, interoperability, etc.?
Tamal – Even more so. Remember that while there is a lot of hype around AI today, most customers are still unsure about how to evaluate and implement new AI/ML applications. Any suspicion in their minds regarding your ability to provide them production-level performance (e.g. scalability, high availability, negligible downtime, etc.) can be disastrous. The first question that your customers will ask is if this solution can be integrated with their current technology landscape. Note that your solution is only going to solve a small part of their requirements. Hence, they may not even exhibit interest in your demo if they feel that this integration is going to be difficult. Their next questions will center around how you will address things like data privacy and security, especially if you expect them to provide ML training data to you at some stage.
… The discussion progressed for an hour, and the team decided to focus more on improving the technical design and architecture before diving deep into the alpha development.
Conversations like the above are becoming increasingly common in the AI industry. While everyone understands the need for robust architectures, well-designed end-to-end architectures that enhance the value of AI applications remain the exception rather than the rule. This is primarily on account of three factors:
- Most companies have separate Machine Learning and Software Engineering units that operate independently and/or without shared ownership (at least in terms of planning and execution, even if they are under the same management). These teams generally pursue different architectural frameworks, processes and standards based on their respective workloads. Rarely do they emphasize establishing holistic, end-to-end architectural designs that combine both their worlds.
- The intangible returns from investments in architecture design are not well understood by many companies. It is not uncommon to see architecture designs applied directly from past projects, with minimal or no change. There is little emphasis on developing greenfield architectures, or on optimizing brownfield ones to account for the specifics of new use cases. This is often true of service organizations, where the major objective function, more often than not, ends up as cost or cycle-time minimization, leading teams to cut corners. One of the key areas that gets impacted is architecture design.
- Architecture design, particularly for AI applications, is a complex exercise. It needs depth and breadth of understanding of advanced concepts, processes, frameworks and technologies at multiple levels – infrastructure, data, middleware, application, AI, user experience, performance engineering, etc. There is a general shortage of architects who possess these skills. Moreover, the evolving nature of AI (and allied technologies) further adds to the constraints in designing optimal architectures.
What does it mean to design efficient end-to-end architectures for AI applications?
While an in-depth discussion on AI solution/application architecture is beyond the intended scope of this article, below are the key architectural points that need to be well understood and optimally addressed before developing AI applications.
- Reduce the AI system’s complexity, both at structural and behavioral levels. This is achieved by designing for greater cohesion within the elements of each module, loose coupling among different modules, reducing dependencies (both within and among modules), increasing controllability of components, decomposing the business problem into smaller units and optimally partitioning them (e.g. separating AI logic and engineering workflows), implementing the ‘principle of information hiding’, etc.
- Implement modern architectural patterns such as Blackboard, Broker, Cloud-Native, Event-Driven, Layered, Microkernel, Microservices, Model-View-Controller (MVC), Peer-to-Peer, Serverless, Service Oriented Architecture (SOA), etc. Particular emphasis must be placed on Evolutionary Architectures that support guided incremental changes through the optimization of Fitness Functions. The architectural pattern determines how the application will eventually shape up, and steps must be taken to ensure that the pattern selection process is in accordance with the needs of the AI solution.
- Determine the Machine Learning architecture and related factors. These include determining the search process for neural networks, finalizing the structure and type of networks, hyper-parameter selection/optimization, transfer learning options, integration process of different agents, determining the type of computational graph, etc. Additional factors may need to be assessed (depending on the complexity of the Machine Learning problem), such as adversarial architectures, attention-focusing mechanisms, second order optimization, quantization with smoothing, neuro-evolution techniques, etc.
- Address the specific needs of real-time AI applications. Such applications (particularly those related to customer service, defence, finance, fraud prevention, manufacturing, transport, IoT, etc.) often need specific features, such as tight integration of software and hardware architectures; linkages among multiple model inferences from multiple channels; dynamic identification and notification of outlier events; inference or analysis at the edge; high-end streaming architectures; etc.
- Address the high compute and high memory needs of Deep Learning applications. Due to the compute-intensive and memory-intensive nature of such applications, there is often a need for additional high-end software and hardware components. Examples include advanced hardware accelerators, hyper-performance engineering systems, memory-augmentation boosters, ‘out-of-the-box’ Big Data architectures, etc.
- Design multi-task learning agents (since many use cases today require concurrent optimization of multiple loss functions) through approaches like block-sparse regularization, parameter sharing (hard or soft), fully-adaptive feature sharing, cross-stitch units, etc.
- Design for high extensibility, interoperability, scalability, usability and portability. Note that a common need today is to ensure that AI applications run on multi-Cloud environments, and also get seamlessly integrated with other applications that run on multi-Cloud environments. Additionally, factors like automated data pipelines, high concurrency, multi-tenancy, runtime optimization, optimal function invocation, and exception handling mechanisms need to be appropriately addressed.
- Design for high availability, fault-tolerance and security, such as 99.999% system uptime, zero-downtime upgrades, retry strategies for transient faults, failover-failback measures, graceful degradation strategies, backup-recovery mechanisms, continuous delivery and deployment, tight access control and authentication, multi-level security provisioning, OWASP tests, compliance measures, etc.
- Provision for Technical Debt management. Most AI applications are built on top of open-source or third-party frameworks, libraries and datasets. While these help create reference structures (that AI teams can follow) and shorten the development-deployment-maintenance cycle, they also introduce many dependencies and unnecessary code or functions into the AI system. These often end up as technical debt that compounds over time and poses structural risks to the long-term stability of the application. Hence, AI teams need to ensure that there is a mechanism to manage this technical debt.
- Optimize trade-offs among conflicting attributes. Architectures need to strike a balance among multiple conflicting factors. For example, high fault tolerance can be achieved through triple-mode redundancy, but that increases computing time and cost. Similarly, deep logging levels ease triaging and defect resolution, but only at higher storage and performance costs. Other considerations include Technology A versus B, Cloud X versus Y, latency versus throughput, security versus customer experience, etc. These trade-offs need to be rigorously planned, executed and managed in accordance with the core value proposition of the AI application.
- Adhere to the SOLID Design Principles – Single Responsibility Principle (SRP), Open/Closed Principle (OCP), Liskov Substitution Principle (LSP), Interface Segregation Principle (ISP), and Dependency Inversion Principle (DIP). While these principles originate in object-oriented programming (OOP), their core philosophies can (and should) be applied to every AI application.
- API Lifecycle Design must ensure efficient abstraction, autonomy, discoverability, management and reusability. Additionally, it should support seamless evolution of APIs so that changes to current APIs do not impact existing applications that consume them.
- Other factors such as baselining approaches, complex event processing goals, code maintainability, optimization of production environment, prioritization of requirements, rationalization of system and application overheads, and selection of metrics/KPIs (e.g., cyclomatic complexity, confusion matrix, depth of inheritance tree, divergence measures, F1 score, etc.) are also important.
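Several of the points above lend themselves to small illustrations. For instance, the Fitness Functions that guide Evolutionary Architectures can be expressed as automated checks over architectural metrics. A minimal sketch in Python (the metric names and thresholds are hypothetical):

```python
def latency_fitness(p95_ms: float, budget_ms: float = 200.0) -> bool:
    """Fitness function: 95th-percentile latency must stay within budget."""
    return p95_ms <= budget_ms

def coupling_fitness(dependency_count: int, max_deps: int = 5) -> bool:
    """Fitness function: a module may not depend on too many other modules."""
    return dependency_count <= max_deps

def failed_checks(metrics: dict) -> list:
    """Evaluate every fitness function; any failure guides the next
    incremental change to the architecture."""
    checks = {
        "latency": latency_fitness(metrics["p95_ms"]),
        "coupling": coupling_fitness(metrics["deps"]),
    }
    return [name for name, ok in checks.items() if not ok]
```

Run in CI, checks of this kind turn architectural intent into guided, verifiable increments rather than one-off review comments.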
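Similarly, the hyper-parameter selection mentioned under the Machine Learning architecture point is often bootstrapped with a simple random search before heavier optimizers are brought in. A sketch, assuming a generic objective function to be maximized:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Minimal random search for hyper-parameter selection.
    `space` maps each parameter name to its candidate values;
    `objective` scores a parameter dict (higher is better)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice the objective would wrap a full train-and-validate cycle, which is exactly why the search process itself deserves an architectural decision rather than an afterthought.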
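The multi-task learning point above hinges on optimizing several loss functions concurrently. Under hard parameter sharing, shared layers are trained against a single combined objective; the weighting itself is the simple part and can be sketched as follows (the task names and weights are illustrative):

```python
def multi_task_loss(losses: dict, weights: dict) -> float:
    """Combine per-task losses into one scalar objective. In hard
    parameter sharing, shared layers are updated against this combined
    loss, while each task-specific head sees only its own term. The
    weights would normally be tuned or learned, not fixed by hand."""
    return sum(weights[task] * loss for task, loss in losses.items())
```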
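The retry strategies for transient faults noted above are commonly implemented as exponential backoff. A minimal sketch (the attempt count, delays and the set of "transient" exception types are illustrative choices, not a prescription):

```python
import time

def retry(fn, attempts=3, base_delay=0.01, transient=(ConnectionError, TimeoutError)):
    """Call `fn`, retrying on transient faults with exponential backoff.
    The delay doubles after each failed attempt; the final failure is
    re-raised so callers can fail over or degrade gracefully."""
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Production implementations typically add jitter to the delay so that many clients do not retry in lockstep against a recovering service.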
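One lightweight way to make the trade-off analysis above explicit is a weighted scoring matrix: each candidate architecture is scored on the conflicting attributes, and the weights encode the product's core value proposition. The attributes, scores and weights below are purely illustrative:

```python
def weighted_score(option: dict, weights: dict) -> float:
    """Weighted trade-off score for one architecture option.
    Attribute scores use a common 0-10 scale."""
    return sum(weights[attr] * option[attr] for attr in weights)

def pick_option(options: dict, weights: dict) -> str:
    """Return the option name with the highest weighted score."""
    return max(options, key=lambda name: weighted_score(options[name], weights))
```

The value of the exercise is less the arithmetic than the forced articulation of which attribute actually matters most to the application.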
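As one example of carrying the SOLID principles into AI code, the Dependency Inversion Principle suggests that a high-level training pipeline should depend on an abstraction rather than on a concrete storage backend. A sketch with hypothetical class names:

```python
from abc import ABC, abstractmethod

class FeatureStore(ABC):
    """Abstraction the pipeline depends on (Dependency Inversion):
    high-level policy never imports a concrete backend."""
    @abstractmethod
    def load(self, key: str) -> list: ...

class InMemoryStore(FeatureStore):
    """Low-level detail, injected at the edge of the application; a
    production variant might wrap a database or a feature platform."""
    def __init__(self, data: dict):
        self._data = data

    def load(self, key: str) -> list:
        return self._data[key]

class TrainingPipeline:
    """Depends only on the FeatureStore interface, so backends can be
    swapped (or mocked in tests) without touching pipeline logic."""
    def __init__(self, store: FeatureStore):
        self._store = store

    def feature_count(self, key: str) -> int:
        return len(self._store.load(key))
```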
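The seamless API evolution described above is usually achieved by making changes additive: a new version may add fields, but never removes or renames what existing consumers rely on. An illustrative sketch (the endpoint shapes are hypothetical):

```python
def get_user_v1(user: dict) -> dict:
    """Original response shape; existing consumers depend on it."""
    return {"name": user["name"]}

def get_user_v2(user: dict) -> dict:
    """Evolved response: adds a field without removing or renaming
    anything from v1, so v1 consumers can still parse v2 responses."""
    response = get_user_v1(user)
    response["email"] = user.get("email", "")
    return response

# Versioned routing keeps old consumers on the contract they were built against.
ROUTES = {"v1": get_user_v1, "v2": get_user_v2}
```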
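Of the metrics listed above, the F1 score is derived directly from confusion-matrix counts, as the harmonic mean of precision and recall:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts: the harmonic mean of
    precision (tp / (tp + fp)) and recall (tp / (tp + fn))."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```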
Why should AI teams pay extra attention to architecture design?
A strong focus on architecture enables key stakeholders to comprehensively understand the future state of the AI application. It also helps in assessing risks, constraints, dependencies and other critical factors before the start of the execution cycle, thereby ensuring a high level of preparedness. AI development is expensive and time-consuming, and unforeseen shocks and high unpredictability can be extremely disruptive. Well-designed architectures also ensure that teams never lose sight of the key functional and qualitative attributes of their applications at any stage. This is specifically relevant because AI projects face multiple challenges at different stages of their life cycle, and robust architectural plans help keep the focus intact on delivering the core value proposition of the AI application.
Developing AI applications is more challenging, arduous and complex than regular software application development:
- Firstly, there may be a need for state-of-the-art computing platforms for the AI capabilities to be fully leveraged. For instance, Computer Vision applications are highly resource-intensive (compute, network and storage), and often require high performance infrastructure for the AI models to successfully run in production. This may necessitate advanced software engineering workloads (and the premium talent to deliver those workloads) that otherwise may not be needed in traditional application development.
- Secondly, data is the fuel of AI applications. Hence, architecting the discovery, capture, processing, storage, security and labeling of multiple data types (generated from a variety of sources) is integral to the architecture design of the overall AI application. Traditional data architectures often fall short in mission-critical AI applications, and new-age ‘out-of-the-box’ architectures may be needed.
- Thirdly, AI technologies are still evolving. This means that applications and systems built on these technologies have a high chance of failure during development, deployment and/or maintenance. An optimal risk-mitigation step is to determine the points of failure, assess the potential impacts, and ensure recovery provisions or other strategies. This is where end-to-end AI architecture designs, and the trade-off options among multiple factors, come into play.
A systems-driven approach to architectural design is a critical factor of success in AI application development. The broad philosophy here is to decompose complex systems into abstracted structures of components (and sub-components), and establish their inter/intra relationships in order to gain a better understanding about them, thus paving the way for smooth execution, implementation and monitoring.
At the same time, teams must be careful in avoiding some known architectural mistakes that are listed below.
- AI architects must be conscious of the ‘Ivory-Tower Architecture Trap’, and avoid over-emphasizing perfect architectures that may not reflect the scenarios, constraints and dependencies of the real world.
- There is often a tendency to force-fit certain architectural patterns due to hype about these patterns, limited knowledge of the architects, or other factors. The ‘No Free Lunch’ theorem applies here as well, and the selected architectural patterns must be ‘genuinely’ relevant to the business problems.
- Avoid ‘Over-Architecture’ of AI applications, such as adding presumptive features, unnecessary components and logic, or new models that might ‘possibly’ be needed at some point in the long term. In general, a good architecture design ensures that future features and non-functional requirements, as and when needed, can be addressed with minimal technical challenges.
Closing Comments
Architecture design has often been considered an ‘Engineering exercise’, and not a Data Science or Machine Learning activity. While this approach may have worked in the past (and may still work in select cases), it does not truly reflect the needs of the AI age. There is a compelling need today to build AI applications that can successfully run in large-scale production environments, and the first step to achieving this is designing robust application or solution architectures. Furthermore, there is also a need to integrate (or assimilate) these architectures with enterprise, technology, infrastructure and data architectures.
Top AI Tech companies are leading the industry in strong architecture design. There are just a handful of them though, and some are even accused of tightly linking their architectures to their own products to control the global AI developer market. Having said that, these companies do contribute to the open-source community, and to the maturity of AI architecture standards. All these are small steps but great signs for the future. It is up to the rest of the AI adopters now to implement sound architecture design policies, and enable the development of world-class AI applications.