
Background: Stacks with four-letter acronyms
According to Wikipedia, the LAMP stack was coined in 1998 by Michael Kunze to describe what had emerged as a popular open source software stack for websites. When the World Wide Web exploded in popularity earlier in the ’90s, organizations used an ad hoc mixture of proprietary tools and operating systems, along with some open source software (OSS), to build websites. The LAMP stack quickly became the most popular set of fully OSS components for this purpose.
LAMP is an acronym that stands for the following:
- Linux – the operating system
- Apache HTTP Server – the web server
- MySQL – the database
- Perl, PHP, and/or Python – the application programming language
It is hard to believe today, but at the time, the idea of relying on open source software was controversial. Concerns about support, and about vulnerability given that the source code is visible to everyone, were eventually resolved. Open source proved irresistible because of the flexibility, cost efficiency, freedom from vendor lock-in, and rapid evolution of capabilities provided by popular OSS projects. The LAMP stack became one of the predominant drivers of enterprise adoption of open source.
The PARK stack
Like the rise of the web, the sudden explosion of interest in generative AI with large language models (LLMs), vision models (VMs), and others has driven interest in identifying the best core OSS components for a software stack tailored to the requirements for generative AI. This era now has the PARK stack. It was first suggested by Ben Lorica in “Trends Shaping the Future of AI Infrastructure,” in November last year.
PARK stands for the following:
- PyTorch – for model training and inference
- AI models and agents – the heart of generative AI
- Ray – for fine-grained, very flexible distributed programming
- Kubernetes – the industry-standard cluster management system
Here, I will provide a brief description of each one and the requirements it meets.
PyTorch
Model builders need an AI stack that can train and tune models. Application builders need efficient, scalable inference with models and the agents that use them.
PyTorch started as one of many tools for designing and training a variety of machine learning models. It’s now the most popular choice for this purpose. It is used to design and train many of the world’s most prominent generative AI models. Alternatives include JAX and its predecessor, TensorFlow.
PyTorch was developed and open-sourced by Meta. It is now maintained by the PyTorch Foundation. The ecosystem has expanded to include other projects, such as for inference (vLLM), distributed training and inference (DeepSpeed and Ray), and many libraries.
The cost of model inference drives the need for specialized and highly optimized inference engines, like vLLM. So, PyTorch is rarely used alone for inference, although the popular inference engines use PyTorch libraries.
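To make the inference side concrete, here is a minimal PyTorch sketch of a forward pass with gradient tracking disabled. The tiny model and input shapes are illustrative placeholders, not a production setup:

```python
import torch
import torch.nn as nn

# A tiny model standing in for a trained network (illustrative only).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()  # disable training-only behavior such as dropout

batch = torch.randn(3, 4)  # a batch of 3 inputs with 4 features each
with torch.no_grad():      # inference: forward pass, no gradient tracking
    logits = model(batch)

print(logits.shape)  # torch.Size([3, 2])
```

A production system would load trained weights and hand batching, scheduling, and memory management to an engine like vLLM, which is where most of the optimization effort goes.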
Incidentally, the rise of generative AI has also caused a resurgence in popularity for Python, in part because Python has been the most popular language for data science, of which generative AI is a natural part.
AI models and agents
The unique capabilities of generative AI applications are provided by one or more models and the agents that use them. The first wave of AI applications, often simple chatbots, used a single model that had been trained to understand human language very well, especially English, then tuned in various ways to apply that language skill more effectively: answering questions, avoiding undesirable speech, providing factual output, and so on.
Model architecture has rapidly evolved, including making smaller, more capable models and using collections of models (such as the mixture of experts architecture) that provide better efficiency while maintaining result quality.
However, models have some particular shortcomings. For example, they know nothing of events that happened after they were trained, and they are not trained on all the specialist data needed to be effective in every possible domain. Hence, application patterns rapidly emerged to complement the strengths of models. The first pattern was RAG (retrieval-augmented generation), where a repository of data is queried for relevant information, which is then sent as context along with the user query to a model for inference.
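The RAG pattern can be sketched in a few lines. The document store and keyword-overlap scoring below are toy placeholders for what would be a vector database and embedding-similarity search in a real system:

```python
# Minimal sketch of the RAG pattern: retrieve relevant context,
# then combine it with the user query into a single model prompt.
documents = [
    "Ray is a distributed programming system with an actor model.",
    "Kubernetes manages clusters of containers.",
    "PyTorch is used to design and train machine learning models.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Toy relevance score: number of words shared with the query.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does Kubernetes manage?"))
```

The composed prompt is what actually gets sent to the model for inference, so the model can answer from retrieved facts rather than from its training data alone.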
The more general approach today is agents, which have been defined this way: “software systems that use AI to pursue goals and complete tasks on behalf of users. They show reasoning, planning, and memory and have a level of autonomy to make decisions, learn, and adapt.” Pursuing user goals can mean finding and retrieving relevant contextual data, evaluating the quality and utility of retrieved information, summarizing findings, gracefully recovering from errors, and more.
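A minimal sketch of that goal-pursuit loop follows. The tool names and the hard-coded decision rule are hypothetical placeholders for choices a real agent would make via model inference:

```python
# Minimal agent loop: pick a tool, observe the result, stop when done.
def search_tool(query: str) -> str:
    return f"results for '{query}'"

def summarize_tool(text: str) -> str:
    return f"summary of {text}"

TOOLS = {"search": search_tool, "summarize": summarize_tool}

def run_agent(goal: str, max_steps: int = 3) -> str:
    memory = []
    for _ in range(max_steps):
        # In a real agent, a model chooses the next action from memory.
        action = "search" if not memory else "summarize"
        observation = TOOLS[action](goal if not memory else memory[-1])
        memory.append(observation)
        if action == "summarize":  # stand-in for a "goal achieved" check
            break
    return memory[-1]

print(run_agent("find recent MCP news"))
```

Real frameworks add the pieces this sketch omits: model-driven planning, error recovery, and persistent memory across steps.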
There is no one dominant model choice or even “family” of models. Similarly, there is no one agent framework to rule them all. This reflects both the very rapid evolution of models and agent design patterns and the diversity of possible AI applications, which makes it unlikely that any one choice will meet all needs.
Ray
Model training, various forms of tuning, and model inference require different distributed computing patterns, each demanding highly optimized implementations given the large energy consumption and related costs associated with generative AI. Single-GPU systems are too small for these tasks for the largest generative models. Even for smaller models, massive parallelism allows these processes to scale more effectively.
Model training, and tuning processes that involve additional training with new data, use a massive number of iterations: in each loop, data is passed through the model, and the model parameters (weights) are adjusted incrementally to reduce errors. These iterations must be fast and efficient. When the model parameters are distributed over several GPUs, very high-bandwidth exchange of updates is required. Training iterations have large memory footprints and massive data exchanges.
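The training iteration just described can be sketched in a few lines of PyTorch. The model, data, and learning rate here are illustrative stand-ins for a real training job:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible example

# One model, one optimizer, one loss function.
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(16, 4)  # a batch of training data
y = torch.randn(16, 1)  # target values

initial_loss = loss_fn(model(x), y).item()
for step in range(100):              # each loop is one training iteration
    opt.zero_grad()                  # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward pass through the model
    loss.backward()                  # compute gradients
    opt.step()                       # adjust weights incrementally
final_loss = loss.item()
```

At scale, every one of these steps is distributed: the forward and backward passes are sharded across GPUs, and the weight updates require the high-bandwidth exchanges noted above.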
Reinforcement learning is another part of tuning, used to improve more complex behaviors for particular domains. RL also requires massive numbers of fast iterations, but its scales and data access patterns are typically smaller, more fine-grained, and more heterogeneous.
Finally, the distributed computing pattern for inference is the same as the first step in a training iteration, where data flows through a model, but without the parameter update step.
Ray provides the flexibility for these disparate requirements. It is a fine-grained distributed programming system with an intuitive actor model abstraction. Ray was developed by researchers at the University of California, Berkeley, who needed an efficient and easy-to-use system for scaling up computation required for their reinforcement learning and AI research. The flexibility of Ray’s abstractions and the efficiency of its implementation make Ray well suited for the new distributed computing requirements generative AI has introduced.
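The actor abstraction at Ray's core can be illustrated with a small standard-library analogue: a stateful worker that processes messages serially and hands back futures. Real Ray actors (classes decorated with `@ray.remote`) run as processes distributed across a cluster; this single-process sketch shows only the pattern:

```python
import queue
import threading

# A minimal stdlib analogue of an actor: private state, a mailbox,
# and serial message processing (so no locks are needed on the state).
class Actor:
    def __init__(self):
        self._inbox = queue.Queue()
        self._count = 0  # actor-private state
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            msg, reply = self._inbox.get()  # one message at a time
            self._count += msg              # mutate state serially
            reply.put(self._count)

    def add(self, n: int) -> "queue.Queue":
        reply = queue.Queue()               # acts as a one-shot future
        self._inbox.put((n, reply))
        return reply

actor = Actor()
futures = [actor.add(i) for i in range(5)]  # fire asynchronously
results = [f.get() for f in futures]        # block on the results
print(results[-1])  # running total of 0+1+2+3+4 = 10
```

The fire-then-gather shape mirrors Ray's `.remote()` calls followed by `ray.get()`, which is what lets many fine-grained tasks run concurrently.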
Anyscale is a startup focused on productizing Ray. Ray’s core OSS was recently donated to the PyTorch Foundation, as mentioned above.
Kubernetes
Large scale model training and tuning, as well as scalable application deployment patterns, introduce many practical requirements, including management of clusters of heterogeneous hardware and other resources, as well as the processes running on them. Kubernetes has been the industry standard for cluster management for a decade, emerging from Google’s work on Borg, along with contributions from many other organizations. Kubernetes is part of the Linux Foundation. The main alternatives to Kubernetes are the management tools offered by the cloud vendors: AWS, Microsoft Azure, Google Cloud, and others. The advantage of Kubernetes is that it runs seamlessly on these platforms (offered as a managed service, or you can “roll your own”), as well as on-premises, providing the benefits of the cloud services without vendor lock-in.
At first glance, it might appear that the distributed capabilities of Ray and Kubernetes overlap, but in fact they are complementary. Ray handles very fine-grained, lightweight distributed computing and memory management, while Kubernetes provides coarser-grained management and a broad suite of application services required in modern environments (security, user management, logging and tracing, etc.). It is common for a containerized Ray application to run its own clustered processes within a set of containers in a Kubernetes cluster. In fact, the open source KubeRay operator lets you use Ray on Kubernetes without having to be an expert in Ray or container management.
What’s missing from PARK?
LAMP was never intended to provide everything needed for website deployments. It was the core upon which additional services were added as required. PARK is similar, although the presence of Kubernetes covers a lot of the general-purpose service requirements!
For generative AI applications, PARK users will have to think about new requirements, in addition to all the standard practices we have used for years. Let’s discuss a few topics.
Data and data management
Conventional data management requirements and practices still apply, but AI agents are driving changes too. Ben’s post on data engineering for machine users discusses a number of trends. For example, some providers are seeing agents dominate the creation of new database tables, and those tables are often ephemeral. Agents are also less tolerant of database query problems than humans are, and they are less careful about security.
Unstructured, multimodal data is growing in importance: video and audio as well as text. Use of specialized forms of structured data is also growing, such as knowledge graphs and vector databases for RAG applications and feature stores for structuring data more effectively.
Agent orchestration
Any distributed system needs careful management of the interactions between components, for purposes of security, resource management, and efficacy. The Model Context Protocol (MCP) and the Agent2Agent Protocol (A2A) are two of several emerging standards to allow models to discover available agent services and learn how to use them automatically. These promising capabilities also raise many concerns about security and the need for careful control, which is driving the emergence of new gateway and service projects tailored to the specific needs of agent-based applications, for example, ContextForge. Similarly, supporting features are being added to established tools to meet the same needs.
Memory management
Agents must manage and use the information they have acquired. This includes working within the context limitations of their models and focusing on the most useful information, to optimize both resource use and effectiveness. AI agent memory is an ongoing research topic, with projects and startups emerging, like MemVerge and Mem0, that emphasize the effective use of short-term (i.e., single-session) memory. Established persistence tools, such as Neo4j and Redis, are also being applied to the problem and also support longer-term memory across sessions.
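One simple form of short-term memory management is trimming conversation history to fit a context budget, keeping the most recent messages. The characters-per-token estimate below is a rough placeholder for a real tokenizer:

```python
# Minimal sketch of short-term agent memory under a context budget:
# keep the newest messages that fit within a token limit.
def estimate_tokens(text: str) -> int:
    # Rough placeholder: ~4 characters per token.
    return max(1, len(text) // 4)

def trim_context(messages: list[str], budget: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                       # older messages no longer fit
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = [
    "u: hello",
    "a: hi, how can I help?",
    "u: summarize this long report...",
]
print(trim_context(history, budget=10))
```

Real systems go further, summarizing or selectively recalling older messages rather than simply dropping them, which is exactly the problem the memory projects above are working on.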
Dex is a new approach that addresses a particular challenge caused by MCP and A2A: the explosion of information that gets added to the inference context memory. This memory is limited and performance quickly degrades when the context grows too large. Dex takes what an agent learns how to do once, like using MCP to learn how to query GitHub for repo information, and turns that knowledge into reusable code that both eliminates unnecessary repetition of the learning step and executes the task deterministically outside the model context. Dex also provides a form of long-term memory.
What’s next?
What are your thoughts about the PARK stack? What do you think of the four components versus alternatives? What AI application requirements do you think need more attention? Let us know!
