Platform Engineering Tools Can HURT More Than HELP
Summary
This YouTube video transcript discusses the “power tools problem” in platform engineering, arguing that while tools like Kubernetes, Kafka, and Istio are powerful, they are often a bad choice for platform engineering in 2024 because they add unnecessary complexity and burden to delivery teams.
The speaker, Steve Smith from Equal Experts, explains that platform engineering should accelerate delivery teams without creating extra work for them. He highlights the value of platform engineering in achieving engineering excellence (speed, quality, reliability) and references the metrics in Dr. Nicole Forsgren’s “Accelerate” book. He shares his experience with platform engineering successes and failures in various organizations, emphasizing that it’s easy to get platform engineering wrong.
The core issue is the “power tools problem,” where platform teams implement core capabilities using heavyweight tools like Kubernetes for container orchestration, Kafka for messaging, and Istio for service mesh. These tools, while powerful and feature-rich, require significant effort to learn, build, run, and support, both for the platform team and the delivery teams who must interact with them. The speaker stresses that the “babysitting cost” (operational overhead) of these tools is often overlooked compared to cloud costs.
Delivery teams are forced to spend time configuring and maintaining these complex tools, diverting their focus from core product features and customer value. Even managed services of these tools, like GKE or Confluent Kafka, are not truly simple and still impose cognitive load. This leads to unplanned technical work, increased rework rate, and developers spending more time on configuration (YAML files) than business logic.
The speaker provides two anonymized examples:
- American Broadcaster: A platform team insisted on using AKS (Kubernetes on Azure), leading to six months of arguments with delivery teams who wanted simpler solutions. The platform team won, but after the six-month delay and the resulting implementation, the platform lead struggled to justify the platform to the business.
- German E-commerce Company: A platform team heavily invested in self-hosting Istio and its advanced features. They were hesitant to upgrade due to past outages and relied on a Dutch company’s blog posts for upgrade instructions, creating an absurd and unsustainable technology dependency.
The root cause of the “power tools problem” is platform teams working “inside out” – focusing on the tools they want to use rather than working “outside in” – focusing on the capabilities their delivery teams truly need. Kubernetes, Kafka, and Istio, while good tools in general, don’t always provide enough unique value in the context of platform engineering to justify their high total cost of ownership (TCO), especially in terms of cognitive load and operational overhead for delivery teams.
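The total cost of ownership argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the class, the function, the hourly rate, and every figure in it are hypothetical assumptions, not numbers from the video. It adds the "babysitting cost" (platform and delivery team hours) to the cloud bill for two imaginary options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityCost:
    """Monthly cost inputs for one platform capability (all figures hypothetical)."""
    name: str
    cloud_bill: float               # monthly cloud spend
    platform_hours: float           # platform team hours spent running it per month
    delivery_hours_per_team: float  # hours each delivery team spends on it per month
    delivery_teams: int             # number of delivery teams using the capability


def monthly_tco(cost: CapabilityCost, hourly_rate: float) -> float:
    """Cloud bill plus the people time spent by platform and delivery teams."""
    people_hours = cost.platform_hours + cost.delivery_hours_per_team * cost.delivery_teams
    return cost.cloud_bill + people_hours * hourly_rate


# Assumed inputs: the lightweight option has a higher cloud bill but needs less care.
heavyweight = CapabilityCost("heavyweight", cloud_bill=8000, platform_hours=160,
                             delivery_hours_per_team=20, delivery_teams=10)
lightweight = CapabilityCost("lightweight", cloud_bill=10000, platform_hours=40,
                             delivery_hours_per_team=4, delivery_teams=10)

RATE = 100.0  # assumed blended hourly cost of an engineer

for option in (heavyweight, lightweight):
    print(f"{option.name}: {monthly_tco(option, RATE):.0f} per month")
```

With these made-up inputs the people time dominates the cloud bill, which is the pattern the speaker describes. Real figures would have to come from a team's own time tracking and invoices.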
The solution proposed is to shift from heavyweight to lightweight capabilities. This involves replacing:
- Kubernetes with something like ECS on Fargate.
- Kafka with something like Kinesis.
- Istio with something like App Mesh.
The speaker emphasizes that the intent behind tool choices is more important than the specific tools themselves. The intent should be to simplify life for delivery teams. He recommends using the Strangler Fig pattern for replacing heavyweight capabilities with lightweight alternatives, highlighting that platform engineering should leverage well-established technical practices.
The process for solving the “power tools problem” involves:
- Identify the most painful platform capability for delivery teams. Ask them directly about their biggest frustrations.
- Announce the current capability as “version one” and lock it down for new services.
- Build a lightweight “version two” of that capability.
- Put new services on version two and migrate existing workloads from V1 to V2.
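The steps above follow the Strangler Fig pattern, and the bookkeeping behind them can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: `CapabilityRouter` and its methods are hypothetical names, and a real platform would record this state in its service catalogue or deployment tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CapabilityRouter:
    """Tracks which version of a platform capability each service uses.

    Hypothetical helper: v1 is the locked-down heavyweight implementation,
    v2 is the lightweight replacement.
    """
    v1_services: set[str] = field(default_factory=set)
    v2_services: set[str] = field(default_factory=set)

    def onboard(self, service: str) -> str:
        """Return the version a service should use; new services always get v2."""
        if service in self.v1_services:
            return "v1"  # existing workload, stays on v1 until it is migrated
        self.v2_services.add(service)
        return "v2"

    def migrate(self, service: str) -> None:
        """Move one existing workload from v1 to v2."""
        if service not in self.v1_services:
            raise ValueError(f"{service} is not running on v1")
        self.v1_services.remove(service)
        self.v2_services.add(service)

    def v1_retired(self) -> bool:
        """v1 can be switched off once no workloads remain on it."""
        return not self.v1_services


router = CapabilityRouter(v1_services={"checkout", "search"})
print(router.onboard("recommendations"))  # new service lands on v2
router.migrate("checkout")
router.migrate("search")
print(router.v1_retired())
```

The point of the design is that v1 never gains new workloads, so the migration backlog only shrinks.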
The benefits of moving to lightweight tools are:
- Potential Acceleration: While both heavy and lightweight tools can accelerate teams, lightweight tools often do so more easily by reducing friction.
- Reduced Cognitive Load: Significantly decreases the mental burden on delivery teams, freeing them to focus on business logic.
- Reduced Unplanned Tech Work: Minimizes the time spent on configuration and troubleshooting platform complexities.
- Lower Platform Team Costs: Lightweight tools are simpler to build, run, and maintain, allowing the platform team to focus on new features and improvements.
The speaker concludes by promising to discuss another platform engineering problem in a future video, pending the arrival of a parrot fish for his colleague.
Accuracy
The information presented in the transcript is largely accurate and reflects a valid perspective on platform engineering challenges in 2024. Here’s a breakdown of accuracy points:
- “Power Tools Problem” is a real concern: The core argument about the cognitive load and operational overhead of complex tools like Kubernetes, Kafka, and Istio in platform engineering is valid and widely recognized. Many organizations have experienced the challenges of adopting these technologies without fully considering their impact on developer experience and overall productivity.
- Kubernetes, Kafka, Istio complexity: The transcript accurately portrays these tools as powerful and complex. They require specialized expertise to manage and can be overkill for simpler applications or organizations with limited platform engineering maturity.
- Managed Services are not a silver bullet: The point that managed services for these tools simplify operations but don’t eliminate complexity or cognitive load for delivery teams is also accurate. Configuration, understanding underlying concepts, and troubleshooting still require effort.
- Focus on developer experience (outside-in approach): The emphasis on prioritizing delivery team needs and working “outside-in” is a fundamental principle of effective platform engineering and aligns with best practices like DevOps and platform-as-a-product thinking.
- Strangler Fig pattern relevance: Applying the Strangler Fig pattern for migrating from heavyweight to lightweight platform capabilities is a sound and well-established approach for legacy system modernization and platform evolution.
- ECS Fargate, Kinesis, App Mesh as lighter alternatives: These AWS services are generally considered simpler and more managed alternatives compared to Kubernetes, Kafka, and Istio, respectively, particularly for organizations already invested in the AWS ecosystem. They offer a lower operational burden at the cost of some flexibility and control.
- Cognitive load impact: The transcript correctly highlights the negative impact of high cognitive load on delivery teams, leading to reduced productivity, increased errors, and slower delivery cycles.
- Examples are realistic: The anonymized examples, while anecdotal, are representative of real-world challenges faced by organizations adopting complex platform technologies without careful planning and consideration of developer needs.
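- Accelerate metrics relevance: Referencing the “Accelerate” book is relevant because its four key metrics (deployment frequency, lead time for changes, time to restore service, and change failure rate) are widely used to measure software delivery performance and can be directly affected by platform engineering choices. Rework rate, which the transcript also mentions, is a more recent addition from the DORA research program rather than a metric from the book itself.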
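To show how delivery metrics of this kind could be tracked against platform changes, here is a rough sketch that derives three of them from a deployment log. The `Deployment` record, its fields, and the sample data are hypothetical, and "rework rate" is taken here to mean the share of deployments that were unplanned fixes, which is an assumption about the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    day: int              # day number within the reporting period
    caused_failure: bool  # did this change degrade service?
    unplanned_fix: bool   # was this deployment unplanned rework?


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute deployment frequency, change failure rate, and rework rate."""
    total = len(deployments)
    if total == 0 or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    return {
        "deploys_per_day": total / period_days,
        "change_failure_rate": sum(d.caused_failure for d in deployments) / total,
        "rework_rate": sum(d.unplanned_fix for d in deployments) / total,
    }


# Made-up log: four deployments over eight days.
log = [
    Deployment(day=1, caused_failure=False, unplanned_fix=False),
    Deployment(day=3, caused_failure=True, unplanned_fix=False),
    Deployment(day=4, caused_failure=False, unplanned_fix=True),
    Deployment(day=7, caused_failure=False, unplanned_fix=False),
]
print(delivery_metrics(log, period_days=8))
```

Comparing these numbers before and after a platform change is one way to check whether the change helped delivery teams or added unplanned work.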
Nuances and Considerations:
- Context Matters: While the argument against “power tools” is valid in many contexts, it’s crucial to remember that Kubernetes, Kafka, and Istio are still valuable tools for specific use cases. For organizations with highly complex, large-scale, or specialized applications, these tools might be necessary and justifiable. The transcript’s message is primarily targeted at organizations that may be over-engineering their platforms or adopting these tools prematurely without a clear understanding of their costs and benefits.
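- Oversimplification of alternatives: While ECS Fargate, Kinesis, and App Mesh are simpler in some ways, they are not without their own complexities and limitations. For example, AWS has announced that App Mesh will reach end of support on September 30, 2026, so a lightweight managed service carries its own lifecycle risk. The “lighter weight” argument should be understood as relative and context-dependent. Choosing the “right” tool always involves trade-offs.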
- Platform Engineering Maturity: The “power tools problem” is often more pronounced in organizations with less mature platform engineering practices. As platform teams mature and develop better abstractions and automation, they can potentially manage complex tools more effectively and reduce the burden on delivery teams.
- Tooling is not the only factor: While tool choices are important, successful platform engineering also relies on organizational culture, collaboration, clear communication, and well-defined platform product management practices.
Overall Accuracy Assessment: The transcript provides an accurate and insightful perspective on the potential pitfalls of over-engineering platforms with complex tools. It raises valid concerns about cognitive load, operational overhead, and the importance of focusing on developer experience. While it presents a somewhat simplified view of tool choices, the core message about prioritizing delivery team needs and starting with simpler solutions is highly relevant and valuable in the current landscape of platform engineering.
Resources
Here are the top 5 most relevant resources to learn more about the subjects presented in the transcript:
- Team Topologies: Organizing Business and Technology Teams for Fast Flow by Matthew Skelton and Manuel Pais: This book provides a framework for organizing teams and platforms for fast software delivery. It directly addresses the concepts of platform teams, stream-aligned teams (delivery teams), and enabling teams. It helps understand how to structure platform teams to effectively support delivery teams without imposing unnecessary burdens, which is central to avoiding the “power tools problem.” It emphasizes team cognitive load and stream-aligned team autonomy.
- Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations by Nicole Forsgren, Jez Humble, and Gene Kim: Referenced in the transcript, this book is essential for understanding the metrics that matter for software delivery performance (speed, quality, reliability, and organizational performance). It provides the scientific basis for DevOps practices and platform engineering’s role in improving these metrics. Understanding these metrics helps justify platform engineering investments and measure the impact of tool choices on overall performance, including the negative impacts of “power tools” if not implemented thoughtfully.
- Platform Engineering by Humanitec (Website and Community): Humanitec is a company focused on Platform Engineering. Their website offers a wealth of resources, including blog posts, articles, and documentation on platform engineering principles, practices, and tools. They also host a community forum and events dedicated to platform engineering. This is a practical and up-to-date resource for understanding the latest trends and best practices in platform engineering, including discussions around tool choices and developer experience.
- The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations by Gene Kim, Jez Humble, Patrick Debois, and John Willis: This book provides a comprehensive guide to DevOps principles and practices, which are foundational to platform engineering. It covers topics like continuous delivery, feedback loops, and creating a culture of collaboration and learning. Understanding DevOps principles is crucial for building effective platforms that support fast, reliable, and secure software delivery. It provides the broader context for why focusing on developer experience and reducing friction is so important.
- Cloud Native Patterns (Website and Book by Cornelia Davis): This resource focuses on patterns and best practices for building cloud-native applications, often involving technologies like Kubernetes, but also emphasizing the importance of abstractions and developer experience. It helps understand how to use complex tools like Kubernetes effectively and how to build platforms that simplify their use for application developers. While the transcript argues against “power tools” in some contexts, this resource provides insights into how to use them well when they are necessary, and also when to choose simpler alternatives. It encourages thinking about appropriate levels of abstraction and not exposing unnecessary complexity to developers.
These resources offer a combination of theoretical frameworks (Team Topologies, Accelerate), practical guidance (DevOps Handbook, Platform Engineering resources), and specific patterns for cloud-native architectures (Cloud Native Patterns). They collectively provide a strong foundation for understanding platform engineering, the “power tools problem,” and how to build effective platforms that empower delivery teams.