This blog post is a continuation of our series 5 Reasons Why AI Fails to Scale. To see the last blog in this series, click here.
Building AI products and services is a complex undertaking. For team leaders, building and maintaining the AI team is often the first and biggest hurdle to getting up to scale. Most of these concerns are similar to the process of organizing a good engineering team — hiring and retaining the right talent, collaborating effectively, managing the best use of the expert team members’ time.
But unlike with standard SWE teams, the amorphous, nascent, and highly skills-dependent nature of ML tasks can cause even seasoned engineering leaders to fail in establishing a robust organization. The outcome can often be disastrous, because organizational issues in the AI team result in brittle AI applications with poor explainability, and we’ve heard many examples of companies deciding that ‘maybe this AI thing just isn’t for them’.
Here are five organizational issues that might be holding your AI team back from scaling.
1. Not Setting Up the Right Structure
The “AI Team” refers to all of the members of your organization who do machine learning. This might prove to be a larger group than you’d originally think, depending on the breadth and scope of AI projects at your organization. For example, if your company’s core product or service offering is AI-enabled, then you should expect to have a larger team of ML-capable engineers than if your AI use case is powering internal data analytics processes.
In some cases, a core cadre of data scientists and ML Engineers is called the AI team, leaving a wall between the infrastructure-concerned individuals, who operate in a different silo. Sometimes, data engineering, database admins, and a business analyst team get included in the AI team at large, paving the way for scope creep that will flummox both the AI team leader and the stakeholders expecting results that may be misaligned with the research-intensive nature of ML workflows.
In our experience, it is important to have a strong AI team leader in place — someone who can effectively interface with stakeholders while having sufficient understanding of engineering, data science, and ML infrastructure. With good leadership and a clear mission, the structure of the team can be established around core activities in a framework. For instance, we recommend examining the right positioning of boundaries between the AI team and DataOps, IT infrastructure, general software engineering, and business analytics, based on the mission. Internally, it is good to remember that the activities involved in experimenting, building, deploying, managing, and optimizing resources or operationalizing can flow across roles. Which brings us to the next major organizational issue.
2. Undefined and Overlapping Skillsets
There are myriad classifications of the roles that fit into an AI team — most ML software providers have a point of view that includes five or more roles, each with multiple variants or ‘sub-classes’. We’ve heard of Data Scientists, Applied Scientists, Research Scientists, Data Engineers, ML Engineers, MLOps Engineers, Infra / HPC admins, product managers, and of course, multi-disciplinary team leaders. Instead of focusing on the roles, of which there are countless permutations, we like to focus on the responsibilities.
This is where it gets tricky. There is no obvious prescription for the composition of an AI team, and in a high-velocity, experimental environment, team members often take on responsibilities that are well outside their realm of expertise. We’ve seen data scientists who struggle to write hardened code for deployment, and we’ve seen engineers with limited knowledge of models tinkering with pipelines to get them ready for production. Worst of all, we have seen time and again that the optimization and management of infrastructure get left by the wayside until the debt of scaling has ramped up.
With undefined requirements of talent, there is an inevitable build-up of inefficient work. It does appear that some roles are becoming much more neatly defined, with a class of engineers laying down boundaries around the now-prevalent roles of ML Engineers and MLOps Engineers. Which brings us to the next organizational issue.
3. Scarcity of Talent
If you’ve tried hiring an ML Engineer over the last year, you will have noticed the rarity of talent in this space. From our research it appears to take up to 6 months to hire an ML Engineer, and even more for an MLOps Engineer. While the market for other talent may be slowing down in the economic downturn of 2022, the supply of talented and experienced individuals remains limited for this role.
ML Engineers are a scarce and valuable resource. Companies compete over salaries, and qualified candidates field many offers. When teams are beginning their first AI projects, often with a data scientist paired with a software engineer, they quickly realize the need for the cross-domain expertise of the ML Engineer. After the initial prototype and MVP, even more ML talent is needed to take products to production. And given the living, high-maintenance, and esoteric nature of most AI applications, the need for more MLEs expands with the scale of the team’s work.
We see two ways around this issue, which is fundamentally about productivity. One approach that many companies attempt is to train engineers or data scientists to cover the gaps. Indeed, this can be a great band-aid and can even uncover the star power of some employees. The other approach is to invest in tools that multiply the productivity of your existing MLEs, shift rudimentary tasks to self-serve by other team members, and eliminate the silos that cause MLEs to become bottlenecks. Which brings us to the most common issue that we hear about: bottlenecks.
4. Bottlenecks in Expertise
Bottlenecks are perhaps the primary reason why the field of operations, then DevOps, and now MLOps came to exist. Expertise bottlenecks are particularly dangerous in light of the talent shortage: losing someone who is a bottleneck can pose an existential risk to operations. For the AI team, there are three types of bottlenecks that we see most often.
The so-called “universal expert” is the most common bottleneck that we see in larger organizations, where multiple models have been deployed into production but teams have continued to evolve. There is often that one person, usually a senior MLE or a data scientist, who understands what actually happened to build each AI project. We recommend that every team figure out who these people are, and bring in tools to help convert their institutional knowledge into frameworks, standards, and systems embedded in a platform.
Domain experts, or sometimes model-specific experts, are also a common bottleneck — particularly in verticals where there is a deeply specific type of problem to be solved. For instance, we have seen individuals become bottlenecks in AI systems in heavy industry use cases, where advanced knowledge of chemical or physical processes needs to be combined with a framework for modeling. The solution to this is the same — support the domain expert with productivity tools, and compose a custom platform that inherits the expertise from the infrastructure up.
The final prevalent expertise bottleneck is usually found with the MLE and MLOps Engineer. We think of this as the ‘deployment’ or ‘productionizing’ bottleneck. Most AI teams that are starting out begin with the MVP mentality, and indeed it is necessary to maintain this approach if you aim to be research-focused and continuously improving. The problem is that most teams forget how much work is involved in ‘getting a model into production’. A non-trivial amount of that work concerns the pipeline itself, and a lot of it concerns DevOps and software engineering practices. Not addressing this bottleneck leads not only to slow deployments but also to missed SLAs, as the maintenance and management of those pipelines becomes a burden. And if resources are incorrectly optimized, infrastructure costs can balloon very quickly. If you want to overcome this bottleneck, you will want to do the same thing as with the other bottlenecks: address productivity for the MLEs, and get the whole team onto a custom platform that increases self-serve and reduces the “throw it over the wall” mentality. Which brings us to our final organizational issue.
5. Onboarding and Collaboration
AI teams often struggle to communicate internally due to vastly varied skill sets, languages, and expertise. This is compounded when you consider that larger organizations often have multiple AI teams for different business units, often doing redundant work without optimizing how they scale compute infrastructure.
The most common issue that we encounter in teams that aren’t collaborating effectively is the “someone else’s problem” field that tends to surround tasks that aren’t clearly aligned to a role’s OKRs. For instance, we often hear that Kubernetes management and cloud resource management are someone else’s problem, far outside the purview of the AI team. Or we hear that a data scientist will send over their poorly annotated Python pipeline to the MLE, who has to decipher and rewrite pages of code, because “deployment is someone else’s problem”.
In the best case, this leads to simple inefficiencies and misalignments that come to the surface in budgets and leadership meetings. In the worst case, these issues can lead to interpersonal fallouts, resulting in talent exodus.
The difficulties with poor collaboration standards are compounded by the fact that the talent for AI teams is constantly moving around. If you want to onboard a new team member, you need tooling that lets them pick up existing work effectively. The answer to the problem of collaboration is not a simple one, but we believe it begins with alignment. A single-pane-of-glass platform, accessible by everyone, not only allows team leaders to manage and orchestrate multi-disciplinary tasks but also increases empathy, leading to a stronger team. Just think about how Jira or other team-focused tools support a well-functioning agile team, and you get the picture.
We believe that if you can get MLOps right, even the smallest AI teams achieve scale and eliminate the five most common organizational issues. That’s why we’re building the Petuum Platform. Our low code / no code UI empowers your entire team to build ML models so that a Data Scientist can deploy hardened code in production AI Apps as effectively as a seasoned ML Engineer, and a business-focused team leader can explain systems from monitoring all the way down to Kubernetes infrastructure.
In our next blog in this series, we will talk about infrastructure orchestration, and how it is a major reason why AI teams fail to scale up. Stay tuned and let us know your thoughts!