
Flaky tests have long been a source of wasted engineering time for mobile development teams, but recent data shows they are becoming something more serious: a growing drag on delivery speed. As AI-driven code generation accelerates and pipelines absorb far greater volumes of output, test instability is no longer an occasional nuisance.
This steady rise has been documented across the industry, by everyone from small teams to Google and Microsoft. The recently launched Bitrise Mobile Insights report backs up the shift with hard numbers: the likelihood of encountering a flaky test rose from 10% in 2022 to 26% in 2025. In practical terms, the average mobile development team now hits unreliable test results in roughly one of every four workflow runs. That level of unpredictability has real consequences for organizations that depend on fast, confident release cycles. Flaky tests undermine trust in CI/CD infrastructure, force developers to repeat work and introduce friction at the point where stability matters most.
This rise in flakiness is not happening in a vacuum. Mobile pipelines are expanding rapidly. Over the past three years, workflow complexity grew by more than 20%, with mobile development teams running broader suites of unit tests, integration tests and end-to-end tests earlier and more often. In principle, this strengthens quality. In practice, it also increases exposure to non-deterministic behaviours: timing issues, environmental drift, brittle mocks, concurrency problems and interactions with third-party dependencies. As test coverage grows, so does the surface area for failure that has nothing to do with the code being tested.
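To make those failure modes concrete, here is a minimal, hypothetical Kotlin/JUnit sketch (the `SyncService` class and test names are invented for illustration): the first test races a fixed sleep against background work and will pass or fail depending on runner load, while the second waits on an explicit completion signal and stays deterministic.

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Assertions.assertTrue
import org.junit.jupiter.api.Test
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

class SyncServiceTest {

    // Hypothetical service: finishes "eventually" on a background thread.
    private class SyncService {
        @Volatile var status = "pending"
        private val done = CountDownLatch(1)

        fun start() = thread {
            Thread.sleep((50..250).random().toLong()) // simulated network latency
            status = "synced"
            done.countDown()
        }

        fun awaitCompletion(timeoutMs: Long): Boolean =
            done.await(timeoutMs, TimeUnit.MILLISECONDS)
    }

    @Test
    fun `flaky - asserts after a fixed sleep`() {
        val service = SyncService()
        service.start()
        Thread.sleep(100) // passes on a fast runner, fails when the CI machine is under load
        assertEquals("synced", service.status)
    }

    @Test
    fun `stable - waits for an explicit completion signal`() {
        val service = SyncService()
        service.start()
        assertTrue(service.awaitCompletion(2_000L)) // bounded wait instead of a guess
        assertEquals("synced", service.status)
    }
}
```

The difference is small in code but large in CI behaviour: the first variant encodes an assumption about machine speed, the second encodes the actual contract of the operation.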
At the same time, organizations are under pressure to move faster. The median mobile team is shipping more frequently than ever, with the most advanced teams shipping at twice the average speed of the top 100 apps. Against this backdrop, any friction in CI becomes a material risk. Engineers forced to rerun jobs or triage false failures lose hours that could have gone toward new feature work. Build costs rise as pipelines repeat the same work simply to prove a failure was not real. Over the course of a week, a few unstable tests can cascade into significant delays.
Tracking Down the Flakiness
One of the most persistent challenges is the lack of visibility into where flakiness originates. As build complexity rises, false positives and flaky tests tend to rise in tandem. In many organizations, CI remains a black box stitched together from multiple tools, even as artifact sizes continue to grow. Failures may stem from unstable test code, misconfigured runners, dependency conflicts or resource contention, yet teams often lack the observability needed to pinpoint causes with confidence. Without clear visibility, debugging becomes guesswork and recurring failures become accepted as part of the process rather than issues to be resolved.
The encouraging news is that high-performing teams are addressing this pattern directly. They treat CI quality as a top engineering priority and invest in monitoring that reveals how tests behave over time. The Bitrise Mobile Insights report shows a clear correlation: teams using observability tools saw measurable improvements in reliability and experienced fewer wasted runs. Improving visibility can have as much impact as improving the tests themselves; when engineers can see which cases fail intermittently, how often they fail and under what conditions, they can target fixes instead of chasing symptoms.
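The report does not prescribe a particular tool, and the underlying signal is simple to compute once per-run results are collected. The sketch below uses a hypothetical in-memory data model rather than any vendor's API; real pipelines would pull the same information from CI test reports. It flags tests that both passed and failed on the same commit, which is one common working definition of intermittency.

```kotlin
// Hypothetical record of a single test execution in CI.
data class TestRun(val testName: String, val commit: String, val passed: Boolean)

data class FlakinessReport(val testName: String, val flakyCommits: Int, val totalCommits: Int) {
    val flakinessRate: Double get() = flakyCommits.toDouble() / totalCommits
}

// A test that both passed and failed on the same commit changed outcome
// without a code change: the classic signature of flakiness.
fun flakinessByTest(runs: List<TestRun>): List<FlakinessReport> =
    runs.groupBy { it.testName }
        .map { (name, testRuns) ->
            val byCommit = testRuns.groupBy { it.commit }
            val flaky = byCommit.count { (_, results) ->
                results.any { it.passed } && results.any { !it.passed }
            }
            FlakinessReport(name, flaky, byCommit.size)
        }
        .sortedByDescending { it.flakinessRate }

fun main() {
    val runs = listOf(
        TestRun("LoginFlowTest", "abc123", passed = true),
        TestRun("LoginFlowTest", "abc123", passed = false), // same commit, different outcome
        TestRun("CheckoutTest", "abc123", passed = true),
        TestRun("LoginFlowTest", "def456", passed = true),
        TestRun("CheckoutTest", "def456", passed = true)
    )
    flakinessByTest(runs).forEach {
        println("${it.testName}: flaky on ${it.flakyCommits}/${it.totalCommits} commits")
    }
}
```

Even a crude report like this turns "the build is unreliable" into a ranked list of specific tests with a measurable rate, which is what makes targeted fixes possible.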
Increasing Observability Boosts Build Success
Better tooling alone will not solve the problem. Organizations need to adopt a mindset that treats CI like production infrastructure. That means defining performance and reliability targets for test suites, setting alerts when flakiness rises above a threshold and reviewing pipeline health alongside feature metrics. It also means creating clear ownership over CI configuration and test stability so that flaky behaviour is not allowed to accumulate unchecked. Teams that succeed here often have lightweight processes for quarantining unstable tests, time-boxing investigations and ensuring that fixes are prioritized before the next release cycle.
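Quarantining does not have to be heavyweight. As one minimal sketch, assuming a JUnit 5 test suite built with the Gradle JUnit Platform integration (team conventions will vary), an unstable test can be tagged so it keeps running and reporting without blocking the release gate:

```kotlin
// Tag the unstable test so it stays visible but stops gating merges.
import org.junit.jupiter.api.Tag
import org.junit.jupiter.api.Test

class CheckoutFlowTest {

    @Test
    @Tag("quarantine") // tracked in an issue with an owner and a fix-by date
    fun `checkout survives slow payment gateway`() {
        // test body unchanged; only its gating role changes
    }
}

// build.gradle.kts: the blocking PR suite excludes quarantined tests,
// while a separate scheduled job can still run them to collect data.
//
// tasks.test {
//     useJUnitPlatform {
//         excludeTags("quarantine")
//     }
// }
```

The important part is not the mechanism but the process around it: quarantine is a holding area with an owner and a deadline, not a place where unstable tests quietly go to die.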
As automation continues to expand across the software development lifecycle, the cost of poor test reliability will only increase. AI-assisted coding tools and agent-driven workflows are generating more code and more iterations than ever before. This increases the load on CI and amplifies the effects of instability. Without a stable foundation, the throughput gains promised by AI evaporate as pipelines slow down and engineers drown in noise.
Flaky tests may feel like a quality issue, but they are also a performance problem and a cultural one. They shape how developers perceive the reliability of their tools. They influence how quickly teams can ship. Most importantly, they determine whether CI/CD remains a source of confidence or becomes a source of drag.
Stability will not improve on its own. Engineering leaders who want to protect release velocity and maintain confidence in their pipelines need clear strategies to diagnose and reduce flaky behaviour. Start with visibility: understand when and where instability emerges. Treat your CI/CD infrastructure with the same discipline as production systems, and address small failures before they become systemic ones. Once development teams get on top of flaky tests, they build a competitive advantage, improving release velocity and quality and freeing themselves to focus on what matters most: the mobile user experience.
