Putting AI and Machine Learning to Work in Cloud-Based BI and Analytics
Machine learning (ML) in the cloud is powering a whole new generation of intelligent, predictive cloud analytics solutions such as Azure Databricks and Azure Synapse. The benefits of cloud economics, tooling and flexibility, along with next-level insights that drive real-time business decisions, are the primary drivers behind the growing trend of migrating on-premises data lakes to the cloud.
Cloud analytics services like Synapse are designed to collect and analyze current, actionable data, delivering insights into processes and workflows that can impact business operations. But what if you need those insights immediately, always accurate and up to date, and in the hands of employees and experts who are working simultaneously across the globe? IT stakeholders are turning to the cloud for faster, more accurate and timelier business insights, especially in the face of Covid-19, when companies are looking to operate as economically as possible and millions of people have been forced into remote working locations.
Even before the pandemic, a 2019 survey by TechTarget found that 27% of respondents planned to deploy cloud analytics in 2020. That same study pointed to increased use of cloud technology as the number two activity that companies were employing to improve employee experience and productivity, and noted that 38% of companies planned to bolster their cloud technology in 2020. In speaking to the experts at AWS and Azure, we have heard that the number is higher today. Hindsight is also 20/20!
There are multiple reasons that organizations are moving their data lakes and analytics capabilities to the cloud. First among them is cost: the move streamlines the workforce, so even though the migration process involves start-up costs, the long-term cost-benefit analysis plays out in their favor. Companies are also able to run faster and lighter with cloud analytics: there is no need to run dedicated client-side applications, and IT teams are freed from coordinating upgrades across an entire infrastructure. In our experience across our customer base at WANdisco, and in working with CSPs like Azure and AWS, we have found that the total cost of ownership of a 1PB Hadoop data lake on premises averages $2M over a three-year period. Managing that same 1PB in AWS S3 or Azure ADLS Gen2 storage costs $900,000 over three years.
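To make the comparison concrete, the arithmetic behind those two figures can be sketched in a few lines of Python. The two totals are the ones quoted above; the even per-year split is an illustrative assumption, and the one-time migration start-up costs mentioned earlier are not modeled.

```python
# Three-year total cost of ownership for 1 PB, as quoted in the article.
ON_PREM_TCO_USD = 2_000_000  # on-premises Hadoop data lake
CLOUD_TCO_USD = 900_000      # AWS S3 or Azure ADLS Gen2 storage
YEARS = 3


def tco_comparison(on_prem: int, cloud: int, years: int) -> dict:
    """Summarize the savings implied by two TCO totals.

    Assumption: costs are spread evenly across the years, which real
    budgets rarely are. One-time migration costs are not included.
    """
    savings = on_prem - cloud
    return {
        "savings_usd": savings,
        "savings_pct": round(100 * savings / on_prem, 1),
        "on_prem_per_year_usd": round(on_prem / years),
        "cloud_per_year_usd": round(cloud / years),
    }


result = tco_comparison(ON_PREM_TCO_USD, CLOUD_TCO_USD, YEARS)
for name, value in result.items():
    print(f"{name}: {value:,}")
```

On these figures the cloud option comes out $1.1M, or 55%, cheaper over the three years, before the cost of the migration itself is taken into account.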
The question is how to migrate that 1PB data lake as rapidly as possible (time to value), with zero downtime, while ensuring the data stays consistent on premises and in the cloud during the migration. Business-critical data is always changing. The architects and data teams have two choices.
They can use various flavors of open source DistCp tools and scripts, which is the manual approach to a data lake migration. Don’t be fooled by fancy names from the Hadoop or cloud vendors. It’s all DistCp under the covers. What’s wrong with this approach? It’s an IT project, and 61% of IT projects either fail or suffer cost and SLA overruns. Here’s what you have to do in this scenario:
- Find a project manager to run the entire project
- Find a business analyst to define requirements
- Look for a Hadoop and cloud architect to review requirements and design a solution
- Tap into an already overworked development team to take on the DistCp scripting work within an existing sprint
- Do unit testing and then validation testing
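As an illustration of what the validation testing step involves, here is a minimal Python sketch that compares a source tree against a migrated copy by checksum. It is a hedged example, not part of any DistCp tooling: the function names are hypothetical, and local directories stand in for HDFS and cloud object storage, which a real migration would reach through their own clients.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum_tree(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    digests: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            hasher = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    hasher.update(chunk)
            digests[path.relative_to(root).as_posix()] = hasher.hexdigest()
    return digests


def compare_trees(source: Path, target: Path) -> Dict[str, List[str]]:
    """Report files that are missing, unexpected, or different in target."""
    src = checksum_tree(source)
    dst = checksum_tree(target)
    shared = src.keys() & dst.keys()
    return {
        "missing_in_target": sorted(src.keys() - dst.keys()),
        "extra_in_target": sorted(dst.keys() - src.keys()),
        "content_mismatch": sorted(k for k in shared if src[k] != dst[k]),
    }
```

A one-off comparison like this is only meaningful if the source stops changing while it runs. That is exactly the difficulty with business-critical data that keeps changing during a migration.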
How long can this take? We have seen teams struggle for months and even years depending on data volume and business requirements around acceptable application downtime, data availability and data consistency. We’ve seen companies put 8-10 people on projects, fail after 6 months, then pay $1M to a systems integrator and fail after another 9 months. OUCH.
There is a better way. And forward-looking companies like AMD, Daimler, and many others have figured it out. How?
By leveraging modern technology to automate data lake migration and replication to the cloud with WANdisco LiveData Cloud Services through its patented Distributed Coordination Engine platform.
This innovation is founded on fundamental IP for forming consensus in a distributed network. This is an extremely hard problem to solve, and to this day some people believe that it cannot be solved. So what is this problem at a high level? If you have a network of nodes distributed across the world, with little to no knowledge of the distance and bandwidth between them, how can you get the nodes to coordinate with each other without worrying about failure scenarios?
The solution is the application of a consensus algorithm, and the gold standard in consensus is an algorithm called Paxos. Our Chief Scientist, Dr. Yeturu Aahlad, an expert in distributed systems, devised the first, and even now only, commercialised version of Paxos. By doing so, he solved a problem that had been puzzling computer scientists for years.
WANdisco’s LiveData Cloud Services are based on this core IP. They include our products for analytical data, which address the challenge of migrating this data to the cloud and keeping it consistent in multiple locations.
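For readers who have not met Paxos before, the following is a minimal textbook sketch of single-decree Paxos in Python. It is not WANdisco's Distributed Coordination Engine, whose implementation is proprietary; the class and function names are hypothetical, and messages are simulated as direct, synchronous method calls. Real deployments must also handle message loss, durable state, and unique proposal numbers per proposer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                    # highest proposal number promised
    accepted_n: int = -1                  # number of the accepted proposal
    accepted_value: Optional[str] = None  # value of the accepted proposal
    alive: bool = True                    # False simulates a failed node

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if not self.alive or n <= self.promised:
            return None
        self.promised = n
        return (self.accepted_n, self.accepted_value)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if not self.alive or n < self.promised:
            return False
        self.promised = n
        self.accepted_n = n
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, the proposer must adopt
    # the one attached to the highest proposal number.
    _, previous_value = max(promises, key=lambda p: p[0])
    if previous_value is not None:
        value = previous_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


nodes = [Acceptor() for _ in range(5)]
nodes[4].alive = False  # one node is down; four of five still form a majority
first = propose(nodes, n=1, value="replicate /data/a")
second = propose(nodes, n=2, value="replicate /data/b")
print(first, second)
```

Both calls return the first value. Once a majority has accepted a value, any later proposer learns of it during the prepare phase and must re-propose it, which is how the nodes stay in agreement despite the failed node.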
As businesses ask for data to be available in increasingly decentralized environments, the old mechanisms for providing and managing data are no longer sufficient. Moreover, the amount of data is rising exponentially, which leads to a phenomenon called data gravity. The larger the volume of data, the harder it is to provide it in a distributed environment, allow it to be changed in any environment, and ensure it remains consistent across all environments. Additionally, regulation and compliance requirements make it even more challenging for data managers to fulfil business needs.
As enterprises look to leverage the scale and economics of the cloud, WANdisco offers a fundamentally different approach to managing these large volumes of data, accelerating enterprises’ ability to undergo digital transformation.
Here’s what Merv Adrian, Research VP of Data and Analytics at Gartner, had to say: “WANdisco’s ability to move petabytes of data without interrupting production and without risk of losing the data midflight is something no other vendor does and, until now, has been virtually impossible to accomplish.”
The Bottom Line
Cloud computing has transformed entire industries, computing paradigms and enterprises, and has become the ideal platform for storing and accessing big data sets. The Covid-19 pandemic has only accelerated this move, given the need to operate as economically as possible with more employees working remotely. Cloud computing saves both money and time, which makes it immediately attractive to businesses, while also increasing access for global companies by providing a synergistic platform for coordination and cooperation between far-flung employees. 85% of the Fortune 500 have moved to the cloud, and the move continues. Migrating static data has been easy. The challenge now is how to quickly migrate and replicate large on-premises data lakes and applications to the cloud when the data is business critical and application downtime, data loss and inconsistencies cannot be tolerated. The good news is that there is now a better way: automated migration and replication that delivers 10X faster time to value and is 100% safer, while ensuring zero downtime during migration.