By Rene Buest
March 23, 2017 10:00 AM EDT
Amazon Web Services (AWS) broke the Internet again, or, more precisely, a typo did. On February 28, 2017, an Amazon S3 service disruption in US-EAST-1, AWS' oldest region, shut down several major websites and services such as Slack, Trello, Quora, Business Insider, Coursera and Time Inc. Other users reported that they were unable to control devices connected via the Internet of Things because IFTTT was down as well. Disruptions of this kind are becoming more and more business critical for today's digital economy. To prevent these situations, cloud users should always consider the shared responsibility model in the public cloud. However, there are also ways in which Artificial Intelligence (AI) can help. This article describes how an AI-defined Infrastructure, or an AI-powered IT management system, can help to avoid service disruptions at public cloud providers.
Amazon S3 Service Disruption - What Happened
After every service disruption, AWS writes a summary of what went on during the incident. This is what happened on the morning of February 28:
"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."
Bottom line: a typo crashed the AWS-powered Internet! AWS outages already have a long history, and the more AWS customers run their web infrastructure on the cloud giant, the more issues end customers will experience in the future. According to SimilarTech, Amazon S3 alone is already used by 152,123 websites and 124,577 unique domains.
However, following the philosophy of "Everything fails all the time" (Werner Vogels, CTO of Amazon.com) means that if you are using AWS you must "Design for Failure". This is something the cloud role model and video-on-demand provider Netflix does to perfection. To this end, Netflix has developed its Simian Army, an open source toolset that anyone can use to run a highly available cloud infrastructure on AWS.
Netflix "simply" uses the two levels of redundancy AWS offers: multiple regions and multiple availability zones (AZs). Using multiple regions is the master class of AWS usage. It is very complex and sophisticated, since you must build and manage entirely separate infrastructure environments within AWS' worldwide distributed cloud infrastructure. Multiple AZs are the preferred and "easiest" way to achieve high availability (HA) on AWS. In this case, the infrastructure is built within more than one data center (AZ). A single-region HA architecture is thus deployed in at least two AZs, with a load balancer in front of it controlling the data traffic.
However, even if "typos" shouldn't happen, the recent incident shows that human error is still the biggest issue in running IT systems. In addition, you can blame AWS only to a certain extent, since the public cloud is about shared responsibility.
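The multi-region idea can also be illustrated at the application level. The following is a minimal sketch under stated assumptions, not Netflix's or AWS' implementation: each region is represented by a caller-supplied fetch function, and the names `read_with_failover`, `RegionUnavailable`, `fetch_us_east_1` and `fetch_us_west_2` are hypothetical stand-ins for real storage clients. The function tries the regions in order and returns the first successful answer.

```python
from typing import Callable, Dict, List, Tuple


class RegionUnavailable(Exception):
    """Raised by a fetch function when its region cannot serve the request."""


def read_with_failover(
    key: str,
    fetchers: Dict[str, Callable[[str], bytes]],
    order: List[str],
) -> Tuple[str, bytes]:
    """Try each region in `order`; return (region, data) from the first that answers."""
    errors = []
    for region in order:
        try:
            return region, fetchers[region](key)
        except RegionUnavailable as exc:
            # Remember why this region failed and move on to the next one.
            errors.append(f"{region}: {exc}")
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


# Hypothetical stand-ins for real storage clients (assumptions for illustration).
def fetch_us_east_1(key: str) -> bytes:
    raise RegionUnavailable("index subsystem restarting")


def fetch_us_west_2(key: str) -> bytes:
    return b"replica copy of " + key.encode()


if __name__ == "__main__":
    region, data = read_with_failover(
        "report.csv",
        {"us-east-1": fetch_us_east_1, "us-west-2": fetch_us_west_2},
        ["us-east-1", "us-west-2"],
    )
    print(region, data)
```

A fallback like this only helps if the data has already been replicated to the second region, which is the customer's job to set up in advance.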
Shared Responsibility in the Public Cloud
An important characteristic of the public cloud is self-service. Depending on their DNA, providers take responsibility only for specific areas. The customer is responsible for the rest. In the public cloud, it is about sharing responsibilities - this model is called Shared Responsibility. The provider and its customers divide the duties among themselves, and the customer's own responsibility plays a major role. In the context of IaaS utilization, the provider is responsible for the operations and security of the physical environment. The provider takes care of:
- Set up and maintenance of the entire data center infrastructure.
- Deployment of compute power, storage, network and managed services (like databases) and other micro services.
- Provisioning of the virtualization layer that customers use to request virtual resources at any time.
- Deployment of services and tools customers can use to manage their areas of responsibility.
The customer is responsible for the operations and security of the logical environment. This includes:
- Set up of the virtual infrastructure.
- Installation of operating systems.
- Configuration of networks and firewall settings.
- Operation of their own applications and self-developed (micro) services.
Thus, the customer is responsible for the operations and security of their own infrastructure environment and the systems, applications, services, as well as stored data on top of it. However, providers like Amazon Web Services or Microsoft Azure provide comprehensive tools and services that customers can use, e.g., to encrypt their data and to enforce identity and access controls. In addition, enablement services (micro services) exist that customers can adopt to develop their own applications more quickly and easily.
As a result, customers are on their own in their area of responsibility and must own it. However, this part of the shared responsibility can be handled by an AI-defined IT management system or an AI-defined Infrastructure.
An AI-defined Infrastructure can help to avoid Service Disruptions
An AI-defined Infrastructure can help to avoid service disruptions in the public cloud. The basis of this kind of infrastructure is a General AI that combines three major human abilities that enable enterprises to tackle IT and business process challenges:
- Understanding: By creating a semantic data map, the General AI understands the world of the company in which its IT and business exist.
- Learning: By creating Knowledge Items, the General AI learns best practices and reasoning from experts. Knowledge is taught in atomic pieces of information (Knowledge Items) that represent separate steps of a process.
- Solving: With machine reasoning, problems are solved in ambiguous and changing environments. The General AI dynamically reacts to the ever-changing context, selecting the best course of action. Based on machine learning, the results are optimized through experiments.
To put this into the context of an AWS service disruption:
- Understanding: The General AI creates a semantic map of the AWS environment as part of the world in which the company exists.
- Learning: IT experts create Knowledge Items while they are configuring and working with AWS, from which the General AI learns best practices. Thus, the experts teach the General AI contextual knowledge that includes what, when, where and why something needs to be done - for example when a specific AWS service is not responding.
- Solving: The General AI dynamically reacts to incidents based on the learned knowledge. Thus, the AI (probably) knows what to do at this very moment - even if no high availability setup was considered from the beginning.
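The interplay of context, Knowledge Items and solving can be sketched in a few lines. This is a toy illustration only, not Arago's actual system: the `KnowledgeItem` class, the `solve` function, the context keys and the two example items are all hypothetical. Each item records when it applies, what it changes and why, and the solver keeps applying the first matching item until nothing matches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, str]


@dataclass
class KnowledgeItem:
    """One atomic step: when it applies, what it does, and why (all hypothetical)."""
    name: str
    why: str
    applies: Callable[[Context], bool]
    act: Callable[[Context], Context]


def solve(context: Context, items: List[KnowledgeItem], max_steps: int = 10) -> List[str]:
    """Apply the first matching item repeatedly; updates `context` in place."""
    steps: List[str] = []
    for _ in range(max_steps):
        match: Optional[KnowledgeItem] = next(
            (ki for ki in items if ki.applies(context)), None
        )
        if match is None:
            break  # no item applies any more, so the incident is handled
        context.update(match.act(context))
        steps.append(match.name)
    return steps


items = [
    KnowledgeItem(
        name="switch-to-replica",
        why="Reads fail while the primary storage region is down",
        applies=lambda c: c.get("storage") == "down" and c.get("read_source") == "primary",
        act=lambda c: {"read_source": "replica"},
    ),
    KnowledgeItem(
        name="notify-on-call",
        why="Humans should know that the system is running degraded",
        applies=lambda c: c.get("read_source") == "replica" and c.get("notified") != "yes",
        act=lambda c: {"notified": "yes"},
    ),
]

if __name__ == "__main__":
    ctx = {"storage": "down", "read_source": "primary"}
    print(solve(ctx, items))
```

The second item only becomes applicable after the first has changed the context, which is the sense in which separately taught steps combine into a solution.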
Frankly speaking, everything described above is no magic. Like every newborn organism, an AI-defined Infrastructure needs to be trained, but afterwards it can work autonomously, detecting anomalies and service disruptions in the public cloud and solving them. For this, you need the knowledge of experts who have a deep understanding of AWS and of how the cloud works in general. These experts need to teach the General AI their contextual knowledge, which includes not only what, when and where but also why. They have to teach the AI in atomic pieces (Knowledge Items, KIs) that can be indexed and prioritized by the AI. Context and indexing enable these KIs to be combined to form many solutions.
KIs created by various IT experts form pooled expertise that is further optimized by machine selection of the best knowledge combinations for problem resolution. This type of collaborative learning improves process time task by task. However, the number of possible permutations grows exponentially with added knowledge. Connected to a knowledge core, the General AI continuously optimizes performance by eliminating unnecessary steps and even changing routes based on other contextual learning. And the bigger the semantic graph and knowledge core get, the better and more dynamically the infrastructure can act in the event of service disruptions.
On a final note, do not underestimate the "power of we"! Our research at Arago revealed that basic knowledge overlaps by 33 percent, so this knowledge can be, and is, used outside a specific organizational environment, i.e. across different client environments. The reuse of knowledge within a client is up to 80 percent. Thus, exchanging basic knowledge within a community becomes imperative from an efficiency perspective and improves the abilities of the General AI.
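To get a feeling for how quickly the space of candidate solutions grows, here is a small back-of-the-envelope calculation. It rests on a simplifying assumption that is not stated in the article: every KI is distinct and is used at most once per solution, so a candidate solution is an ordered sequence of between 1 and n of the n available KIs. The function name `ordered_sequences` is hypothetical.

```python
import math  # math.perm requires Python 3.8 or later


def ordered_sequences(n: int) -> int:
    """Count ordered sequences of 1..n distinct items drawn from n items."""
    return sum(math.perm(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10):
        print(n, ordered_sequences(n))
```

Under this assumption, 3 KIs allow 15 orderings, 5 KIs allow 325, and 10 KIs already allow 9,864,100, which is why machine selection of the best combinations matters.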