Gremlin Brings Chaos Engineering To Every Cloud Organization - Reducing System Downtime and Saving Millions

Grazed from Gremlin

Gremlin helps companies build more resilient systems through a new engineering philosophy called chaos engineering. It is launching with the availability of its Gremlin tool and announcing Series A funding from Index Ventures and Amplify Partners. Starting today, any company will be able to employ chaos engineering to safely inject failure into systems in order to proactively identify and fix unknown faults - similar to an engineering flu shot.

North American businesses lose over $700 billion a year to outages. In 2017 alone, major companies including Amazon, WhatsApp, and Slack all experienced outages that hurt the bottom line and inconvenienced customers. This unreliability stems from a complexity gap in how distributed systems are built. Previously, software ran in a controlled, bare metal environment that introduced few variables, making it possible for engineering teams to identify potential risks and failures before they occurred. Within the last decade, systems have shifted to the cloud and become distributed with microservices and serverless methodologies, which introduced new dependencies on services outside of one's control - creating more complexity than any team of engineers can fully understand. This makes failures and outages inevitable.

Cloud Computing: Why Apple's website crash is troubling

Grazed from CBSNews. Author: Erik Sherman.

Apple (AAPL) had a real hit with the iPhone 6 introduction -- so much so that when it began taking orders on Friday, the flood of traffic crashed the Apple Store. It was restored but availability has been intermittent. The problem has a silver lining, as it likely means Apple's financial results through the rest of the calendar year will be rosy, even if broad popularity for the Apple Watch is unclear.

But combined with the video streaming outage during Tuesday's product introduction event, it's a reminder that Apple has yet to completely master the highest demand situations in cloud computing -- an area that is critical to the company's strategy...

Cloud Services And The Hidden Cost Of Downtime

Grazed from NetworkComputing. Author: Frank J. Ohlhorst.

As any networking professional knows, downtime costs money. However, few know exactly how much money downtime costs. Estimates, calculations, and incidentals are all open to interpretation. This creates a lot of uncertainty. Cloud computing is a good tool to use here. Many IT pros are turning to cloud-based technologies to mitigate the cost of downtime. However, is the viability of a cloud migration backed by facts or based on suppositions?

The assumption that cloud services can reduce downtime is founded on the belief that third-party providers deploy all sorts of continuity technology that all but guarantees uptime. That belief, coupled with service-level agreements (SLAs) that make promises about limiting unscheduled interruptions in service, can give you a sense of security. The real question becomes whether that sense of security is false or justified -- and, more importantly, whether a value can be assigned to it...

Free cloud storage service MegaCloud goes dark

Grazed from ComputerWorld. Author: Brandon Butler.

The website for MegaCloud, a provider of free and paid consumer cloud storage, is inaccessible, and users of the service are complaining on social media sites that they have not had access to their data for days. It's unclear why the service is not working: whether it is because of a technical glitch or because the company has gone out of business. Attempts to access the site return a notice that it could not be found. None of the other pages associated with MegaCloud work either, such as the features page, pricing information and the contact links.

The service appears to have been dark for at least a couple of days. As of Friday, users on various Internet forums had been asking about the service and why it is down for two days. The site shows that the site is not accessible, and has about a dozen people questioning why...

Cloud Computing: Top 5 causes of virtual desktop and application downtime

Grazed from TechTarget. Author: Ed Tittel and Earl Follis.

Once you've bet your IT strategy on virtual desktops, availability and performance of those desktops become a priority. If users cannot access their virtual desktop or performance is too slow to get their work done, the villagers will no doubt come after you with torches, demanding their local desktops back. To help you avoid getting burned, let's look at the top five factors that affect virtual desktop and application downtime, plus how to avoid -- or at least mitigate -- these risks.

1. Lack of end-user monitoring

Have you ever received a flurry of help desk complaints about poor performance for specific applications? Your infrastructure team checks the network, servers, databases and applications, only to indicate that all infrastructure components are up and running as expected. This situation illustrates the difference between IT services being up and IT services being available from an end-user’s perspective. If users perceive a performance issue, then you have a performance issue, whether your monitoring tools reflect that perception or not. You can avoid this problem by having an end-user performance monitoring tool in your overall monitoring toolkit. These tools can give you performance statistics and alerts based on availability and response time for your virtual desktops, from an end-user perspective...

Cloud Computing: Why you should expect more online outages but less downtime

Grazed from GigaOM. Author: Stacey Higginbotham.

Gmail went down for 18 minutes during prime email checking hours on the West Coast thanks to a routine software update conducted Monday morning. But in an era of continuous code deployment, Google’s mid-morning update isn’t unusual -- it’s the future.

Google’s webmail service Gmail was down for 18 minutes last week after a “routine update” broke the service for a few minutes. The search giant reported that it conducted a routine update of its load balancing software between 8:45 AM PT and 9:13 AM PT and after the problems were detected managed to quickly roll back the buggy code. But this didn’t stop some people from questioning why Google would roll out a software update at what are peak email-checking hours on the West Coast...

Cloud outage report of 13 providers reveals downtime costs

Grazed from TechTarget. Author: Stuart Johnston.

Amazon's cloud services downtime earlier this month annoyed some AWS customers but beyond that, it raised the question of how expensive cloud outage downtime can be.

This week, a study group called the International Working Group on Cloud Computing Resiliency (IWGCR) released its first Availability Ranking of World Cloud Computing report. The working group was formed in March by two Paris-based higher educational institutions, Telecom ParisTech and Paris 13 University.

The report's bottom line isn't pretty. It estimates the average unavailability of cloud services at 10 hours per year or more, while the average availability is estimated at 99.9% or less...
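
The two figures in that excerpt are tied together by simple arithmetic, which the short sketch below illustrates. It is not taken from the IWGCR report: the function names are made up for illustration, and it assumes a non-leap year of 8,760 hours with downtime measured around the clock.

```python
# Illustrative only: converts between yearly downtime and availability.
# Assumption: a year has 365 * 24 = 8760 hours (leap years ignored).
HOURS_PER_YEAR = 365 * 24


def availability_pct(downtime_hours_per_year: float) -> float:
    """Availability (in percent) implied by a given number of down hours per year."""
    return 100.0 * (1.0 - downtime_hours_per_year / HOURS_PER_YEAR)


def downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # 10 hours of downtime in a year
    print(f"{availability_pct(10):.3f}% available")
    # 99.9% availability over a year
    print(f"{downtime_hours(99.9):.2f} hours down")
```

Under these assumptions, 10 hours of downtime in a year works out to roughly 99.886% availability, and 99.9% availability allows about 8.76 hours of downtime, so the report's two estimates are consistent with each other.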