Archive for the ‘Cloud’ Category

Comprehensive Capacity Optimization – Deduplication 2.0

October 7th, 2009 Steve Kenniston No comments

Technology is great, isn’t it?  When someone thinks they have a new idea on the same old technology foundation, they call it “X 2.0”.  I have been watching the banter between analysts and vendors (specifically NTAP’s Dr. Dedupe and Permabit’s CEO Tom Cook) on the topic of Deduplication 2.0 for the past few weeks, and I believe the proverbial boat is being missed (since we are using water analogies), so I decided I have to jump in.  The real value of these conversations is their value to the end user.  At the end of the day, it doesn’t really matter who ‘coined’ or ‘invented’ a term like deduplication 2.0; what matters is whether the term actually helps describe a technology and how that technology can be leveraged to make things better in the data center.  We should focus on the implications of this new generation of deduplication – ‘deduplication 2.0’.

In May I delivered a presentation to a number of EMC customers on the topic of Data Deduplication 2.0 – Comprehensive Capacity Optimization.  The point of my presentation was simple (and keep in mind this was before the Data Domain acquisition): there are a number of capacity optimization technologies and capabilities available to customers today.  Originally these deduplication technologies were used primarily for backup, but slowly deduplication is making its way into primary storage.  Deduplication in primary storage makes a lot of sense FOR DATA THAT IS STATIC.  Why only static data?  Static data is data that isn’t used frequently (that doesn’t mean it’s not important, simply that it is not accessed often).  Because access to this data is infrequent, its performance requirements are lower than those of active data.  Remember: nothing in IT is free.  If I deduplicate data, I must ‘rehydrate’ it in order to use it, and that has a performance cost, so I want to be careful where I deduplicate so as not to hurt performance on production data.

Dr. Dedupe and Tom allude to Deduplication 2.0 moving beyond backup storage and into primary storage.  While deduplication in primary storage is technically possible, customers need to understand two important points:

1) Performance: whatever I do to deduplicate (I prefer ‘optimize’) capacity in order to save space, I must ‘undo’ in order to use the data.  If I set a policy that says any data that is 30 days old can be ‘optimized’, I need to be sure that data 30 days old is not active, or I could pay a substantial performance penalty when using it.  I might instead set a policy of ‘any data that hasn’t been touched in 30 days can be optimized’.  Even then, I would want to make sure that there is no scenario where, at the end of a quarter let’s say, I would need to rehydrate all of that data in order to run some report.

2) Comprehensive and cumulative deduplication throughout my storage tiers.  What do I mean?  If I optimize data on my primary storage using one set of technologies, say single instancing and compression, and then I back up this data using sub-file deduplication, a separate set of algorithms, what I am left with is two separate silos of deduplicated data, and no one wins in this scenario.
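To make the silo problem in point 2 concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: the function names, the file names, and the fixed 4 KB chunk size are all hypothetical. One store keeps a single compressed copy per unique file (single instancing plus compression), the other keeps one copy per unique fixed-size chunk (a simple form of sub-file deduplication).

```python
import hashlib
import zlib


def single_instance_store(files):
    """File-level dedupe: keep one compressed copy per unique file content."""
    store = {}
    for data in files.values():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, zlib.compress(data))
    return store


def chunk_store(files, chunk_size=4096):
    """Sub-file dedupe: keep one copy per unique fixed-size chunk."""
    store = {}
    for data in files.values():
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store


# Hypothetical sample data: an exact duplicate and a partially changed version.
files = {
    "report_v1.doc": b"A" * 8192,
    "report_copy.doc": b"A" * 8192,
    "report_v2.doc": b"A" * 4096 + b"B" * 4096,
}

primary = single_instance_store(files)  # index used on primary storage
backup = chunk_store(files)             # index used by the backup target

# The two indexes have no keys in common, so neither side can reuse
# what the other has already worked out about the data.
shared_keys = set(primary) & set(backup)
```

Because the primary index is keyed by whole-file hashes and the backup index by chunk hashes, `shared_keys` is empty. In this sketch the backup side would have to read the data back in its original form and deduplicate it again from scratch, which is the rehydration cost described above.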

It is important, no matter what deduplication technology you decide to use, that you can actually leverage the data stored in the deduplication device and that as data moves from device to device it doesn’t need to be rehydrated before it is moved.

A great use case of capacity optimization in primary storage is how EMC evolved the Celerra product this year.  Through a policy (let’s say any data that is older than 30 days), data is compressed and stored as a single instance, with users seeing as much as 30% to 50% storage savings.
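A policy of the ‘hasn’t been touched in 30 days’ kind can be sketched in a few lines of Python. This is only a sketch of the idea, not Celerra’s actual mechanism: the function name, the file names, and the dates are made up for illustration, and a real system would read last-access times from the file system rather than from a dictionary.

```python
from datetime import datetime, timedelta


def eligible_for_optimization(last_access_by_file, now, max_idle_days=30):
    """Return the files whose last access is older than the idle threshold."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, last_access in last_access_by_file.items()
        if last_access < cutoff
    )


# Hypothetical last-access times for three files.
last_access = {
    "q3_report.xls": datetime(2009, 8, 1),
    "active.db": datetime(2009, 10, 6),
    "old_scan.pdf": datetime(2009, 9, 1),
}

candidates = eligible_for_optimization(last_access, now=datetime(2009, 10, 7))
```

Here `candidates` contains `old_scan.pdf` and `q3_report.xls` but not `active.db`. Note that a quarter-end report that reads `q3_report.xls` would hit optimized data, which is exactly the scenario the policy owner needs to think through first.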

The real goal of Deduplication 2.0, and I think Dr. Dedupe alluded to this in his post “The Dedupe 2.0 Pundits Are Still Swimming in Lake 1.0”, is that customers win when deduplication technology is a part of the core system or file system, so that I no longer need to rehydrate data as I move it from primary storage to secondary storage.  If each storage device in the ‘stack’ understands the language of the device ahead of it, and the deduplication or file system is coordinated and cumulative from device to device, then the customer is the winner.  This pertains to primary storage, backup storage and archive storage.  Never having to rehydrate data means more efficiency and a reduced tax on devices, which can save the end user money.

Tom Cook, CEO of Permabit, points out in his blog post “Dedupe 1.0 vs. Dedupe 2.0: The debate ensues” that the only value of deduplication for primary storage is in moving your data to a deduplicated archive, which allows you to store data efficiently over the long term.  I agree with this in principle, but as we have seen, it is not that practical.  Why?  Because at the end of the day, the costs to manage storage are going up, up, up and the costs to buy storage are going down, down, down.  End users (NOT IT) are generally lazy, or should I really say, just too busy to manage this storage.  In order to properly archive data, you need to have a policy that tells you what to move and when to move it.  IT can make all the recommendations in the world about the value of archive, but if users, or really line-of-business managers, don’t tell IT what data is important and what can be archived, then IT doesn’t really have a choice, which makes the premise of moving data to an archive, deduplicated or not, moot.

The real issue is balancing capacity optimization (the granularity at which you deduplicate data) against performance on the appropriate tier of data, given that deduplication will happen on all tiers of storage.  The higher the performance requirements (tier 1), the less ‘optimized’ I make the data; the lower the performance requirements (tier x, archive), the more optimized I make the data.  The benefits to the customer are that optimization can be A) consistent across all of the customer’s devices, and B) cumulative from device to device, removing silos of deduplicated data across the stack.
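One way to picture ‘less optimized at the top, more optimized at the bottom, and cumulative in between’ is a small lookup like the Python sketch below. The tier numbers and the order of the techniques are assumptions chosen for illustration, not a description of any EMC product.

```python
# Hypothetical ordering of techniques, from cheapest to undo to most aggressive.
TECHNIQUES = ["single_instance", "compression", "sub_file_dedupe"]


def techniques_for_tier(tier):
    """Tier 1 gets no optimization; each lower tier adds one more technique."""
    count = max(0, min(tier - 1, len(TECHNIQUES)))
    return TECHNIQUES[:count]


tier_plan = {tier: techniques_for_tier(tier) for tier in range(1, 5)}
```

In this sketch each tier applies everything the tier above it applies plus one more step, so data moving down the stack only ever gains optimization and never has to be undone and redone in a different format.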

For more on tiered dedupe, read my Betamax Redux blog post on EMC’s vision for deduplication and hopefully this will put you on a high performance ‘Road to Recovery’.


A Data Protection Reference Architecture – The Final Chapter

September 1st, 2009 Steve Kenniston 2 comments

The Architecture

This ‘architecture’ diagram, as you can see, is not a typical architecture diagram, but hopefully it can be used to align your business objectives with the available technologies that can best be applied to solve your issues, helping to balance cost, complexity and compliance.

This diagram can also be used to do a couple of other things.  It can help you begin to classify your data and align it to your business objectives.  It also lets you begin to identify which data or data services in your environment may be more important to you than others and, based on this, helps you choose areas you may want to outsource or move to the cloud.

As you can tell, there really is not one solution for meeting all your data protection needs.  The challenge comes with managing multiple solutions in an effort to meet your business objectives.  There are only a few technologies available that allow you to manage your environment across all your RPOs and RTOs, and it is important that I point out EMC’s NetWorker is able to do this, centralizing your data protection infrastructure for ease of management.  It allows you to manage traditional backup, source-based deduplicated backup with Avamar, and CDP with RecoverPoint, as well as the EMC disk libraries and tape where the data is stored.  Now, I am not saying that NetWorker solves all of your data protection challenges, nor am I suggesting that replacing one traditional backup technology with another is the right answer.  What I am saying is that if you’re looking to have all the functionality required to meet your business objectives and you want easier management, NetWorker is one avenue to get you there.  Additionally, the underlying image of the triangle represents data protection management.  Putting all the new technology in place is one thing; managing it and ensuring you are now meeting your business needs is another.  EMC’s Data Protection Advisor can help here as well.

This diagram can help customers lay out a new, better data protection schema for their environment and start thinking about data protection a bit more strategically rather than tactically.  It can also help vendors speak to customers about how they should look at their environment in order to identify specific challenges and the means they need to alleviate these challenges, taking backup beyond.


A Data Protection Reference Architecture – Part 1

August 14th, 2009 Steve Kenniston No comments

This blog will have multiple parts.  I will introduce my view of a data protection reference architecture and the next few blog posts will talk to components of that architecture.

The other day I had a very interesting conversation with a colleague of mine in Australia.  He was looking for a data protection reference architecture that he could use to speak to his customer.  As you can imagine, having this conversation over the phone could prove to be difficult.  When the conversation began, my fear was that he was looking for an ‘architecture’ diagram that included data protection appliances, backup servers, disk libraries, tape libraries and backup agents.  I quickly realized that this is an impossible conversation to have with him without knowing:

A)     the customer’s environment or challenges

B)      the customer’s business objectives

I find that most vendors don’t know A or B when speaking to a customer about their data protection ‘issues’, but they really should.  A more thoughtful, consultative conversation is more relevant to customers: it helps in understanding their challenges and in aligning those challenges to the best possible solution.

I started my conversation with the diagram shown below (Figure 1): a simple triangle divided horizontally into 4 segments, with the middle two segments divided vertically in half.  Each segment represents different business objectives within a company.  As you go around the triangle, you can see that there are different technologies and different methodologies for attacking data protection challenges, which is why there is no longer a “one size fits all” approach when it comes to protecting data today.  Let’s face it: the two most important commodities in backup are time and capacity.  One of the primary drivers behind the type of protection that is used is the Recovery Point Objective, or RPO.  Different technologies provide different RPOs, each at a different price point, and there are also different processes that can be applied to meet RPOs.

Figure 1


Having a conversation specific to this diagram can have a tremendous amount of value on a number of fronts, including aligning technology needs with business objectives, highlighting critical pain points, and beginning a roadmap that helps implement data protection technology based on business needs and budget and puts you on the Road to Recovery.

The next post will cover the foundation of the triangle – Archive.


EMC World Kicks Off with Clouds and Virtualization

May 18th, 2009 Steve Kenniston No comments

EMC World kicked off this morning, first with a presentation from yours truly on Data Deduplication 2.0 – Comprehensive Capacity Optimization.  We discussed how data deduplication 1.0 is morphing into all areas of EMC’s storage ecosystem in order to optimize capacity everywhere.  I talked about how data deduplication, single instancing and compression are the technology components that will help EMC achieve this goal.

Next Joe Tucci spoke in his keynote about how data deduplication as well as compression are key technologies for the data center of the future and how these technologies will aid in delivering a more efficient cloud computing strategy.  Not only will these technologies help in building out a cloud infrastructure, they will also help to protect a cloud infrastructure (which is what we are all about here).

Finally, Paul Maritz gave his keynote on how the virtual infrastructure will help to fulfill the goals of a private cloud.  He also said that it is time to invest in software and people rather than hardware, as VMware continues to drive value into its software to help make your data center better, smarter, stronger and faster for less.

Each of these initiatives will have an impact on how data is stored and ultimately protected.  New storage services will enable more efficient storage and protection across the virtual data center and the cloud, and ultimately take backup beyond and put you on the road to recovery.

Stay tuned for more updates about the show.
