Posts Tagged ‘Disk Library’

Comprehensive Capacity Optimization – Deduplication 2.0

October 7th, 2009 Steve Kenniston

Technology is great, isn’t it? When someone thinks they have a new idea built on the same old technology foundation, they call it “X 2.0”. I have been watching the banter between analysts and vendors (specifically NTAP’s Dr. Dedupe and Permabit’s CEO Tom Cook) on the topic of Deduplication 2.0 for the past few weeks, and it is my belief that the proverbial boat is being missed (since we are using water analogies), so I decided I have to jump in. The real value of these conversations is their value to the end user. At the end of the day, it doesn’t really matter who ‘coined’ or ‘invented’ a term like deduplication 2.0. What matters is whether the term actually helps describe a technology and how that technology can be leveraged to make things better in the data center. We should focus on the implications of this new generation of deduplication – ‘deduplication 2.0’.

In May I delivered a presentation to a number of EMC customers on the topic of Data Deduplication 2.0 – Comprehensive Capacity Optimization. The point of my presentation was simple (and keep in mind this was before the Data Domain acquisition): a number of capacity optimization technologies and capabilities are available to customers today. Originally these deduplication technologies were used primarily for backup, but deduplication is slowly making its way into primary storage. Deduplication in primary storage makes a lot of sense FOR DATA THAT IS STATIC. Why only static data? Static data is data that isn’t used frequently (that doesn’t mean it’s not important, it simply is not accessed often). Because access to this data is infrequent, its performance requirements are lower than those of active data. Remember: nothing in IT is free. If I deduplicate data, I must ‘rehydrate’ it in order to use it, and that has a performance implication, so I want to be careful where I deduplicate data so as not to inhibit performance on production data.

Dr. Dedupe and Tom allude to Deduplication 2.0 moving beyond backup storage and into primary storage. While deduplication in primary storage is technically possible, customers need to understand two important points:

1) Performance: whatever I do to deduplicate (I prefer ‘optimize’) capacity in order to save space, I must ‘undo’ in order to use the data. If I set a policy that says any data that is 30 days old can be ‘optimized’, I need to be sure that data 30 days old is not active, or I could pay a substantial performance penalty when using this data. I might instead set a policy that says ‘any data that hasn’t been touched in 30 days can be optimized’. Even then, I would want to make sure there is no scenario where, at the end of a quarter let’s say, I would need to rehydrate all of that data in order to run some report.

2) Comprehensive and cumulative deduplication throughout my storage tiers. What do I mean? If I compress and single-instance (deduplicate) data on my primary storage using one set of technologies, say single instancing and compression algorithms, and then I back up this data using sub-file deduplication, a separate set of algorithms, I am left with two separate silos of deduplicated data, and no one wins in this scenario.

It is important, no matter what deduplication technology you decide to use, that you can actually leverage the data stored in the deduplication device and that as data moves from device to device it doesn’t need to be rehydrated before it is moved.
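To make the silo problem concrete, here is a minimal Python sketch. It assumes a primary tier that does whole-file single instancing plus compression and a backup tier that does fixed-size sub-file deduplication. The class names, the 4096-byte block size and the sample data are all hypothetical and do not represent any vendor's implementation. The thing to watch is how many bytes have to be rehydrated before the backup tier can do its own deduplication.

```python
import hashlib
import zlib


class SingleInstanceStore:
    """Hypothetical primary-tier store: whole-file single instancing plus compression."""

    def __init__(self):
        self.blobs = {}   # SHA-256 of file content -> zlib-compressed content
        self.names = {}   # file name -> SHA-256 of its content

    def put(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        # Identical files share one compressed instance.
        self.blobs.setdefault(digest, zlib.compress(content))
        self.names[name] = digest

    def rehydrate(self, name):
        # Undo the optimization: decompress back to the full original bytes.
        return zlib.decompress(self.blobs[self.names[name]])


class BlockDedupeStore:
    """Hypothetical backup-tier store: fixed-size sub-file block deduplication."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 of block -> raw block bytes

    def ingest(self, content):
        recipe = []
        for i in range(0, len(content), self.block_size):
            block = content[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe


def backup_all(primary, backup):
    """Back up every file; return (bytes rehydrated first, per-file block recipes)."""
    rehydrated = 0
    recipes = {}
    for name in primary.names:
        # The backup tier cannot read the primary tier's format, so expand first.
        raw = primary.rehydrate(name)
        rehydrated += len(raw)
        recipes[name] = backup.ingest(raw)
    return rehydrated, recipes


primary = SingleInstanceStore()
report = b"quarterly report " * 1000
primary.put("a.doc", report)
primary.put("copy_of_a.doc", report)   # stored only once on the primary tier

backup = BlockDedupeStore()
rehydrated_bytes, _ = backup_all(primary, backup)
print(len(primary.blobs), "instance(s) on primary;", rehydrated_bytes, "bytes rehydrated for backup")
```

Even though the primary tier holds a single compressed instance, every logical file is expanded in full before the backup tier can work on it. That expansion is the work a coordinated, cumulative approach would avoid.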

A great use case for capacity optimization in primary storage is how EMC evolved the Celerra product this year. Through a policy, let’s say one covering any data that is older than 30 days, data is compressed and stored as a single instance, with users seeing as much as 30% to 50% storage savings.
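An age-based policy like the ones described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Celerra or any other product selects data. The function name `select_idle_files` and its parameters are made up for the example, and it assumes the file system records last-access times reliably (many systems mount with options that update access times lazily or not at all).

```python
import os
import time


def select_idle_files(root, idle_days=30, now=None):
    """Return paths under root whose last access is older than idle_days."""
    now = time.time() if now is None else now
    cutoff = now - idle_days * 86400  # 86400 seconds per day
    idle = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # st_atime is the last access time reported by the file system.
            if os.stat(path).st_atime < cutoff:
                idle.append(path)
    return sorted(idle)
```

Calling `select_idle_files("/some/share", idle_days=30)` (the path is a placeholder) returns the candidates for optimization. Before acting on that list, it would still be worth checking it against known periodic jobs, such as the end-of-quarter report mentioned earlier.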

The real goal of Deduplication 2.0, and I think Dr. Dedupe alluded to this in his post “The Dedupe 2.0 Pundits Are Still Swimming in Lake 1.0”, is that customers win when deduplication technology is a part of the core system or file system, so that I no longer need to rehydrate data as I move it from primary storage to secondary storage. If each storage device in the ‘stack’ understands the language of the device ahead of it in the stack, and the ‘deduplication’ or file system is coordinated and cumulative from device to device, then the customer is the winner. This pertains to primary storage, backup storage and archive storage. Never having to rehydrate data allows for more efficiency and a reduced tax on devices, which can save the end user money.

Tom Cook, CEO of Permabit, points out in his blog post “Dedupe 1.0 vs. Dedupe 2.0: The debate ensues” that the only value of deduplication for primary storage is in moving your data to a deduplicated archive, which allows you to store data efficiently for the long term. I agree with that in principle but, as we have seen, it is not that practical. Why? Because at the end of the day, the costs to manage storage are going up, up, up and the costs to buy storage are going down, down, down. End users (NOT IT) are generally lazy or, should I really say, just too busy to manage this storage. In order to properly archive data, you need a policy that tells you what to move and when to move it. IT can make all the recommendations in the world about the value of archive, but if users, or really line-of-business managers, don’t tell IT what data is important and what can be archived, then IT doesn’t really have a choice, which makes the premise of moving data to an archive, deduplicated or not, moot.

The real issue is balancing capacity optimization (the granularity at which you deduplicate data) against performance on the appropriate tier of storage, given that deduplication will happen on all tiers. The higher the performance requirements (tier 1), the less ‘optimized’ I make the data; the lower the performance requirements (tier x, archive), the more optimized I make the data. The benefits to the customer are that A) data is optimized consistently across each of their devices, and B) the optimization is cumulative from device to device, removing silos of deduplicated data across the stack.
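The granularity trade-off can be illustrated with a rough Python sketch. It uses simple fixed-size block deduplication and a made-up data set, and the mapping of tiers to block sizes is purely an assumption for the example. Smaller blocks find at least as many duplicates (when the block sizes nest evenly, as they do here), but they also mean more blocks to look up and reassemble whenever the data is read.

```python
import hashlib


def block_dedupe_stats(data, block_size):
    """Fixed-size block dedupe: return (unique bytes stored, blocks to reassemble)."""
    unique = {}       # block hash -> block length
    block_count = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        unique.setdefault(hashlib.sha256(block).digest(), len(block))
        block_count += 1
    return sum(unique.values()), block_count


# Hypothetical data set: mostly repeated content plus a repeating byte pattern.
data = (b"A" * 8192 + b"B" * 4096) * 16 + bytes(range(256)) * 32

# Illustrative tier-to-granularity mapping (an assumption, not a recommendation).
for tier, block_size in [("tier 1", 65536), ("tier 2", 8192), ("archive", 1024)]:
    stored, blocks = block_dedupe_stats(data, block_size)
    print(f"{tier}: block_size={block_size} stored_bytes={stored} blocks_to_reassemble={blocks}")
```

The stored-bytes column shrinks as the block size shrinks, while the number of blocks to reassemble grows, which is the capacity versus performance balance in miniature.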

For more on tiered dedupe, read my Betamax Redux blog post on EMC’s vision for deduplication; hopefully it will put you on a high performance ‘Road to Recovery’.

Process vs. Technology

May 1st, 2009 Steve Kenniston

The hardest thing to change inside IT is not technology, it is process! I say this because all too often there are technologies available that provide a far superior solution to a complex IT problem, but the new technology may not fit into your existing business process. Need proof? Let’s take data protection as an example. Did you know that VTLs (virtual tape libraries) and data deduplication technologies came out at the exact same point in history, 10 years ago? Which technology had faster market adoption? VTLs, of course, because implementing them didn’t cause a major disruption in processes.

Let’s take a look at a simple backup environment.  We won’t worry about archiving or compliance for the moment, just operational backup and recovery.  Today’s backup has a number of complexities.  There are some data sets that have weekly full backups and daily incremental backups.  There are some data sets that sit under applications that, for faster recovery capabilities and simplicity, require daily full backups.  Once the backups are done, in order to ensure true data protection reliability, a process of checking the backup logs to ensure every system was successfully protected begins.  Next, backup tapes are either created (if it is a disk based backup) or tapes are taken from the library and moved to a transportable box, hopefully a secure box.  Finally, a third party vendor comes to pick up the tapes and take them off site for safe-keeping.  Additionally, if the data is backed up using encryption, then the encryption keys are also kept off site for security purposes.

Customers face these standard backup challenges:

1) Backups take too long and cannot meet backup windows as a result of too much data.

2) Backups fail due to poorly configured (networked) backup environments.

3) Backups at remote offices are ‘unreliable’. (They don’t follow the best practices set in the data center.)

a. No one with the appropriate skill set is available to monitor these backups.

b. No one with the appropriate skill set is available to troubleshoot these backups.

c. No one with the appropriate skill set is available to perform data recovery.

4) New applications and processes cause additional challenges: does this application need incremental backups or full backups, and what is the RPO / RTO?

5) Managing backup tapes is too difficult and costly.

However, the reality is that in this particular IT shop, no one has ever been fired for data loss. Each time there is a recovery request, data is recovered. It may not be the absolute most recent data, or it may take 48 hours to recover, but eventually the data is recovered. The question is, have everyone’s business objectives been met? Chances are the answer is “no”, but when the issue of what it would cost to meet everyone’s needs comes up, there is usually no money in the budget for ‘backup’ and it’s right back to the same old way of doing things. Backup is not really strategic to a business (unless of course you’re in the business of providing backup solutions to customers); it is more of an insurance policy. There is no doubt you need it, but you want it for the lowest possible price, hope you never have to call on it, and when you do, you had better get good service.

Maybe that is why EMC is now the GEICO of data protection.

That aside, when there is money in the budget, it usually comes in small doses, so backup administrators have to make the biggest impact in the ‘easiest’ way possible. This means implementing something that allows them to meet most of their challenges and doesn’t:

1) Change process because they already have run books established for data recovery and because everyone is already trained on the existing technology.

2) Change configuration because they have already invested a great deal of time and money to sort out their issues with the existing products.

3) Cost a lot of money

That usually means augmenting the existing backup software with something that gains some efficiencies on the back end, because they already have significant investments in their backup software. This was one of the main reasons for the success of VTLs (virtual tape libraries). It is far easier to unplug the slow, serial tape library and replace it with fast, parallel disk. The backup administrator gets all the advantages of disk and doesn’t have to change a single process, except maybe adding a step to clone the data from the disk that looks exactly like tape to an actual tape in order to get the data offsite. This is also why target deduplication devices became so popular so quickly. When VTLs had trouble solving backup capacity issues, deduplication became the next popular thing. The key was plugging into the existing infrastructure without disruption. If I have to change too much about my process, I can’t ‘afford’ to make it work.

The trouble is that backup administrators are at an inflection point. They can no longer continue to use the same old technology at the front of the backup process and still meet the needs of the business. We are at a time when new technologies such as source based deduplication can have a significant impact on a number of the backup challenges. The problem is that source based deduplication runs straight into the reason IT doesn’t want to change technology: it forces a change to the process. For example, out come the traditional backup agents and new ones are put into place. Since data is no longer stored in tape format, new processes must be used for getting data offsite. When backup administrators hear this, they tend to shy away from it. It costs money, and it changes processes right when they had all the original processes figured out. It is only now that source based deduplication solutions have gained significant momentum, because they really are solving a number of the key data protection challenges for more than 70% of the data in most data centers.

  • Remote offices can now experience the same set of data protection best practices that are used in the data center. (Keep in mind that IT is accountable for 100% of the data created in the corporation, local or remote. This is good peace of mind.)

  • VMware environments tend to ruin TCO when using traditional backup applications. Leveraging source based deduplication can improve your TCO and ROI.

This is not to say that source based deduplication is the savior of the backup world. It is not. There are places where source based deduplication technologies are not the best fit. Very large environments with very high change rates and little duplicate data don’t tend to be good fits. However, if you attack the places that are a good fit for source based deduplication, you will create relief in your backup environment at the target and that will be good for everyone.  It is time to take backup, beyond.
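For readers who have not seen source based deduplication up close, here is a minimal sketch of the idea in Python: the client hashes its data, asks the backup target which blocks it is missing, and sends only those. The `BackupServer` class, the `client_backup` function and the fixed 4096-byte block size are assumptions made for illustration, and real products differ in chunking, protocol and storage details.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size


class BackupServer:
    """Hypothetical backup target that stores each unique block once."""

    def __init__(self):
        self.blocks = {}    # block hash -> block bytes
        self.backups = {}   # backup name -> ordered list of block hashes

    def missing(self, hashes):
        return {h for h in hashes if h not in self.blocks}

    def store(self, name, recipe, new_blocks):
        self.blocks.update(new_blocks)
        self.backups[name] = recipe

    def restore(self, name):
        return b"".join(self.blocks[h] for h in self.backups[name])


def client_backup(server, name, data):
    """Hash blocks at the source and send only the blocks the server lacks.

    Returns the number of payload bytes sent over the (simulated) network.
    """
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    recipe = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = server.missing(recipe)
    new_blocks = {h: b for h, b in zip(recipe, blocks) if h in needed}
    server.store(name, recipe, new_blocks)
    return sum(len(b) for b in new_blocks.values())


server = BackupServer()
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))   # four distinct blocks
tuesday = monday[:BLOCK_SIZE] + b"\xff" * BLOCK_SIZE + monday[2 * BLOCK_SIZE:]  # one block changed

sent_monday = client_backup(server, "monday", monday)
sent_tuesday = client_backup(server, "tuesday", tuesday)
print("bytes sent Monday:", sent_monday, "bytes sent Tuesday:", sent_tuesday)
```

The second backup is still a logical full backup that can be restored on its own, yet only the changed block crosses the wire. That is why the approach suits remote offices and low change rate data, and why it helps less when nearly everything changes between backups.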

Posted by Steve Kenniston

Road to Recovery

February 14th, 2009 Steve Kenniston

Our domain, Backup & Beyond, was the tagline for Avamar Technologies, a company EMC acquired in November of 2006. This tagline was very fitting from a data protection standpoint because Avamar utilized a traditional client / server architecture to protect data, but with a twist. Avamar utilizes a more intelligent client side agent that provides source based, variable block deduplication to enable the most efficient backups available in the market for more than 80% of a data center’s data. Avamar also leverages this same technology to replicate this data between disk based backup targets, thereby dramatically reducing the reliance on tape. This new technology, which has enabled new processes, is taking backup beyond.
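“Variable block” deduplication generally means that block boundaries are chosen from the content itself rather than at fixed offsets, so inserting a byte does not shift every block that follows. The Python sketch below shows one simple way to do that. It is not Avamar's algorithm, and the `WINDOW` and `MASK` values and the use of CRC-32 are assumptions chosen to keep the example short.

```python
import random
import zlib

WINDOW = 16          # bytes of trailing context that decide a boundary (illustrative)
MASK = (1 << 6) - 1  # cut when the low 6 hash bits are zero, roughly every 64 bytes


def variable_chunks(data):
    """Split data wherever a hash of the trailing WINDOW bytes matches a pattern."""
    chunks = []
    start = 0
    for end in range(WINDOW, len(data) + 1):
        # Recomputed at each position for clarity; production code would use a rolling hash.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # whatever is left after the last boundary
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"!" + original   # one byte inserted at the very front

before = set(variable_chunks(original))
after = variable_chunks(shifted)
new_chunks = [c for c in after if c not in before]
print(f"{len(after)} chunks after the insert, {len(new_chunks)} of them new")
```

Because each boundary depends only on the bytes immediately before it, the boundaries move along with the data. After the one-byte insert only the chunk or two at the front are new, and everything after that deduplicates against the earlier backup. With fixed-size blocks, the same insert would shift the contents of every block.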

The title of our blog, Road to Recovery – well, like every good title it is a play on words and trust me, as with every title, it took us a while to come up with it. That said, the industry has been talking about the fact that backup is really about recovery. The same can be said for other data protection tools. This is why our goal is to talk about methodologies (technologies and processes) that help you to recover data. When IT professionals are polled, they often say that data protection (backup) is still the number one issue they have in the data center. We say it is time to step up and admit it and start the ‘Road to Recovery’ when it comes to your data protection environment.

Let us know what your challenges are. We are here for you as your support system, and we welcome your comments and questions.
