Posts Tagged ‘Replication’

How Much Backup Capacity Does Deduplication Really Save?

November 30th, 2009 Steve Kenniston No comments

There is a lot of discussion around data deduplication for backup these days.  (I wish I could deduplicate all the turkey I ate last week.)  In fact, Gartner claims that “…by 2012, deduplication will be applied to 75% of backups.”  And when asked “Why?” the response was “…deduplication is too compelling to ignore.”  But I say “prove it”.  So I put together some backup capacity numbers for storing data on tape (non-compressed and compressed) versus storing data, deduplicated (fixed block and variable block), on disk and the numbers show a dramatic savings in backup space which translates into cost savings.

The Parameters

As with any ‘analysis’, numbers can be ‘spun’ to make them say what you want.  That said, I tried to be as straightforward as possible, so let me show my methodology so you can see how my numbers were derived.

  • I charted the amount of capacity created using a retention policy of:
    • 14 Dailies
    • 4 Weeklies
    • 12 Monthlies
  • I selected 10TB of primary storage capacity
  • I did this for file system backups only
  • I charted the data for 30%, 40%, 50% and 60% primary storage growth rates
  • I charted traditional tape based backup (non-compressed)
  • I charted traditional tape based backup (compressed, 2:1)
  • I charted fixed block disk based deduplicated backup
  • I charted variable block disk based deduplicated backup (3 to 5 times more efficient than fixed block deduplication)

The Effect

The first thing to think about is the sheer number of full backup copies that must be maintained under the above retention schedule.  The policy leads to 17.2 copies of the primary storage (12 monthlies + 4 weeklies + the equivalent of 1.2 from the dailies = 17.2 copies).  Translation: one terabyte of primary storage becomes 17.2 terabytes of tape storage.  This means backup administrators need to pay for the physical tapes as well as the offsite transport and storage costs.  Now 17.2 terabytes of tape doesn’t sound like much, but keep in mind that is for 1TB of primary capacity.  Ten TB of primary capacity yields 172 TB of tape capacity.  Now add in year-over-year storage growth.  At 30% primary storage growth, backup storage grows 23%; at 40%, it grows 29%; at 50%, it grows 33%; and at 60%, it grows 38%.
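The copy arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the 1.2 figure for the dailies is taken from the post as given, since its derivation is not spelled out.

```python
def full_copy_equivalents(monthlies: int = 12, weeklies: int = 4,
                          dailies_equivalent: float = 1.2) -> float:
    """Full-copy equivalents retained under the policy in 'The Parameters'.

    dailies_equivalent is the post's figure for what the 14 dailies add up to;
    it is an input here, not something this sketch derives.
    """
    return monthlies + weeklies + dailies_equivalent


def tape_capacity_tb(primary_tb: float, copies: float,
                     compression_ratio: float = 1.0) -> float:
    """Tape capacity (TB) needed to hold `copies` full copies of the primary data."""
    return primary_tb * copies / compression_ratio


copies = full_copy_equivalents()
print(f"{copies:.1f} full-copy equivalents")
print(f"{tape_capacity_tb(10, copies):.0f} TB of uncompressed tape for 10 TB primary")
print(f"{tape_capacity_tb(10, copies, compression_ratio=2.0):.0f} TB of tape at 2:1 compression")
```

Changing the retention counts or the compression ratio shows quickly how sensitive the tape footprint is to the policy itself.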

Figure 1 below shows 10 TB of primary capacity growing at 30%, 40%, 50% and 60% along the x-axis, and the corresponding capacity of tape or disk consumed along the y-axis.

Figure 1

The graph shows that compressed backup to tape obviously yields a 50% capacity improvement over non-compressed tape as one would expect. It also reflects that fixed block deduplicated disk capacity is only about 48% more efficient than uncompressed tape storage yet variable block deduplication is 81% more storage efficient than uncompressed tape storage.

Interesting as well, the chart reveals that fixed block deduplication is 3% less efficient than compressed tape whereas variable block deduplication is 62% more efficient than compressed tape. Typically, with the same data change rates, and equivalent data sets, variable block deduplication is 3 to 5 times more efficient than fixed block deduplication.
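For readers who want to check how the "versus compressed tape" figures follow from the "versus uncompressed tape" ones, here is a rough sketch. It assumes every capacity is normalized to uncompressed tape = 1.0 and uses the rounded percentages quoted above; the helper name is hypothetical.

```python
def savings_vs(baseline_capacity: float, capacity: float) -> float:
    """Fractional capacity saved relative to a baseline (negative = uses more)."""
    return 1.0 - capacity / baseline_capacity


# Capacities normalized so that uncompressed tape = 1.0 (assumption of this sketch).
compressed_tape = 0.50        # 2:1 compression
fixed_block = 1.0 - 0.48      # "about 48% more efficient" than uncompressed tape
variable_block = 1.0 - 0.81   # "81% more storage efficient" than uncompressed tape

for name, capacity in [("fixed block", fixed_block), ("variable block", variable_block)]:
    print(f"{name}: {savings_vs(compressed_tape, capacity):+.0%} vs compressed tape")
```

With these rounded inputs, variable block comes out at 62% less capacity than compressed tape, matching the chart. Fixed block comes out slightly worse than compressed tape by a few percent; the exact figure depends on how the "about 48%" is rounded.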

The moral of the story: if you’re going to do deduplication, variable block is the way to go. From a cost perspective, there is essentially no difference in the $/TB price; however, there is much more value in the long run with variable block deduplication. Vendors typically charge a $/TB price for their deduplication solutions. The difference between fixed and variable block deduplication comes down to the capacity of data that is stored in the backups, which directly translates into costs. If you take a look at Figure 2, starting with 1TB of primary capacity growing at 25% over the course of one year, IT will need almost 2TB of backup capacity with fixed block deduplication versus less than 1TB using variable block deduplication (this assumes fixed block is 5x less efficient, based on empirical data collected in the field). The most important part of this graph is the slope of the blue and red lines. The steeper the slope (red line), the more frequently IT will need to purchase capacity to protect the given data set, and the more it will pay for deduplication software licensing. IT wants the smaller slope.

Figure 2

*Note: Some companies will position their fixed block technologies as variable block by stating that you (the user) have the ability to set the block size to whatever you want; however, once set, it stays that way for all of your data.  The difference is that true variable block technologies adjust the block size on the fly, using their algorithms to ensure maximum efficiency with no management.
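To make the fixed versus variable distinction concrete, here is a toy sketch of the general idea behind content-defined (variable block) chunking. It is not any vendor's algorithm: the window size, the boundary mask, the use of CRC32 as the boundary checksum and all function names are assumptions chosen for illustration. It backs up a block of random data, inserts a single byte at the front, and checks how many chunks of the edited version are already stored.

```python
import hashlib
import random
import zlib
from typing import List


def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Split at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> List[bytes]:
    """Split wherever a checksum of the preceding `window` bytes hits a bit pattern.

    Boundaries depend only on nearby content, not on absolute offsets.
    The checksum is recomputed at every position for clarity; a real
    implementation would use a rolling hash instead.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared_fraction(old: List[bytes], new: List[bytes]) -> float:
    """Fraction of the new chunks whose fingerprint is already in the store."""
    stored = {hashlib.sha256(chunk).digest() for chunk in old}
    hits = sum(hashlib.sha256(chunk).digest() in stored for chunk in new)
    return hits / len(new)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"X" + original  # a single byte inserted at the front

print("fixed   :", shared_fraction(fixed_chunks(original), fixed_chunks(edited)))
print("variable:", shared_fraction(variable_chunks(original), variable_chunks(edited)))
```

The one-byte insert shifts every fixed block, so none of them match what was stored before. The content-defined boundaries move along with the data, so nearly every chunk after the first is found again. That shifting problem is the main reason a user-selected but fixed block size does not behave like true variable block deduplication.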

Bang for the Buck

The most important benefit, as with most things in IT, is overall cost savings. Deduplicated disk solutions are anywhere from 2.5X to 3X more expensive than tape; however, with the overall capacity savings, there can be significant cost savings. Figure 3 is representative of the overall costs of new deduplicating disk systems and traditional tape backup systems (including tapes and off-site storage costs). I will caveat this by saying every TCO and ROI has a ton of ‘what ifs’ that factor into overall costs, including things like FTE for backup engineers and long term retention costs, but for the most part, disk systems reduce a good deal of these costs (with the exception of power and cooling) and increase the reliability, security and performance of backups and recoveries.

Figure 3

1 The chart above is based on a rough cost of $8,000 per terabyte of tape backup system costs (including media and off-site storage) and rough cost of $20,000 per terabyte of deduplicated disk backup system costs for the period of one year.  Prices will vary depending upon your configuration and these estimates do not include space, power, cooling or human costs.
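As a back-of-the-envelope check on those rough prices, the sketch below combines them with the 17.2-copy retention figure and the capacity reductions quoted earlier. Combining the numbers this way is an assumption of this example and will not necessarily reproduce Figure 3; the function and constant names are hypothetical.

```python
TAPE_COST_PER_TB = 8_000           # rough figure from the footnote
DEDUPE_DISK_COST_PER_TB = 20_000   # rough figure from the footnote
FULL_COPIES = 17.2                 # retention policy result from "The Effect"


def backup_cost(primary_tb: float, cost_per_tb: float,
                reduction_vs_uncompressed_tape: float = 0.0) -> float:
    """One-year cost for the capacity left after compression or deduplication."""
    stored_tb = primary_tb * FULL_COPIES * (1.0 - reduction_vs_uncompressed_tape)
    return stored_tb * cost_per_tb


scenarios = {
    "tape, uncompressed":    backup_cost(10, TAPE_COST_PER_TB),
    "tape, compressed 2:1":  backup_cost(10, TAPE_COST_PER_TB, 0.50),
    "disk, fixed block":     backup_cost(10, DEDUPE_DISK_COST_PER_TB, 0.48),
    "disk, variable block":  backup_cost(10, DEDUPE_DISK_COST_PER_TB, 0.81),
}
for name, cost in scenarios.items():
    print(f"{name:22s} ${cost:,.0f}")
```

Under these assumptions, variable block disk comes in below even compressed tape, while fixed block disk does not, because the 2.5X price premium per terabyte outweighs its smaller capacity savings.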

As I stated above there are only a few factors that are involved in this very raw calculation.  There are a number of other factors involved with a backup process including WAN costs (if replacing tape with disk), remote office facilities, installation (professional services), and software and hardware maintenance to name a few.  But no matter how you look at it, disk based backup with variable block deduplication wins over tape.

Backing data up to deduplicated disk not only reduces the amount of backup capacity that is used, it also has other implications for a data protection environment.  First, backing up to disk versus backing up to tape helps to reduce the reliance on tape and the inherent limitations, security concerns and reliability issues surrounding tape.  Recovery of data from disk reduces operational costs and shortens recovery times, making a tighter recovery time objective achievable.  Additionally, the reliability of disk with RAID is much higher than the reliability of tape.

New data protection technologies are evolving backup to a degree where the entire data protection process is getting easier to manage by removing multiple points of management (backup servers, media servers, tape libraries and physical tape).  As backup continues to evolve, this can help simplify the overall process and:

  • Increase reliability of backups
  • Increase reliability of recoveries
  • Decrease backup times
  • Decrease the time to recover data

The Bottom Line

New challenges in protecting information are arising every day. Whether it is data growth, remote office data protection or virtualization, backup is getting harder, not easier.  Data deduplication is providing backup administrators with tremendous benefits around backup processes and cost savings.  It is important to keep in mind that everybody’s environment is different and utilizes different methods and processes for managing and protecting information.  It is also important to take a look at your data protection environment today and understand the use cases where it is time to make new investments.  I encourage you to look at new technologies to help you with emerging challenges and weigh the overall solution, including costs as well as the benefits of disk based recovery.  New backup technologies that leverage data deduplication can save IT a lot of money and put you back on the Road to Recovery.

Comprehensive Capacity Optimization – Deduplication 2.0

October 7th, 2009 Steve Kenniston No comments

Technology is great, isn’t it?  When someone thinks they have a new idea on the same old technology foundation they call it “X 2.0”.  I have been watching the banter between analysts and vendors (specifically NTAP’s Dr. Dedupe and Permabit’s CEO Tom Cook) on the topic of Deduplication 2.0 and it is my belief that the proverbial boat is being missed (since we are using water analogies).  I have been watching these guys hash it out for the past few weeks and decided I have to jump in.  I find the real value of these conversations is the value to the end user.  At the end of the day, it doesn’t really matter who ‘coined’ or ‘invented’ a term (like deduplication 2.0); what does matter is whether the term actually helps describe a technology and how that technology can be leveraged to make things better in the data center.  We should focus on the implications of this new generation of deduplication, ‘deduplication 2.0’.

In May I delivered a presentation to a number of EMC customers on the topic of Data Deduplication 2.0 – Comprehensive Capacity Optimization.  The point of my presentation was simple (and keep in mind this was before the Data Domain acquisition): there are a number of capacity optimization technologies/capabilities that are available to customers today.  Originally these deduplication technologies were used primarily for backup purposes, but slowly deduplication is making its way into primary storage. Deduplication in primary storage makes a lot of sense FOR DATA THAT IS STATIC.  Why only static data?  Static data is data that isn’t used frequently (that doesn’t mean it’s not important, it simply is not accessed often); because access to this data is infrequent, the performance requirements for this data are lower than those for active data. Remember: nothing in IT is free.  If I deduplicate data, in order to use it I must ‘rehydrate’ it, and thus there is a performance implication, so I want to be careful where I deduplicate data so as not to inhibit performance on production data.

Dr. Dedupe and Tom allude to Deduplication 2.0 moving beyond backup storage and into primary storage.  While deduplication in primary storage is technically possible, it is important that customers understand two important points:

1) Performance: whatever I do to deduplicate (I prefer ‘optimize’) capacity in order to save space, I must ‘undo’ in order to use the data.  If I set a policy that says any data that is 30 days old can be ‘optimized’, I need to be sure that data 30 days old is not active, or I could pay a substantial performance penalty when using this data.  I may instead set a policy of ‘any data that hasn’t been touched in 30 days can be optimized’.  I would just want to make sure that there is no scenario where, at the end of a quarter let’s say, I would need to rehydrate all data in order to run some report.

2) Comprehensive and cumulative deduplication throughout my storage tiers.  What do I mean?  If I compress and single instance (deduplicate) data on my primary storage utilizing one set of deduplication technologies, say single instancing and compression algorithms, and then I back up this data using sub-file deduplication, a separate set of algorithms, then what I am left with are two separate silos of deduplicated data, and no one wins in this scenario.
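The ‘not touched in 30 days’ policy from point 1 can be sketched as a simple filter on last-access times. This is only an illustration of the policy logic; the function name, the file names and the idea of feeding it a dictionary of timestamps are assumptions, not how any particular product implements it.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def eligible_for_optimization(last_access: Dict[str, datetime],
                              now: datetime,
                              idle_days: int = 30) -> List[str]:
    """Names of items that have not been accessed in the last `idle_days` days."""
    cutoff = now - timedelta(days=idle_days)
    return sorted(name for name, accessed in last_access.items() if accessed < cutoff)


# Hypothetical catalog: file name -> time of last access.
catalog = {
    "q3_report.db": datetime(2009, 9, 30),      # read a week ago, stays hydrated
    "old_scan.tif": datetime(2009, 1, 1),       # idle for months
    "2008_invoices.zip": datetime(2009, 6, 15),
}
print(eligible_for_optimization(catalog, now=datetime(2009, 10, 7)))
```

Keying the policy on last access rather than on age is what protects the end-of-quarter case: data that gets read again drops out of the eligible set until it has been idle for another 30 days.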

It is important, no matter what deduplication technology you decide to use, that you can actually leverage the data stored in the deduplication device and that as data moves from device to device it doesn’t need to be rehydrated before it is moved.

A great use case of capacity optimization in primary storage is how EMC evolved the Celerra product this year.  Through a policy (let’s say any data that is older than 30 days), data is compressed and stored as a single instance, with users seeing as much as 30% to 50% storage savings.

The real goal of Deduplication 2.0, and I think Dr. Dedupe alluded to this in his post “The Dedupe 2.0 Pundits Are Still Swimming in Lake 1.0”, is that customers win when deduplication technology is a part of the core system or file system, when I no longer need to rehydrate data as I move it from primary storage to secondary storage.  If each storage device in the ‘stack’ understands the language of the device in the stack ahead of it, and the ‘deduplication’ or file system is coordinated and cumulative from device to device, then the customer is the winner.  This pertains to primary storage, backup storage and archive storage.  Never having to rehydrate data allows for more efficiency and a reduced tax on devices, which can save the end user money.

Tom Cook, CEO of Permabit points out in his blog post “Dedupe 1.0 vs. Dedupe 2.0: The debate ensues” that the only value to deduplication for primary storage is to move your data to a deduplicated archive which allows you to store data, efficiently, long term which I agree with, but as we have seen, not that practical.  Why? Because at the end of the day, the costs to manage storage are going up, up, up and the costs to buy storage are going down, down, down.  End users (NOT IT) are generally lazy or should I really say, just too busy to manage this storage.  In order to properly archive data, you need to have a policy that tells you what to move and when to move it.  IT can make all the recommendations in the world about the value of archive, but if users or really, lines of business managers don’t tell IT what data is important and what can be archived, then IT doesn’t really have a choice, which makes the premise of moving data to an archive, deduplicated or not – moot.

The real issue is balancing capacity optimization (to what granularity you deduplicate data) against performance on the appropriate tier of data, given that deduplication will happen on all tiers of storage.  The higher the performance requirements (tier 1) the less ‘optimized’ I make the data, the lower the performance requirements (tier x, archive) the more optimized I make the data.  The benefits to the customer are that I can A) optimize data, consistently among each of its devices, and B) it can be cumulative from device to device, removing silos of deduplicated data across the stack.

For more on tiered dedupe, read my Betamax Redux blog post on EMC’s vision for deduplication and hopefully this will put you on a high performance ‘Road to Recovery’.

A Data Protection Reference Architecture – The Final Chapter

September 1st, 2009 Steve Kenniston 2 comments

The Architecture

This ‘architecture’ diagram, as you can see, is not a typical architecture diagram, but hopefully it can be used to align your business and business objectives with the technologies that are available and can best be applied to solve your issues helping to balance, cost, complexity and compliance.

This diagram can also be used to do a couple of other things.  It can help you begin to classify your data and align your data to your business objectives.  It also lets you begin to identify which data or data services in your environment may be more important to you than others and, based on this, helps you choose areas you may want to outsource or move to the cloud.

As you can tell, there really is not one solution for meeting all your data protection needs.  The challenge comes with managing multiple solutions in an effort to meet your business objectives.  While there are only a few technologies available that allow you to manage your environment across all your RPOs and RTOs, it is important that I point out EMC’s NetWorker is able to do this, centralizing your data protection infrastructure  for ease of management.  It allows you to manage traditional backup, source based deduplicated backup with Avamar, CDP with RecoverPoint, as well as the EMC disk libraries and tape where the data is stored.  Now, I am not saying that NetWorker solves all of your data protection challenges, nor am I suggesting that replacing one traditional backup technology for another is the right answer, but what I am saying is that if you’re looking to have all the feature functionality required to meet all your business objectives and you want easier management, NetWorker is one avenue to get you there.  Additionally, the underlying image of the triangle represents data protection management.  Putting all the new technology in place is one thing, managing it, and ensuring you are now meeting your business needs is another.  EMC’s Data Protection Advisor can help here as well.

This diagram can help customers lay out a new, better data protection schema for their environment and start thinking about data protection a bit more strategically versus tactically.  It can also help vendors speak to customers about how they should look at their environment in order to identify specific challenges and the means they need to alleviate these challenges, taking backup beyond.

A Data Protection Reference Architecture – Part 4

August 27th, 2009 Steve Kenniston No comments

Business Critical Applications

The tip of the triangle focuses on the applications (or data) that drive your business.  It is these applications within your business that, should they go down for any length of time, cost you money.  The recovery of this information, in the event of a ‘disaster’, needs to be very fast (RTO in minutes) and the data can’t be very ‘old’ when it is recovered (short RPO, less than 24 hours).  Typically, the technologies that are used for these types of applications are replication (synchronous or asynchronous) or continuous data protection (CDP).  These technologies ensure that recovery at the alternate location is instant (or near instant) and/or give users the ability to pick the point in time they want to recover to, in order to ensure no data loss and to bring up the applications as fast and accurately as possible.  This category, much like the rest of them, has the same disclaimer: ‘one size (product) does not fit all’.  The value of the data in this tier, and the risk to the business if this data is unavailable, drive the technology and spend in this part of the triangle.  Keep in mind, the right technology (don’t choose CDP if you need an active remote file system) gives you the best recovery (RPO) for your business needs and can keep you on the Road to Recovery.

A Data Protection Reference Architecture – Part 1

August 14th, 2009 Steve Kenniston No comments

This blog will have multiple parts.  I will introduce my view of a data protection reference architecture and the next few blog posts will talk to components of that architecture.

The other day I had a very interesting conversation with a colleague of mine in Australia.  He was looking for a data protection reference architecture that he could use to speak to his customer.  As you can imagine, having this conversation over the phone could prove to be a difficult challenge.  When the conversation began, my fear was he was looking for an ‘architecture’ diagram that included data protection appliances, backup servers, disk libraries, tape libraries and backup agents.  I quickly realized that this is an impossible conversation to have with him without knowing:

A)     the customer’s environment or challenges

B)      the customer’s business objectives

I find that most vendors don’t know A or B when speaking to a customer about their data protection ‘issues’, but they really should.  Having a more thoughtful conversation with customers in a consultative fashion is more relevant to customers in understanding their challenges and helping to align these challenges to the best possible solution.

I started my conversation with the diagram shown below (Figure 1): a simple triangle divided horizontally into 4 segments, with the middle two segments divided vertically in half.  Each segment represents different business objectives within a company.  As you go around the triangle, you can see that there are different technologies and different methodologies for attacking data protection challenges, which is why there is no longer a “one size fits all” approach when it comes to protecting data today. Let’s face it: the two most important commodities in backup are time and capacity.  One of the primary drivers behind the type of protection that is used is the Recovery Point Objective or RPO.  Different technologies provide different RPOs, each has a different price point, and there are also different processes that can be applied to meet RPOs.

Figure 1

Having a conversation specific to this diagram can have a tremendous amount of value on a number of fronts, including aligning technology needs with business objectives, highlighting critical pain points, and beginning a roadmap that helps implement data protection technology based on business needs and budget and puts you on the Road to Recovery.

The next post will cover the foundation of the triangle – Archive.

Road to ‘Data’ Recovery – 12 Steps

March 25th, 2009 Steve Kenniston 2 comments

Hi, my name is Steve and I have a recovery problem.  Well, a data recovery problem that is.  So, I think it is about time that I apply the ‘12 steps’ to help me with my data recovery problem.

Step 1 – It is time that I admit that I am powerless over my backup environment and my data protection world is unmanageable.

Step 2 – I have come to believe that there is a Technology greater than I that can help me restore (my sanity).

Step 3 – I have made a decision to put our company’s data and the process of recovery into the hands of a true data protection specialist.

Step 4 – I have helped to create a classified inventory of our company’s data.

Step 5 – I will admit to our CEO that I have failed at 63% of my recovery attempts costing the business $MMs.

Step 6 – I am prepared to have the new data protection administrator remove all of my defective technologies.

Step 7 – I will humbly ask ‘him’ to remove all of my failed processes.

Step 8 – I must make a list of all the people I have been unable to recover data for and be willing to try to restore their lost information.

Step 9 – I must make amends to all the people I have been unable to recover data for.

Step 10 – I must continue to take an inventory of all the tapes we have and promptly convert them to a newer technology to enable faster recovery.

Step 11 – I will seek out best-of-breed technology, partners and vendors to improve our company’s capabilities for daily operational recovery.

Step 12 – Having had this spiritual awakening as a result of these steps, I will carry this message to all IT administrators who are challenged with data recovery issues.

I believe that by following these 12 steps, I will have put our company back on… the Road to ‘Data’ Recovery.

Posted by Steve Kenniston

Road to Recovery

February 14th, 2009 Steve Kenniston No comments

Our domain, Backup & Beyond, was the tagline for Avamar Technologies, a company EMC acquired in November of 2006.  This tagline was very fitting from a data protection standpoint because Avamar utilized a traditional client / server architecture to protect data, but with a twist.  Avamar utilizes a more intelligent client side agent that provides source based, variable block deduplication to enable the most efficient backups available in the market for more than 80% of a data center’s data.  Avamar also leverages this same technology to replicate this data between disk based backup targets, thereby dramatically reducing the reliance on tape.  This new technology, which has enabled new processes, is taking backup beyond.

The title of our blog, Road to Recovery: well, like every good title it is a play on words and trust me, as with every title, it took us a while to come up with it.  That said, the industry has been talking about the fact that backup is really about recovery.  The same can be said for other data protection tools.  This is why our goal is to talk about methodologies (technologies and processes) that help you to recover data.  When IT professionals are polled, they often say that data protection (backup) is still the number one issue they have in the data center.  We say it is time to step up, admit it, and start the ‘Road to Recovery’ when it comes to your data protection environment.

Let us know what your challenges are. We are here for you as your support system, and we welcome your comments and questions.
