
Posts Tagged ‘Archive’

Comprehensive Capacity Optimization – Deduplication 2.0

October 7th, 2009 Steve Kenniston No comments

Technology is great, isn’t it?  When someone thinks they have a new idea built on the same old technology foundation, they call it “X 2.0”.  I have been watching the banter between analysts and vendors (specifically NTAP’s Dr. Dedupe and Permabit’s CEO Tom Cook) on the topic of Deduplication 2.0 for the past few weeks, and I believe the proverbial boat is being missed (since we are using water analogies), so I decided I had to jump in.  The real value of these conversations is their value to the end user.  At the end of the day, it doesn’t really matter who ‘coined’ or ‘invented’ a term like deduplication 2.0.  What matters is whether the term actually helps describe a technology and how that technology can be leveraged to make things better in the data center.  We should focus on the implications of this new generation of deduplication: ‘deduplication 2.0’.

In May I delivered a presentation to a number of EMC customers on the topic of Data Deduplication 2.0 – Comprehensive Capacity Optimization.  The point of my presentation was simple (and keep in mind this was before the Data Domain acquisition): there are a number of capacity optimization technologies and capabilities available to customers today.  Originally these deduplication technologies were used primarily for backup, but deduplication is slowly making its way into primary storage.  Deduplication in primary storage makes a lot of sense FOR DATA THAT IS STATIC.  Why only static data?  Static data is data that isn’t used frequently (that doesn’t mean it’s not important, it simply is not accessed often); because access to this data is infrequent, its performance requirements are lower than those of active data.  Remember: nothing in IT is free.  If I deduplicate data, I must ‘rehydrate’ it in order to use it, and that has a performance cost, so I want to be careful where I deduplicate data so as not to inhibit performance on production data.

Dr. Dedupe and Tom allude to Deduplication 2.0 moving beyond backup storage and into primary storage.  While deduplication in primary storage is technically possible, it is important that customers understand two important points:

1) Performance: whatever I do to deduplicate (I prefer ‘optimize’) capacity in order to save space, I must ‘undo’ in order to use the data.  If I set a policy that says any data that is 30 days old can be ‘optimized’, I need to be sure that data 30 days old is not active, or I could pay a substantial performance penalty when using it.  I might instead set a policy that says ‘any data that hasn’t been touched in 30 days can be optimized’.  Even then, I would want to make sure there is no scenario where, at the end of a quarter let’s say, I would need to rehydrate all of that data in order to run some report.

2) Comprehensive and cumulative deduplication throughout my storage tiers.  What do I mean?  If I deduplicate data on my primary storage using one set of technologies, say single instancing and compression algorithms, and then I back up this data using sub-file deduplication, a separate set of algorithms, I am left with two separate silos of deduplicated data, and no one wins in this scenario.
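
To make the policy in point 1 concrete, here is a minimal Python sketch of the ‘hasn’t been touched in 30 days’ rule.  It is only an illustration: the function name, the `idle_days` parameter and the sample paths are my own assumptions, not part of any EMC product.

```python
from datetime import datetime, timedelta

def select_for_optimization(last_access_by_path, now, idle_days=30):
    """Return the paths whose last access is at least `idle_days` old.

    `last_access_by_path` maps a path to the datetime it was last read or
    written.  Everything here is hypothetical and only illustrates the rule.
    """
    cutoff = now - timedelta(days=idle_days)
    return sorted(path for path, last_access in last_access_by_path.items()
                  if last_access <= cutoff)

# Example: only the file that has been idle for more than 30 days qualifies.
now = datetime(2009, 10, 7)
files = {
    "/finance/q2-report.xls": datetime(2009, 8, 1),
    "/finance/q3-draft.xls": datetime(2009, 10, 1),
}
candidates = select_for_optimization(files, now)
```

A rule keyed on last access rather than creation date is what keeps active data out of the optimized set; the quarter-end reporting case would still need its own exception on top of this.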

It is important, no matter what deduplication technology you decide to use, that you can actually leverage the data stored in the deduplication device and that as data moves from device to device it doesn’t need to be rehydrated before it is moved.

A great use case of capacity optimization in primary storage is how EMC evolved the Celerra product this year.  Through a policy, let’s say one that covers any data older than 30 days, data is compressed and stored as a single instance, with users seeing as much as 30% to 50% storage savings.
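
For readers who want to see what ‘compressed and stored as a single instance’ means mechanically, here is a small Python sketch.  It is not how Celerra implements it; the class name and the choice of SHA-256 and zlib are assumptions made purely for illustration.

```python
import hashlib
import zlib

class SingleInstanceStore:
    """Toy file-level single-instance store with compression (illustrative only)."""

    def __init__(self):
        self._blobs = {}   # content digest -> compressed bytes, stored once
        self._files = {}   # file name -> content digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            # First time we see this content: compress and keep one copy.
            self._blobs[digest] = zlib.compress(data)
        self._files[name] = digest

    def get(self, name):
        # Reading the file back means 'rehydrating' it: decompress the shared copy.
        return zlib.decompress(self._blobs[self._files[name]])

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())

store = SingleInstanceStore()
report = b"quarterly report " * 1000
store.put("alice/report.doc", report)
store.put("bob/report-copy.doc", report)   # identical content, no second copy kept
```

Note that `get` has to decompress on every read, which is exactly the rehydration cost discussed above and the reason such a policy is aimed at static data.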

The real goal of Deduplication 2.0, and I think Dr. Dedupe alluded to this in his post “The Dedupe 2.0 Pundits Are Still Swimming in Lake 1.0”, is that customers win when deduplication technology is a part of the core system or file system, so that I no longer need to rehydrate data as I move it from primary storage to secondary storage.  If each storage device in the ‘stack’ understands the language of the device ahead of it in the stack, and the ‘deduplication’ or file system is coordinated and cumulative from device to device, then the customer is the winner.  This pertains to primary storage, backup storage and archive storage.  Never having to rehydrate data allows for more efficiency and a reduced tax on devices, which can save the end user money.

Tom Cook, CEO of Permabit, points out in his blog post “Dedupe 1.0 vs. Dedupe 2.0: The debate ensues” that the only value of deduplication for primary storage is to move your data to a deduplicated archive, which allows you to store data efficiently for the long term.  I agree with that, but as we have seen, it is not that practical.  Why?  Because at the end of the day, the costs to manage storage are going up, up, up and the costs to buy storage are going down, down, down.  End users (NOT IT) are generally lazy or, should I really say, just too busy to manage this storage.  In order to properly archive data, you need to have a policy that tells you what to move and when to move it.  IT can make all the recommendations in the world about the value of archive, but if users, or really line of business managers, don’t tell IT what data is important and what can be archived, then IT doesn’t really have a choice, which makes the premise of moving data to an archive, deduplicated or not, moot.

The real issue is balancing capacity optimization (the granularity at which you deduplicate data) against performance on the appropriate tier of data, given that deduplication will happen on all tiers of storage.  The higher the performance requirements (tier 1), the less ‘optimized’ I make the data; the lower the performance requirements (tier x, archive), the more optimized I make the data.  The benefits to the customer are that A) data is optimized consistently on each of their devices, and B) the optimization is cumulative from device to device, removing silos of deduplicated data across the stack.

For more on tiered dedupe, read my Betamax Redux blog post on EMC’s vision for deduplication and hopefully this will put you on a high performance ‘Road to Recovery’.


A Data Protection Reference Architecture – The Final Chapter

September 1st, 2009 Steve Kenniston 2 comments

The Architecture

This ‘architecture’ diagram, as you can see, is not a typical architecture diagram, but hopefully it can be used to align your business and business objectives with the technologies that are available and can best be applied to solve your issues, helping to balance cost, complexity and compliance.

This diagram can also be used to do a couple of other things.  It can help you begin to classify your data and align your data to your business objectives.  It also lets you begin to identify which data or data services in your environment may be more important to you than others and, based on this, helps you choose areas you may want to outsource or move to the cloud.

As you can tell, there really is not one solution for meeting all your data protection needs.  The challenge comes with managing multiple solutions in an effort to meet your business objectives.  While there are only a few technologies available that allow you to manage your environment across all your RPOs and RTOs, it is important that I point out EMC’s NetWorker is able to do this, centralizing your data protection infrastructure for ease of management.  It allows you to manage traditional backup, source based deduplicated backup with Avamar, CDP with RecoverPoint, as well as the EMC disk libraries and tape where the data is stored.  Now, I am not saying that NetWorker solves all of your data protection challenges, nor am I suggesting that replacing one traditional backup technology with another is the right answer, but what I am saying is that if you’re looking to have all the features required to meet all your business objectives and you want easier management, NetWorker is one avenue to get you there.  Additionally, the underlying image of the triangle represents data protection management.  Putting all the new technology in place is one thing; managing it and ensuring you are now meeting your business needs is another.  EMC’s Data Protection Advisor can help here as well.

This diagram can help customers lay out a new, better data protection schema for their environment and start thinking about data protection a bit more strategically versus tactically.  It can also help vendors speak to customers about how they should look at their environment in order to identify specific challenges and the means they need to alleviate these challenges, taking backup beyond.


A Data Protection Reference Architecture – Part 1

August 14th, 2009 Steve Kenniston No comments

This blog will have multiple parts.  I will introduce my view of a data protection reference architecture and the next few blog posts will talk to components of that architecture.

The other day I had a very interesting conversation with a colleague of mine in Australia.  He was looking for a data protection reference architecture that he could use to speak to his customer.  As you can imagine, having this conversation over the phone could prove difficult.  When the conversation began, my fear was he was looking for an ‘architecture’ diagram that included data protection appliances, backup servers, disk libraries, tape libraries and backup agents.  I quickly realized that this is an impossible conversation to have with him without knowing:

A)     the customer’s environment or challenges

B)      the customer’s business objectives

I find that most vendors don’t know A or B when speaking to a customer about their data protection ‘issues’, but they really should.  Having a more thoughtful conversation with customers in a consultative fashion is more relevant to customers in understanding their challenges and helping to align these challenges to the best possible solution.

I started my conversation with the diagram shown below (Figure 1): a simple triangle divided horizontally into 4 segments, with the middle two segments divided vertically in half.  Each segment represents different business objectives within a company.  As you go around the triangle, you can see that there are different technologies and different methodologies for attacking data protection challenges, which is why there is no longer a “one size fits all” approach when it comes to protecting data today.  Let’s face it: the two most important commodities in backup are time and capacity.  One of the primary drivers behind the type of protection that is used is the Recovery Point Objective or RPO.  Different technologies provide different RPOs, each has a different price point, and there are different processes that can be applied to attack RPOs.

Figure 1

Having a conversation specific to this diagram can have a tremendous amount of value on a number of fronts, including aligning technology needs with business objectives, highlighting critical pain points, and beginning a roadmap that helps implement data protection technology based on business needs and budget and puts you on the Road to Recovery.

The next post will cover the foundation of the triangle – Archive.


What Happened in Vegas, Stayed in Vegas

June 21st, 2009 Steve Kenniston No comments

Well, until now.  This is an interesting story about archiving and how it could have, but didn’t, help a friend of mine.

Often, when speaking with customers, I talk to them about the 4 fundamental principles with regard to data protection:

  1. Assess
  2. Archive
  3. Backup
  4. Manage

The assessment phase is a multi-dimensional phase.  It’s about people, process and technology.  As with most things, the technology piece is the easy piece.  EMC has tools that allow us to scan file systems, databases and email systems and that report back a litany of information including but not limited to:

  • Number of files
  • Age of files
  • Volume of data
  • Owner of the data
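
To give a feel for the kind of report listed above, here is a minimal Python sketch that walks a file system tree and summarizes those four items.  It is not EMC’s tooling; the function name and the returned field names are assumptions for illustration, and ‘owner’ is reported as the numeric user id from the file’s metadata.

```python
import os
import time
from collections import Counter

def assess_tree(root, now=None):
    """Summarize file count, age, volume and ownership under `root`."""
    now = time.time() if now is None else now
    file_count = 0
    total_bytes = 0
    oldest_days = 0.0
    bytes_by_owner = Counter()   # numeric user id -> bytes owned
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue   # file vanished or is unreadable; skip it
            file_count += 1
            total_bytes += st.st_size
            # Age is measured from the last modification time, in days.
            oldest_days = max(oldest_days, (now - st.st_mtime) / 86400)
            bytes_by_owner[st.st_uid] += st.st_size
    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "oldest_days": oldest_days,
        "bytes_by_owner": dict(bytes_by_owner),
    }
```

Calling `assess_tree("/some/share")` returns a dictionary IT could take into the conversation with line of business managers; the numbers say nothing about the value of the data, which is the hard part described next.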

Once EMC passes the information to the customer about their data, the real hard work begins.  Armed with the information, IT now has to go and speak to line of business managers in order to determine the value of the data, and how data of a specific value needs to be managed and protected.  The problem is line of business managers want everything saved forever, until IT tells them what the bill would be.  IT begins to describe the different ‘classes’ of service capabilities and line of business managers, who don’t really care about the details (not because they don’t care, they are just too busy), finally say “Just give me the highest level of protection I can get for the least amount of money.”  IT now does the best they can to align their perceived value of the data, to the most appropriate backup and archive capabilities they have.

Now, in Vegas, I think we can all agree that video surveillance has a ton of value to the stakeholders of the hotels and casinos.  With the amount of debauchery that takes place in Vegas and the amount of money that is ‘rolling’ around, it is important to ‘know what is going on’ and to make sure all situations can be handled as efficiently as possible.  This is where video surveillance comes into play, and the more you ‘save’ on high speed disk, the easier it is to get to the truth or solve the mystery.

The exception is that this data is not available for just any general purpose.  Case in point.  A good friend of mine, let’s call him ‘Josh’, was running around Vegas one evening having a grand time.  He and some friends ran into a group of young ladies and had a great time seeing the sights of Vegas for the rest of the evening.  As the night was winding down and people were going back to their hotels, Josh, being a very nice guy, decided to ensure his ‘date’ made it back to her hotel safely.  He rode with her in the cab and then walked her to her hotel room.  Now, if any of you have been to Vegas, you know that from the cab stand to the room can be a mile, and you will take one of several elevators and walk down one of many corridors to a hotel door that looks exactly like the other 3500 in the building.

The young lady asked Josh in to talk and to say good night, and as time went past, they talked all night until they fell asleep.  Josh, having to catch a flight the next afternoon and not wanting to wake anyone, decided to quietly leave early in the morning.  Josh then took a cab back to his hotel, and when he went to pay the cab driver, he realized that his wallet was gone.  After calling all the places they had been the night before, Josh was convinced that he had left or lost the wallet in the young lady’s hotel room and decided to call her.  First problem: he didn’t know the room number.  He didn’t even remember the floor she was on.  Josh went back to the hotel and started to go up and down the elevators and walk the halls looking for anything that looked familiar so he could knock on the door and ask if he had lost his wallet in the room.  After a few hours of walking the halls, he had his first great idea: instead of walking throughout the hotel, how about calling every room?  As he started doing that, he realized he still had about 2500 more rooms to call, and with his cell running out of juice and not wanting to be a spectacle in the lobby, he had his second brilliant idea: ask the security department if he could have a look at the video surveillance to see which floor he went to the night before and which hallway he walked down, so he could, perhaps, more easily find his wallet.

Well, the security department was less than sympathetic to Josh’s request (I would bet they get this question a lot).  In fact, the security department would not even comment on whether or not they had video cameras covering the different areas of the hotel, for ‘security reasons’.  (Reminds me of a time when I worked at VERITAS and we sold some software to Bank of NY, who told us not to divulge what they had purchased because they considered this piece of technology a competitive edge.)

Defeated, Josh left his name with the hotel and went back to his own hotel.  By then it had been over 7 hours of searching, and it was just moments before checkout and his trip to the airport.

Just goes to show you: having the data doesn’t always put you on the Road to Recovery.

(BTW: Josh got a call on the way to the airport, the hotel ‘found’ his wallet and would be mailing it to him.  What a relief.)

Categories: Archive, EMC

Information Classification – IT’s Hardest Job

April 16th, 2009 Steve Kenniston No comments

I have decided that information today is like a group of friends. If you look at my LinkedIn page or my Facebook page you see that I have over 600 connections and over 180 friends respectively. What does this really mean? Obviously I don’t stay in touch with all of these people. So why do we have these connections? I think it is because we believe that in the future, each one of these connections will offer some kind of value to us. It may be that they will be a friend to us, they may share common experiences to help us through a personal issue, or they may help us find a mate or even a job. We just don’t know, so we hang on to the connection.

This is not unlike information. We are all tired of hearing that “data is growing at an exponential rate” but we never look at why. It is simple. We believe that ‘someday’ we may need that ‘valuable’ piece of content, so we had better not delete it. More importantly, the people who are accountable for managing that data (IT) are usually one step removed from the ‘value’ discussion, so rather than delete anything and be responsible for “losing data” they save and protect everything.

Recently I spent 4 hours on my Facebook page ‘categorizing’ my friends. I created a number of categories, friends from high-school, friends from college, colleagues from work (current), colleagues from work (past), industry connections and relatives. As you can imagine there are some friends that belong in more than one category – so how do I choose which one they should go in? Also, what happens if I change jobs? Where do the ‘colleagues (work)’ friends go? When do I move them? Do I remember to move them?

I have often said when presenting to customers, “EMC can help you with all aspects of your data except for one thing. EMC will never know the value of a piece of your content to you. You have to tell us, and then we can manage it properly.” Typically when customers hear that statement, they agree, but they also agree that the process of classifying data is a daunting task. You can see the challenge of just organizing friends in Facebook. There are so many permutations of how data can be classified that IT chooses the path of least resistance: store and protect everything.

While storing and protecting everything is easy, it also runs straight into the three biggest challenges IT is faced with: cost, complexity and compliance. These three vulnerabilities are the toughest to balance because not only are they important in their own right, they are also interdependent. As data grows, it becomes harder to protect, which means IT either needs to spend more money or be out of compliance.

The cycle is only broken when new processes are introduced. These processes are a part of a key message when it comes to data protection: assess (classify), archive, backup, manage. Only when customers accept that keeping cost, complexity and compliance in check requires a new process can the cycle be broken. Once new processes are in place, the data center can become more efficient.

Consider this analogy: In July 1936 Henry Phillips received a patent on a new type of screw and screwdriver he had invented. This new “technology” changed the world of mass production and machine repair.

He didn’t set out to make life easier for users of hand tools; he was trying to solve an industrial problem. The new screw and screwdriver were designed for use with power tools, and more specifically power tools on an assembly line.

The slot in the screw allowed the tool to seat itself automatically when contact is made, which saves a second or two per screw, and if you have hundreds or thousands of screws, as in cars or airplanes, it saves a great deal of time.

In 1938 Henry was able to get the American Screw Company to spend $500,000 to develop a manufacturing process around the new screw. By 1940 nearly all of the American manufacturers had switched to the new process and the new screws. It made the assembly of military aircraft and jeeps much more efficient. Having these vehicles made faster and more efficiently contributed to a competitive advantage.

So, it’s like I say when talking to customers: “The hardest thing to change in the data center is not technology, it is process”. Once the psychological inertia of dealing with a new process is overcome, progress can be made.

Once customers start to classify their information (assign value to it), they can begin to archive their ‘old’ data.  This will still provide them access to it, just not as quickly. Once this data is removed from the backup stream, backups will run much more efficiently. Additionally, deploying new technologies such as deduplication for specific data types (identified during a proper classification effort) allows IT to back up specific data types in specific areas more efficiently and at much lower cost. Now that all the work has gone into establishing a new set of processes, IT will want to continue to manage them to ensure that all the hard work they have done delivers tangible business capabilities. New processes can help IT attack cost, complexity and compliance, but it all starts with information classification.

Posted by Steve Kenniston


Don’t forget to Archive

April 2nd, 2009 Rob Emsley No comments

Hello, my name is Rob and I’ve been recovering for many years.  Recovering data, that is :-)

Before considering many of the new innovations to help improve backup I suggest that you look at implementing an archive first. This will reduce your primary storage usage dramatically and make backup easier.

At EMC we started archiving our employee e-mail at the start of 2007. Personally, this meant no more management of PST files. Management involved creating PST files on my notebook, manually moving e-mails and then performing my own backups to ensure that I always was able to recover. Basically, I was my own backup administrator.

Today EMC announced EMC SourceOne, a new family of products for archiving, e-discovery and compliance.

  • EMC SourceOne Email Management archives e-mail from Microsoft Exchange and IBM Lotus Notes/Domino as well as SMTP and instant messages to improve operational efficiency of messaging systems, reduce production, storage and backup costs and enhance message retrieval and system recoveries.
  • EMC SourceOne Discovery Manager provides high volume discovery search and collection for e-mail archived by the SourceOne Email Management. It can quickly find, safely hold, efficiently cull and defensibly produce archived e-mail in response to legal/regulatory notice and/or corporate policy complaint. Discovery Manager is built around a legal matter or case metaphor and supports secure authorized investigator access, defensible collection results and chain of custody.
  • EMC SourceOne Discovery Collector is an indexing appliance that automates the in-house identification, collection, preservation, and policy management of unstructured content that resides on data sources such as desktops, laptops, common Internet file systems (CIFS) and network file systems (NFS), network attached storage, Microsoft Exchange, SharePoint and other content management repositories.

I would describe EMC SourceOne Email Management as a 2nd generation e-mail archiving product, delivering an architecture capable of supporting even the most demanding requirements, especially as e-mail continues to be a critical application for most customers.  SourceOne components can be deployed on just a single server or distributed across multiple physical or virtual servers. To support the EMC user community, the new product is being implemented on a VMware ESX infrastructure, which will allow for easy configuration changes.

No more e-mail backups for me, as all the messages I keep are either stored on our Exchange servers or archived onto our EMC Centera storage.  One less thing for me to worry about.

Posted by Rob Emsley

Categories: Archive

Road to Recovery

February 14th, 2009 Steve Kenniston No comments

Our domain, Backup & Beyond, was the tagline for Avamar Technologies, a company EMC acquired in November of 2006.  This tagline was very fitting from a data protection standpoint because Avamar utilized a traditional client / server architecture to protect data, but with a twist.  Avamar utilizes a more intelligent client side agent that provides source based, variable block deduplication to enable the most efficient backups available in the market for more than 80% of a data center’s data.  Avamar also leverages this same technology to replicate this data between disk based backup targets, thereby dramatically reducing the reliance on tape.  This new technology, which has enabled new processes, is taking backup beyond.

The title of our blog, Road to Recovery: well, like every good title it is a play on words and trust me, as with every title it took us a while to come up with it.  That said, the industry has been talking about the fact that backup is really about recovery.  The same can be said for other data protection tools.  This is why our goal is to talk about methodologies (technologies and processes) that help you to recover data.  When IT professionals are polled, they often say that data protection (backup) is still the number one issue they have in the data center.  We say it is time to step up, admit it and start down the ‘Road to Recovery’ when it comes to your data protection environment.

Let us know what your challenges are.  We are here for you as your support system, and we welcome your comments and questions.
