Thursday, August 16, 2012

Packet Capture Retention Policy???

How long should I store packet captures? How much storage should I provision to monitor a 10Gbps link? When is NetFlow enough, and when do I need to capture at the packet level? These are questions network operations managers everywhere are asking, because unfortunately best practices for network data retention policies are hard to find. Whereas CIOs now generally have retention policies for customer data, internal emails, and other kinds of files, and DBAs generally know how to implement those policies, the right retention policy for network capture data is less obvious. The good news is that there are IT shops out there that are ahead of the curve and have figured a lot of this out.

Background

To begin with, it's important to clarify for your own organization what the goals are for network history. Some common answers include:

- Respond faster to difficult network issues
- Establish root cause and long-term resolution
- Contain cyber-security breaches
- Optimize network configuration
- Plan network upgrades

You may notice that the objectives listed above differ in who would use them: stakeholders could include Network Operations, Security Operations, Risk Management, and Compliance groups, among others. While these teams often operate as silos in large IT shops, in best-practice organizations they cooperate to create a common network-history retention policy that cuts across those silos (and in the most advanced cases, they have even begun to share network-history infrastructure assets, a topic we discussed here).

Some of your objectives may be met by keeping summary information – events, statistics, or flow records, for example – and others commonly require keeping partial or full packet data as well. A good retention policy should address the different types of network history data, including:

- Statistics
- Events
- Flow records – sampled
- Flow records – 100%
- Enhanced flow records or metadata (such as IPFIX, EndaceVision metadata, etc.)
- Full packet data – control plane
- Full packet data – select servers, clients, or applications
- "Sliced" packet headers – all traffic
- Full packet data – all traffic

Generally speaking, the items at the top of the list are smaller and therefore cheaper to keep for long periods of time, while the items at the bottom are larger and more expensive to keep, but much more general. If you have the full packet data available you can re-create any of the other items on the list as needed; without it you can answer only a subset of questions. That leads to the first principle: keep the largest objects (like full packet captures) for as long as you can afford (which is generally not very long, because the data volumes are so large), and keep summarized data for longer.
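To make that tiering concrete, here is a minimal sketch (purely illustrative, not tied to any particular product) of a retention policy expressed as data. The type names and figures simply restate the best-practice numbers discussed in the rest of this post.

```python
from datetime import timedelta

# Illustrative only: the bulkier the data type, the shorter it is kept.
# Figures mirror the best-practice retention periods discussed below.
RETENTION_POLICY = {
    "full_packets_all_traffic":    timedelta(hours=72),
    "full_packets_control_plane":  timedelta(days=30),
    "full_packets_select_traffic": timedelta(days=30),
    "flow_records_100_percent":    timedelta(days=120),
    "flow_records_sampled":        timedelta(days=730),
    "statistics_and_events":       timedelta(days=730),
}

def expired(data_type, age):
    """True if an object of `data_type` older than `age` can be deleted."""
    return age > RETENTION_POLICY[data_type]
```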
Next, you should always take guidance from your legal adviser. There may be legal requirements arising from regulation (PCI, Rule 404, IEC 61850, etc.), e-discovery, or other sources; this article is not meant to be legal advice. That said, in the absence of specific legal requirements that supersede, here are the best practices we're seeing in the industry, working the list from bottom to top.

Packet data for all traffic: 72 hours

Full packet data or "sliced" packet headers? The choice here will depend on how tightly controlled your network is and on what level of privacy protection your users are entitled to. For highly controlled networks with a low privacy requirement, such as banking, government or public utilities, full packet capture is the norm. For consumer ISPs in countries with high privacy expectations, packet header capture may be more appropriate. General enterprise networks fall somewhere in between. Whichever type of packet data is being recorded, the goal consistently stated by best-practice organizations is a minimum of 72 hours of retention, enough to cover a 3-day weekend. For the most tightly controlled networks, retention requirements may be 30 days, 90 days, or longer.

Packet data for control plane & for select traffic: 30+ days

Control plane traffic can be extremely useful in troubleshooting a wide variety of issues. It is also traffic that is owned by the network operator, not the customer, so even networks that don't record all traffic should keep history here. Traffic types of interest include, for example:

- Routing protocols (OSPF, IS-IS, EIGRP, BGP; plus protocols like RSVP, LDP, BFD, etc. in carrier networks)
- L2 control plane (ARP, spanning tree, etc.)
- ICMP
- DHCP
- DNS
- LDAP, RADIUS, Active Directory
- Signaling protocols like SIP, H.225.0, SCCP, etc.
- GTP-C in mobile networks

In addition to control plane traffic, every network has particular servers, clients, subnets, or applications that are considered especially important or especially problematic. For both control-plane and network-specific traffic of interest, organizations are storing a minimum of 30 days of packet data, and some store this kind of data for up to a year.
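As a simple illustration, the sketch below drives tcpdump from Python to record a subset of control-plane traffic with hourly file rotation. The interface name, output path, and protocol list are placeholders to adapt to your own network; in practice this recording would more often run on a dedicated probe or capture appliance rather than an ad-hoc script.

```python
import subprocess

# Placeholders: adjust the interface and output path for your environment.
INTERFACE = "eth0"
OUTPUT = "/var/captures/control-plane-%Y%m%d-%H%M%S.pcap"

# BPF filter for a subset of common control-plane traffic:
# ARP, ICMP, DNS (port 53), DHCP (ports 67/68), BGP (TCP 179),
# and OSPF (IP protocol 89). Extend as needed for your network.
CONTROL_PLANE_FILTER = (
    "arp or icmp or port 53 or port 67 or port 68 "
    "or tcp port 179 or ip proto 89"
)

# -G 3600 rotates the capture file every hour; a separate retention job
# can then delete files older than the policy allows (30+ days here).
subprocess.run(
    ["tcpdump", "-i", INTERFACE, "-G", "3600", "-w", OUTPUT,
     CONTROL_PLANE_FILTER],
    check=True,
)
```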
Flow records @ 100%: 120+ days

Best-practice organizations record either enhanced metadata (such as that collected by EndaceVision) or at least basic NetFlow v5/v9/IPFIX. This flow data is useful for a wide variety of diagnosis and trending purposes. Although a few router models can generate flow records on 100% of traffic, best practice is to offload this function onto a dedicated probe appliance connected to the network via tap, SPAN or matrix switch. The probe appliance both relieves the router or switch and enriches the flow data with DPI / application-identification information. Best practice here is to store at least 120 days of flow data. (We have seen organizations that keep 100% flow records for as long as seven years.)

Samples and summaries: 2 years or more

sFlow or sampled NetFlow, using 1:100 or 1:1000 packet sampling, can be useful for some kinds of trending and for detecting large-scale Denial of Service attacks. There are significant known problems with sampled NetFlow, so it is not a replacement for 100% flow records, but it is still useful for some purposes. Summary traffic statistics – taken hourly or daily, by link and by application – can also help in understanding past behavior and predicting future trends. Because this data takes relatively little space, and because it is mostly useful for trending, organizations typically plan to keep it for a minimum of two years.

One point to remember in maintaining history over periods of a year or longer is that network configurations may change, creating discontinuities. It is important to record every major topology or configuration change alongside your traffic history data, so you don't compare incomparable data and draw the wrong conclusions.

Average vs Peak vs Worst-case?

One challenge in sizing network-history storage capacity is that well-designed networks run well below 100% capacity most of the time, but in times of stress (which is when network history is most valuable) they may run much hotter. Should you size for 72 hours of typical traffic, or 72 hours of worst-case?

The best practice we've seen here is to make sure your network history system can capture at the worst-case rate, but has enough storage provisioned for the typical rate. The reasoning is that when the network gets very highly loaded, someone will be dragged out of bed to fix it much sooner than 72 hours in, so a long duration of history is not needed; but that person will want to be able to rewind to the onset of the event and see a full record of what was happening immediately before and after, so a system that records all traffic with zero drops is crucial.

Here's an example to make it concrete. Suppose you have a 10Gbps link that averages 1Gbps over a 24-hour period, and 3Gbps over the busiest hour of the day. Then 72 hours of full packet storage at typical load would require 1Gbit/sec x 72 hours x 3600 sec/hour / 8 bits/byte = 32,400 Gbytes, or about 32 terabytes. Under worst-case load, when recording is most important, the link could run at the full 10Gbps, which would fill storage 10 times as fast. The good news is that best practice says you do not need to provision 10x the storage capacity, but you should be using a capture system that can record at the full 10Gbps rate. In a worst-case scenario your storage duration would then be more like 7 hours than 70; but in that kind of scenario someone will be on the case in much less than 7 hours, and will have taken action to preserve data from the onset of the event. The same considerations apply to other types of network history: systems need to be able to process and record at the worst-case data rate, accepting a reduced retention duration.

Other considerations

The discussion above slightly oversimplifies the case; there are two more important considerations to keep in mind when sizing storage for network history. First, most recording systems store some metadata along with packet captures, and this adds overhead to the storage needed – typically around 20%, though it varies with the traffic mix and the recording product you use. Second, while we say above that you should provision storage for typical load, most organizations actually use projected typical load, extrapolating the traffic trend out to 18-36 months from design time. How far ahead you look depends on how often you are willing to upgrade the disks in your network recording systems. A three-year upgrade cycle is typical, but with disk capacity and costs improving rapidly there are situations where it can be more cost-effective to provision less storage up front and plan to upgrade every 24 months.
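Pulling these pieces together, here is a small Python sketch of the sizing arithmetic for the 10Gbps example above. The 20% metadata overhead, 25% annual growth rate and 3-year horizon are assumed figures for illustration; substitute your own measurements and projections.

```python
def capture_storage_tb(avg_gbps, retention_hours,
                       metadata_overhead=0.20, annual_growth=0.25, years_ahead=3):
    """Terabytes needed to hold `retention_hours` of traffic at the projected
    typical rate, including an allowance for per-packet metadata."""
    projected_gbps = avg_gbps * (1 + annual_growth) ** years_ahead
    gigabytes = projected_gbps * retention_hours * 3600 / 8   # Gbit/s -> GB
    return gigabytes * (1 + metadata_overhead) / 1000

def worst_case_hours(storage_tb, peak_gbps, metadata_overhead=0.20):
    """How long the provisioned storage lasts if the link runs flat out."""
    raw_gb = storage_tb * 1000 / (1 + metadata_overhead)
    return raw_gb * 8 / (peak_gbps * 3600)

if __name__ == "__main__":
    # The figure from the example above: 72 hours at a 1 Gbps average, packets only.
    raw = capture_storage_tb(1.0, 72, metadata_overhead=0.0, annual_growth=0.0)
    # With the assumed metadata overhead and growth projection added.
    sized = capture_storage_tb(1.0, 72)
    print(f"Packets only, 72 h at 1 Gbps average: ~{raw:.0f} TB")
    print(f"With ~20% metadata and 3 years of assumed growth: ~{sized:.0f} TB")
    print(f"At the 10 Gbps worst case that lasts ~{worst_case_hours(sized, 10):.0f} hours")
```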
Implementing the policy

When organizations first take on the challenge of standardizing network-history retention policy, they nearly always discover that their current retention regime is a long way from where they think it needs to be. Typically we have seen that implementing a best-practice retention policy happens in six phases:

1. Create the "idealized" policy describing where you want to be, without regard to the current state.
2. Inventory the current state and identify how far off it is from the ideal.
3. Set targets for 3-6 months, 12 months and 24 months.
4. Over the 3-6 month horizon, take the low-hanging fruit by reconfiguring existing systems to optimize for the new policy, and identify what new technologies will be needed to achieve the chosen retention policy.
5. Over the 12-month horizon, pilot any new technologies that may be required to achieve the long-term policy.
6. Over the 24-month horizon, roll out these technologies network-wide.

Summary checklist

- Bring together stakeholders to develop a common network-history retention policy
- Understand everyone's objectives
- Check with your legal adviser
- Choose what types of data will be kept for what purposes
- Set idealized retention goals for each
- Inventory the current state and its gaps
- Close the gaps over 24 months