I2RS and Remote Triggered Black Holes

In our last post, we looked at how I2RS is useful for managing elephant flows on a data center fabric. In this post, I want to cover a use case for I2RS that is outside the data center, along the network edge—remote triggered black holes (RTBH). Rather than looking directly at the I2RS use case, however, it’s better to begin by looking at the process for creating, and triggering, RTBH using “plain” BGP. Assume we have the small network illustrated below—


In this network, we’d like to be able to trigger B and C to drop traffic sourced from 2001:db8:3e8:101::/64 inbound into our network (the cloudy part). To do this, we need a triggering router—we’ll use A—and some configuration on the two edge routers—B and C. We’ll assume B and C have eBGP sessions up and running to D and E, which are located in another AS. We’ll begin with the edge devices, as the configuration on these devices provides the setup for the trigger. On B and C, we must configure—

  • Unicast RPF; loose mode is okay. With loose RPF enabled, any packet sourced from an address whose best route points to a null destination in the routing table will be dropped (there’s a small sketch of this check just after this list).
  • A route to some destination not used in the network pointing to null0. To make things a little simpler, we’ll point a route for 2001:db8:1:1::/64, a destination that doesn’t exist anyplace in the network, to null0 on B and C; the next hop we set later (2001:db8:1:1::1) will resolve through this route.
  • A pretty normal BGP configuration.
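
To see how these pieces fit together, here’s a minimal Python sketch of the loose uRPF check on an edge router; the table structure and function names are purely illustrative, not any vendor’s implementation:

import ipaddress

# Routing table on an edge router (B or C); "null0" marks a discard route.
rib = {
    ipaddress.ip_network("2001:db8:1:1::/64"): "null0",   # the "magic" discard prefix
}

def resolve(address):
    """Longest-prefix match, followed recursively until we reach an
    interface (here just null0) or run out of routes."""
    addr = ipaddress.ip_address(address)
    matches = [p for p in rib if addr in p]
    if not matches:
        return None
    next_hop = rib[max(matches, key=lambda p: p.prefixlen)]
    return next_hop if next_hop == "null0" else resolve(next_hop)

def loose_urpf_permits(source):
    """Loose uRPF: permit a packet only if its source matches some route
    that does not ultimately resolve to a discard interface."""
    resolved = resolve(source)
    return resolved is not None and resolved != "null0"

# Once the triggered route described below arrives via BGP, it lands in the RIB:
rib[ipaddress.ip_network("2001:db8:3e8:101::/64")] = "2001:db8:1:1::1"
print(loose_urpf_permits("2001:db8:3e8:101::1"))   # False: traffic is dropped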

The triggering device is a little more complex. On Router A, we need—

  • A route map that—
    • matches some tag in the routing table, say 101
    • sets the next hop of routes carrying this tag to 2001:db8:1:1::1
    • sets the local preference to some high number, say 200
  • Redistribution from static routes into BGP, filtered through the route map just described (sketched in the example below).
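
As a rough sketch of what this policy does (the data structures here are illustrative, not any particular vendor’s configuration model):

# Illustrative model of the trigger policy on A: static routes carrying tag 101
# are rewritten on their way into BGP by the route map.
def redistribute_static_into_bgp(static_routes):
    """Apply the RTBH route map to each static route and return the
    resulting BGP advertisements."""
    advertisements = []
    for route in static_routes:
        if route.get("tag") == 101:                 # match tag 101
            advertisements.append({
                "prefix": route["prefix"],
                "next_hop": "2001:db8:1:1::1",      # the discard next hop
                "local_preference": 200,            # win the BGP best-path decision
            })
    return advertisements

# The trigger described in the next paragraph is just one static route:
statics = [{"prefix": "2001:db8:3e8:101::/64", "next_hop": "null0", "tag": 101}]
print(redistribute_static_into_bgp(statics))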

With all of this in place, we can trigger a black hole for traffic sourced from 2001:db8:3e8:101::/64 by configuring a static route at A, the triggering router, that points at null0 and carries a tag of 101. Configuring this static route will set off the following chain of events—

  • a static route with a tag of 101 will be installed in the local routing table at A
  • this static route will be redistributed into BGP
  • since the route has a tag of 101, the route map will set its local preference to 200 and its next hop to 2001:db8:1:1::1
  • the route will be advertised via iBGP to B and C through normal BGP processing
  • when B receives this route, it will choose it as the best path to 2001:db8:3e8:101::/64 (the local preference of 200 wins), and install it in the local routing table
  • since the next hop on this route is 2001:db8:1:1::1, and the route to 2001:db8:1:1::/64 points to null0, loose uRPF will be triggered, dropping all traffic sourced from 2001:db8:3e8:101::/64 at the AS edge

It’s possible to have regional, per neighbor, or other sorts of “scoped” black hole routes by using different routes pointing to null0 on the edge routers. These are “magic numbers,” of course—you must have a list someplace that tells you which route causes what sort of black hole event at your edge, etc.

Note—this is a terrific place to deploy a DevOps sort of solution. Instead of using an appliance sort of router for the triggering router, you could run a handy copy of Cumulus or snaproute in a VM, and build scripts that build the correct routes in BGP, including a small table in the script that allows you to say something like “black hole 2001:db8:3e8:101::/64 on all edges,” or “black hole 2001:db8:3e8:101::/64 on all peers facing provider X,” etc. This could greatly simplify the process of triggering RTBH.
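
A minimal sketch of what such a script might look like, assuming each scope maps to its own tag (and, on the trigger router, to its own discard next hop); the second scope and its tag are made up for illustration:

# Illustrative RTBH trigger script: translate a human-readable scope into the
# static route the triggering router needs. Each tag is matched by a route-map
# entry that sets a different "magic" discard next hop.
SCOPE_TO_TAG = {
    "all-edges": 101,     # mapped to 2001:db8:1:1::1 by the route map on A
    "provider-x": 102,    # a second, hypothetical magic next hop for provider X peers
}

def trigger_black_hole(prefix, scope):
    """Return the static route that, once configured on the triggering
    router, black holes traffic sourced from `prefix` for the given scope."""
    return {"prefix": prefix, "next_hop": "null0", "tag": SCOPE_TO_TAG[scope]}

# "black hole 2001:db8:3e8:101::/64 on all edges"
print(trigger_black_hole("2001:db8:3e8:101::/64", "all-edges"))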

Now, as a counter, we can look at how this might be triggered using I2RS. There are two possible solutions here. The first is to configure the edge routers as before, using “magic number” next hops that resolve to the null0 interface to trigger loose uRPF. In this case, an I2RS controller can simply inject the correct route directly into the routing table at each edge eBGP speaker to block the traffic. There would only need to be one such route; the complexity of choosing which peers the traffic should be black holed on could be contained in a script at the controller, rather than dispersed throughout the entire network. This allows RTBH to be triggered on a per-edge-speaker basis with no additional configuration on any individual edge router.
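
As a rough sketch of the controller side (the router inventory, the scoping helper, and the inject_route callable are all hypothetical stand-ins for whatever RIB interface the I2RS controller actually exposes):

# Hypothetical controller-side logic: decide which edge speakers should drop
# the traffic, then inject one route per device. inject_route stands in for
# the controller's real RIB-manipulation call.
EDGE_ROUTERS = {
    "B": {"peers": ["D"]},
    "C": {"peers": ["E"]},
}

def routers_facing(peer):
    """Scope helper: edge routers with an eBGP session to the given peer."""
    return [name for name, info in EDGE_ROUTERS.items() if peer in info["peers"]]

def black_hole(prefix, routers, inject_route):
    """Inject a route for `prefix` whose next hop is the discard prefix already
    configured on each edge router, so loose uRPF drops the traffic there."""
    for name in routers:
        inject_route(router=name, prefix=prefix, next_hop="2001:db8:1:1::1")

# "black hole 2001:db8:3e8:101::/64 only on the edge facing D"
black_hole("2001:db8:3e8:101::/64", routers_facing("D"),
           inject_route=lambda **kw: print("inject:", kw))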

Note the dynamic protocol isn’t being replaced in any way. We’re still receiving our primary routing information from BGP, including all the policies available in that protocol. What we’re doing, though, is removing one specific policy point out of BGP and moving it into a controller, where it can be more closely managed, and more easily automated. This is, of course, the entire point of I2RS—to augment, rather than replace, dynamic routing used as the control plane in a network.

Another option is to inject a route that explicitly filters packets sourced from 2001:db8:3e8:101::/64 directly into the RIB using the I2RS filter-based RIB model. This is a more direct method, for those edge devices that support it.

Either way, the I2RS process is simpler than using BGP to trigger RTBH. It gathers as much of the policy as possible into one place, where it can be automated and managed in a more precise, fine-grained way.


Thoughts on the Tomahawk II

Broadcom released some information about the new Tomahawk II chip last week in a press release. For those who follow hardware, there are some interesting points worth considering here.

First, the chip supports 256x25G SERDES. Each pair of 25G SERDES can be combined into a single 50G port, allowing the switch to support 128 50G ports. Sets of four SERDES can be combined into a single 100G port, allowing the switch to support 64 100G ports.
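
The lane arithmetic is easy to check (only the 256-lane, 25G figure comes from the press release; the rest follows from it):

# Port combinations available from 256 lanes of 25G SERDES.
lanes, lane_speed_g = 256, 25

for lanes_per_port in (2, 4):                  # 50G and 100G ports
    print(f"{lanes // lanes_per_port} x {lanes_per_port * lane_speed_g}G ports")

print(f"aggregate: {lanes * lane_speed_g / 1000} Tbps")   # 6.4 Tbps either way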

Second, there is some question about the table sizes in this new chip. The press release notes the chip has “Increased On-Chip Forwarding Databases,” but doesn’t give any precise information. Vendors who wrap sheet metal around the chipset to build a complete box don’t seem to be too forthcoming about this aspect of the new chip, either. The Tomahawk line has long had issues with its nominal 100,000-entry forwarding table limit, particularly in large-scale data center fabrics and applications such as IX fabrics. We’ll simply have to wait to find out more about this aspect of the new chip, it seems.

Third, there is some question about the forwarding buffers available on the chip. Again, the Tomahawk line has long been known for its very shallow buffers. While these generally aren’t a problem in well tuned data center fabrics (in fact, to the contrary, you don’t want a lot of buffering in time sensitive applications), there are situations where deeper buffers would be useful. The press release notes the new chip has “Higher Capacity Embedded Packet Buffer Memory,” but gives no details of what those larger buffers might look like.

Russ’ Take

While other vendors have supported 64x100G ports on their chips for the last year or so, this brings the 64x100G platform to a wider array of manufacturers, equaling many mid-range (and potentially high-end) switches available from mainline vendors. The fallout should be interesting in these areas; as Broadcom and other merchant silicon makers move up the performance scale, vendors are going to need to respond in some way. Analytics seems to be a natural path, but it won’t be long before chips in this area are also available from merchant silicon shops. Even this new chip, Broadcom says, has “Enhanced BroadView network telemetry features.” The value vendors add is being salami sliced into ever thinner pieces. This has implications for vendor partners, as well, of course—as the vendor’s added value drops through commodity components, there’s less “residual value” for a VAR to add.

It’s not all peaches and cream, though; there are still questions to ask, and things to think about. One issue with a switch operating at this density will be the faceplate: it’s difficult to see how the port counts on this chip will fit into a 1RU (pizza box sized) device, so manufacturers are going to need to be creative to build boxes in a smaller footprint that take full advantage of the potential here. The problems aren’t limited to physical real estate in the 1RU format; power consumption and heat dissipation, particularly in the optical units, matter as well. It will be interesting to see what develops on this front.

This new chip looks interesting for a number of reasons, particularly for disaggregators and vendors who bundle the chip with their hardware and software to deliver a complete package. At the very least, merchant silicon is nipping at the tail of many vendor custom products—just another sign of how quickly the network market is changing.


On the ‘net: Software Patents

What should we do with software patents? I’ve seen both sides of the debate, as I work a great deal in the context of standards bodies (particularly the IETF), where software patents have impeded progress on a community-driven (and/or community-usable) standard. On the other hand, I have been listed as a co-inventor on at least 40 software patents across more than twenty years of work, and have a number of software patents either filed or in the process of being filed. —CircleID


Fabric versus Network: What’s the Difference?

We often hear about fabrics, and we often hear about networks—but on paper, and in practice, they often seem to be the same thing. Leaving aside the many realms of vendor hype, what’s really the difference? Poking around on the ‘net, I came across a couple of definitions that seemed useful, at least at first blush. For instance, SDN Search provides the following insight:

The word fabric is used as a metaphor to illustrate the idea that if someone were to document network components and their relationships on paper, the lines would weave back and forth so densely that the diagram would resemble a woven piece of cloth.

While this is interesting, it gives us more of an “on the paper” answer than what might be called a functional view. The entry at Wikipedia is more operationally based:

Switched Fabric or switching fabric is a network topology in which network nodes interconnect via one or more network switches (particularly crossbar switches). Because a switched fabric network spreads network traffic across multiple physical links, it yields higher total throughput than broadcast networks, such as early Ethernet.

Greg has an interesting (though older) post up on the topic, and Brocade has an interesting (and longer) white paper as well. None of these, however, seems to offer a complete picture. So what is a fabric?

To define a fabric in terms of functionality, I would look at several attributes, including—

  • the regularity and connectedness of the nodes (network devices) and edges (links)
  • the design of the traffic flow, specifically how traffic is channeled to individually connected devices
  • the performance goals the topology is designed to fulfill in terms of forwarding

You’ll notice that, unlike the definition given by many vendors, I’m not too interested in whether the fabric is treated as “one device” or “many devices.” Many vendors will push the idea that a fabric must be treated as a single “thing,” unlike a network, which treats each device independently. This is clever marketing, of course, because it differentiates the vendor’s “fabric offering” from home grown (or built from components) fabrics, but that’s primarily what it is—marketing. While it might be a nice feature of any network or fabric to make administration easier, it’s not definitional in the way the performance and design of the network are.

In fact, what’s bound to start happening in the next few years is that vendors are going to call overlay systems, or vertically integrated systems, fabrics of all sorts, like a “campus fabric” or a “wide area fabric.” Another marketing ploy to watch out for is going to be interplay with the software defined moniker—if it’s “software defined,” it’s a “fabric.” Balderdash.

Let’s look at the three concepts I outlined above in a little more detail.

Topology Regularity

Fabrics have a well-defined, regularly repeating topology. It doesn’t matter whether the topology is planar or nonplanar; what matters is that a regular “small topology” is repeated to build the larger topology.


In these diagrams, A is a regular topology; note you can take a copy of any four nodes, overlay them on any other four nodes, and see the repeating topology. A is also planar, as no two links cross. B is a nonplanar regular (or repeating) topology. C is not a regular topology, as the “subtopologies” do not repeat between the nodes. D is also not a regular topology, and it is nonplanar as well.

The regularity of the topology is a good rule of thumb that will help you quickly pick out most fabrics against most non-fabrics. Why? To understand the answer, we need to look at the rest of the properties of a fabric.

Design of the Traffic Flow

Several of the definitions given in my quick look through the ‘net mentioned this one: in a fabric, traffic is split across many available paths, rather than being pushed onto a smaller number of higher speed paths. The general rule of thumb is—if traffic can be split over a large number of ECMP paths, then you’re probably looking at a fabric, rather than a network. The only way to get a large number of ECMP paths is to create a regularly repeating topology, like the ones shown in A and B above.
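
As a quick illustration, in a two-stage spine-and-leaf fabric every leaf connects to every spine, so the number of equal-cost paths between any pair of leaves is simply the spine count; the numbers below are made up:

# Leaf-to-leaf traffic in a two-stage spine-and-leaf fabric can be split across
# one equal-cost path per spine, because every leaf connects to every spine.
def leaf_to_leaf_ecmp(spine_count):
    return spine_count

for spines in (4, 8, 16):
    print(f"{spines} spines -> {leaf_to_leaf_ecmp(spines)} equal-cost paths")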

Performance Goals

But why does the number of ECMP paths matter? Because fabric performance is normally quantifiable in somewhat regular mathematical terms. In other words, if you want to understand the performance of a fabric, you don’t need to examine the network topology as a “one-off.” Perhaps a better way to say this: fabrics are not snowflakes in terms of performance. You might not know why a particular fabric performs a certain way (theoretically), but you can still know how it’s going to perform under specific conditions.

The most common case of this is the ability to calculate the oversubscription rate on the fabric: what amount of traffic can the network switch without contention, given the traffic is evenly distributed across sources and receivers? In a fabric, it’s easy enough to look at the edge ports offered, the bandwidth available to carry traffic at each stage, and determine at what level the fabric is going to introduce buffering as a result of link contention. This is probably the crucial defining characteristic of a fabric from a network design perspective.
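
For instance, here is a minimal sketch of the calculation for a single leaf in a two-stage fabric; the port counts and speeds are made-up numbers, not a recommendation:

# Oversubscription at a leaf: edge-facing bandwidth divided by fabric-facing
# (uplink) bandwidth. 1:1 means no contention for evenly distributed traffic;
# anything higher means buffering once the edge ports are driven hard enough.
def oversubscription(edge_ports, edge_speed_g, uplinks, uplink_speed_g):
    return (edge_ports * edge_speed_g) / (uplinks * uplink_speed_g)

# A leaf with 48 x 25G edge ports and 8 x 100G uplinks:
print(f"{oversubscription(48, 25, 8, 100):.1f}:1")   # 1.5:1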

Another one that’s interesting, and less often considered, is the maximum or typical jitter through the fabric in the absence of contention. If a fabric is properly designed, and the network devices used to build the fabric don’t mess with your math, you can generally get a pretty good idea of what the minimum and maximum delay will be from any edge port to any other edge port on a fabric. Within the broader class of network topologies, this is generally a matter of measuring the actual delays through the network, rather than a calculation that can be done beforehand.

While some might disagree, these are the crucial differences between “any old network topology” and a fabric from my perspective.