Load Balancing Fat MPLS Pseudowires

Cisco Systems, 250 Longwater Ave, Reading RG2 6GB, United Kingdom, +44-208-824-8828, stbryant@cisco.com
Cisco Systems, Brussels, Belgium, cfilsfil@cisco.com
Deutsche Telekom, Muenster, Germany, Ulrich.Drafz@t-com.net

Internet
PWE3
Keywords: pseudowire, MPLS, Internet-Draft

Where the payload carried over a pseudowire carries a number of
identifiable flows it can in some circumstances be desirable to carry
those flows over the equal cost multiple paths that exist in the packet
switched network. This draft describes a method of identifying the
flows, or flow groups, to the label switched routers by including an
additional label in the label stack.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC2119.

A pseudowire is defined as a mechanism that carries the essential
elements of an emulated service from one provider edge (PE) to one or
more other PEs over a packet switched network (PSN).

A pseudowire is normally transported over one single network path,
even if multiple Equal Cost Multiple Paths (ECMP) exist between the
ingress and egress PEs. This is required to preserve the
characteristics of the emulated service (e.g. to avoid misordering,
for example for SAToP pseudowires).
Except in the extreme case described in Section 6, the new capability
proposed in this draft does not change this default property of
pseudowires.

Some pseudowires are used to transport large volumes of IP traffic
between routers at two locations. One example of this is the use of an
Ethernet pseudowire to create a virtual direct link between a pair of
routers. Such pseudowires may carry from hundreds of Mbps
to Gbps of traffic. Such pseudowires do not require strict
ordering to be preserved between packets of the pseudowire. They only
require ordering to be preserved within the context of each individual
transported IP flow. Some operators have requested the ability to
explicitly configure such a pseudowire to leverage the availability of
multiple ECMP paths. This allows for better capacity planning as the
statistical multiplexing of a larger number of smaller flows is more
efficient than with a smaller set of larger flows. Although Ethernet is
used as an example above, the mechanisms described in this draft are
general mechanisms that may be applied to any pseudowire type in which
there are identifiable flows, and in which there is no requirement
to preserve the order between flows.

Label switched routers commonly hash the label stack or some
elements of the label stack as a method of discriminating between
flows, in order to distribute those flows over the available equal
cost multiple paths that exist in the network. Since the label at the
bottom of stack is usually the label most closely associated with the
flow, this normally provides the greatest entropy and hence is
normally included in the hash. This draft describes a method of adding
an additional label at the bottom of stack in order to facilitate the
load balancing of the flows within a pseudowire over the available
ECMPs. A similar design for general MPLS use has also been proposed.

An alternative method of load balancing by creating a number of
pseudowires and distributing the flows amongst them was considered,
but was rejected because:

o  It did not introduce as much entropy as the load balance label
   method.

o  It required additional pseudowires to be set up and
   maintained.

An additional label is interposed between the pseudowire label and
the control word, or if the control word is not present, between the
pseudowire label and the pseudowire payload. This additional label is
called the pseudowire load balancing label (LB label). Indivisible
flows within the pseudowire MUST be mapped to the same pseudowire LB
label by the ingress PE. The pseudowire load balancing label
stimulates the correct ECMP load balancing behaviour in the PSN. On
receipt of the pseudowire packet at the egress PE (which knows this
additional label is present) the label is discarded without
processing.

Note that the LB label MUST NOT be an MPLS reserved label, but is otherwise unconstrained by the
protocol.

To ensure that the load balance label is not inadvertently
used for forwarding, the load balance label MUST have a TTL of 0.

The Native Service Processing (NSP) function is a component of a PE
that has knowledge of the structure of the emulated service and is able
to take action on the service outside the scope of the pseudowire. In
this case it is required that the NSP in the ingress PE identify flows,
or groups of flows within the service, and indicate the flow (group)
identity of each packet as it is passed to the pseudowire forwarder.
Since this is an NSP function, by definition, the method used to
identify a flow is outside the scope of the pseudowire design.
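
As an illustration only, one way an NSP might group IP packets into flows is to hash the fields that identify a flow (addresses, protocol and ports). The sketch below assumes IPv4 traffic; the `FlowKey` type, the `flow_group_id` function, the use of SHA-256 and the default of 65536 groups are all hypothetical choices made for this example and are not defined by this document.

```python
import hashlib
import socket
import struct
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical IPv4 flow identity (not defined by this draft)."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def flow_group_id(key: FlowKey, groups: int = 1 << 16) -> int:
    """Map a flow to a flow group identifier in the range [0, groups).

    Every packet of the same flow yields the same identifier, so an
    indivisible flow is never split across groups.
    """
    packed = (
        socket.inet_aton(key.src_ip)
        + socket.inet_aton(key.dst_ip)
        + struct.pack("!BHH", key.protocol, key.src_port, key.dst_port)
    )
    digest = hashlib.sha256(packed).digest()
    return int.from_bytes(digest[:4], "big") % groups
```

The identifier returned here is only the flow (group) identity that the NSP would indicate to the pseudowire forwarder; how it is turned into a label is a separate step.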
Similarly, since the NSP is internal to the PE, the method of flow
indication to the pseudowire forwarder is outside the scope of this
document.

The pseudowire forwarder must be provided with a method of mapping
flows to load balanced paths.

The forwarder must generate a label for the flow or group of flows.
How the load balance label values are determined is outside the scope of
this document; however, the load balance label allocated to a flow MUST
NOT be an MPLS reserved label and SHOULD remain constant. It is
recommended that the method chosen to generate the load balancing labels
introduces a high degree of entropy in their values, to maximise the
entropy presented to the ECMP path selection mechanism in the LSRs in
the PSN, and hence distribute the flows as evenly as possible over the
available PSN ECMP paths. The forwarder at the ingress PE prepends the
pseudowire control word (if applicable), then prepends the
pseudowire load balancing label, followed by the pseudowire label.
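
The ingress encapsulation just described can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the CRC-32 hash, the zero traffic class and the `pw_ttl` default are hypothetical, and the PSN tunnel label(s) that would precede the pseudowire label are omitted. The 32-bit label stack entry layout (20-bit label, 3-bit traffic class, bottom-of-stack bit, 8-bit TTL) is the standard MPLS encoding.

```python
import struct
import zlib

FIRST_UNRESERVED_LABEL = 16          # MPLS label values 0-15 are reserved
MAX_LABEL = (1 << 20) - 1            # labels are 20 bits wide


def lb_label_for_flow(flow_id: bytes) -> int:
    """Derive a stable, non-reserved LB label from a flow identifier.

    CRC-32 is an arbitrary choice for this sketch; any hash that spreads
    values well would serve the same purpose.
    """
    span = MAX_LABEL - FIRST_UNRESERVED_LABEL + 1
    return FIRST_UNRESERVED_LABEL + zlib.crc32(flow_id) % span


def label_stack_entry(label: int, tc: int, bottom: bool, ttl: int) -> bytes:
    """Encode one MPLS label stack entry as 4 octets in network order."""
    if not 0 <= label <= MAX_LABEL:
        raise ValueError("label out of range")
    word = (label << 12) | ((tc & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def encapsulate(payload: bytes, pw_label: int, flow_id: bytes,
                control_word: bytes = b"", pw_ttl: int = 255) -> bytes:
    """Build: PW label | LB label (bottom of stack, TTL 0) | [CW] | payload."""
    lb_entry = label_stack_entry(lb_label_for_flow(flow_id), 0, True, 0)
    pw_entry = label_stack_entry(pw_label, 0, False, pw_ttl)
    return pw_entry + lb_entry + control_word + payload
```

Because the LB label is derived only from the flow identifier, all packets of one flow carry the same label stack and so hash to the same ECMP path in the PSN.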
Alternatively, it prepends the pseudowire control word (if applicable),
then selects and prepends one of the allocated pseudowire labels.

The forwarder at the egress PE uses the pseudowire label to identify
the pseudowire. If the label block approach is used, operation is
identical to the current non-load balanced case. Alternatively, from the
pseudowire context, the egress PE can determine whether a pseudowire
load balancing label is present, and if one is present, the label is
discarded.

All other pseudowire forwarding operations are unmodified by the
inclusion of the pseudowire load balancing label.

The PWE3 Protocol Stack Reference Model modified to include the
pseudowire LB label is shown below.

The encapsulation of a pseudowire with a pseudowire LB label is
shown below.

When using the PWE3 signalling procedures, there is a Pseudowire Interface Parameter
Sub-TLV type used to signal the desire to include the load balance label
when advertising a VC label.

The presence of this parameter indicates that the egress PE requests
that the ingress PE place a load balance label between the pseudowire
label and the control word (or, if the control word is not present,
between the pseudowire label and the pseudowire payload).

If the ingress PE recognises the load balance label indicator parameter
but does not wish to include the load balance label, it need only issue
its own label mapping message for the opposite direction without
including the load balance label indicator. This will prevent inclusion
of the load balance label in either direction.

If PWE3 signalling is not in use for a
pseudowire, then whether the load balance label is used MUST be
identically provisioned in both PEs at the pseudowire endpoints. If
there is no provisioning support for this option, the default behaviour
is not to include the load balance label.

Note that what is signalled is the desire to include the load balance
label in the label stack. The value of the label is a local matter for
the ingress PE, and the label value itself is not signalled.

The structure of the load balance label TLV is as follows:

o  LBL is the load balance label TLV identifier assigned by
   IANA.

o  Length is the length of the TLV in octets and is 4.

The following OAM considerations apply to this method of load
balancing.

Where the OAM is only to be used to perform a basic test that the
pseudowires have been configured at the PEs, VCCV messages may be sent using any
load balance pseudowire path, i.e. over any of the multiple pseudowire
labels, or using any pseudowire load balance label.

Where it is required to verify that a pseudowire is fully functional
for all flows, a VCCV connection
verification message MUST be sent over each ECMP path to the pseudowire
egress PE. This problem is difficult to solve and scales poorly. We
believe that this problem is addressed by the following two methods:

o  If a failure occurs within the PSN, this failure will normally be
   detected by the PSN's IGP (link or node failure, detected via link,
   BFD or IGP hello mechanisms), and the IGP convergence will naturally
   modify the ECMP set of network paths between the ingress and egress
   PEs. Hence the PW is only impacted during the normal IGP convergence
   time.

o  If the failure is related to the individual corruption of an LFIB
   entry in a router, then only the network path using that specific
   entry is impacted. If the PW is load balanced over multiple network
   paths, then this failure can only be detected if, by chance, the
   transported OAM flow is mapped onto the impacted network path, or
   all paths are tested. This type of error may be better
   solved by other means such as LSP self test.

To troubleshoot the MPLS PSN, including multiple paths, the
existing MPLS troubleshooting techniques can be used.

The requirement to load-balance over multiple PSN paths occurs when
the ratio between the PW access speed and the PSN’s core link
bandwidth is large (e.g. >= 0.1). ATM and FR are unlikely to meet
this condition. Ethernet does, and this is why this document
focuses on Ethernet. Applications for other high-access-bandwidth
PWs (e.g. Fibre Channel) may be defined in the future.

This design applies to MPLS pseudowires where it is meaningful to
deconstruct the packets presented to the ingress PE into flows. The
mechanism described in this document promotes the distribution of flows
within the pseudowire over different network paths. This in turn means
that whilst packets within a flow are delivered in order (subject to
normal IP delivery perturbations due to topology variation), order is
not maintained amongst packets of different flows. It is not proposed to
associate a different sequence number with each flow. If sequence number
support is required, this mechanism is not applicable.

Where it is known that the traffic carried by the Ethernet pseudowire
is IP, the methods of identifying the flows are well known and can be
applied. Such methods typically include hashing on the source and
destination addresses, the protocol ID and higher-layer flow-dependent
fields such as TCP/UDP ports, L2TPv3 Session IDs, etc.

Where it is known that the traffic carried by the Ethernet pseudowire
is non-IP, techniques used for link bundling between Ethernet switches
may be reused. In this case however the latency distribution would be
larger than is found in the link bundle case. The acceptability of the
increased latency is for further study. Of particular importance, the
Ethernet control frames SHOULD always be mapped to the same PSN path to
ensure in-order delivery.

If the payload of an Ethernet PW is made of a single inner flow (e.g.
an encrypted connection between two routers), then the functionality
described in this document does not give any benefits, though neither
does it give any drawbacks. This is unlikely to be a show-stopper for
two reasons:

o  Firstly, the customer of a high-bandwidth PW service has an
   incentive to get the best transport service because an inefficient
   use of the PSN leads to jitter and eventually to loss of the
   PW's payload.

o  Secondly, the customer is usually able to tailor their
   applications to generate many flows in the PSN. A well-known example
   is massive data transport between servers which use many parallel
   TCP sessions. This same technique can be used by any transport
   protocol: multiple UDP ports, multiple L2TPv3 Session IDs, or
   multiple GRE keys may be used to decompose a large flow into smaller
   components. This approach may be applied to IPsec where multiple
   SPIs may be allocated to the same security association.

A node within the PSN is not able to perform
deep-packet-inspection (DPI) of the PW as the PW technology is not
self-describing: the structure of the PW payload is only known to the
ingress and egress PE devices. The two methods proposed in this document
solve this limitation.

The methods described in this document are transparent to the PSN and
as such do not require any new capability from the PSN.

The pseudowire generic security considerations and the security considerations applicable to a
specific pseudowire type (for example, an Ethernet
pseudowire) apply.

The ingress PE should take steps to ensure that the load-balance
label is not used as a covert channel.

IANA is requested to allocate the next available value from the IETF
Consensus range in the Pseudowire Interface Parameters Sub-TLV type
Registry as a Load Balance Label indicator.

The congestion considerations applicable to pseudowires
and any additional congestion
considerations developed at the time of publication apply to this
design.

The ability to explicitly configure a PW to leverage the availability
of multiple ECMP paths is beneficial to capacity planning as, all other
parameters being constant, the statistical multiplexing of a larger
number of smaller flows is more efficient than with a smaller number of
larger flows.

Note that if the classification into flows is only performed on IP
packets, the behaviour of those flows in the face of congestion will be
as already defined by the IETF for packets of that type and no
additional congestion processing is required.

Where flows that are not IP are classified, pseudowire congestion
avoidance must be applied to each non-IP load balance group.

The authors wish to thank Joerg Kuechemann, Wilfried Maas, Luca
Martini, Mark Townsley, Kireeti Kompella and Shane Amante for valuable
comments and contributions to this design.