************************************************************************

      ATM Forum Document Number: ATM_Forum/98-0152R1

      ************************************************************************

      Title: Explicit rate control of TCP applications

      ************************************************************************

      Abstract:
      This contribution examines different methods of controlling the  rate
      of  TCP  applications to achieve the goals of fairness, delay control
      and throughput. The key idea is to use  an  ATM  ABR  algorithm  like
      ERICA+  to  calculate rate feedback and use new techniques to control
      TCP sources using the feedback calculated.  Specifically  we  propose
      two  rate-to-window  translation schemes for explicit window feedback
      to TCP and one variant of  an  acknowledgment  bucket  scheme  (which
      controls TCP rate by controlling the rate of acknowledgements). These
      techniques can be applied in ATM edge devices or in Internet routers.
      We  explore  the  dynamics  of TCP traffic under different methods of
      rate control and show through simulations  that  our  techniques  can
      considerably   improve  fairness,  throughput  and  end-to-end  delay
      properties of TCP applications.

      ************************************************************************

      Source:

      Ramakrishna Satyavolu, Ketan Duvedi, Shivkumar Kalyanaraman
      Department of ECSE,
      Rensselaer Polytechnic Institute,
      110, 8th Street, Troy, NY 12180-3590
      Phone: 518-276-8979, Fax: 518-276-2433, Email: shivkuma@ecse.rpi.edu


      The presentation of this contribution at the ATM Forum is sponsored by
      NASA Lewis Research Center.

      The research was supported in part by DARPA contract number:
      F30602-97-C-0274.


      ************************************************************************

      Date: February 1998

      ************************************************************************

      Distribution: ATM Forum Technical Working Group Members (AF-TM)

      ************************************************************************
      Notice:
      This contribution has been prepared to assist the ATM  Forum.  It  is
      offered  to  the Forum as a basis for discussion and is not a binding
      proposal on the part of any of the  contributing  organizations.  The
      statements  are  subject  to change in form and content after further
      study. Specifically, the contributors reserve the right  to  add  to,
      amend     or     modify     the    statements    contained    herein.
      ************************************************************************

      A  postscript  version of this contribution including all figures and
      tables has been uploaded to the ATM Forum ftp server in the  incoming
      directory.   The postscript version is also available through our web
      page:

            http://www.ecse.rpi.edu/Homepages/shivkuma
      ************************************************************************

      1. Introduction:
      ----------------

      There has been considerable interest  in  studying  the  dynamics  of
      legacy  TCP/IP  applications over ATM (see [1,2,3,4,5] and references
      therein).  These  studies  have  indicated  that   an   ABR   service
      implementation  can  provide low-delay, fairness, and high throughput
      inside the ATM network for  ftp-like  applications,  but  the  TCP/IP
      queues  build up at the edge of the ATM network. In contrast, the UBR
      and GFR service implementations allow traffic into the  ATM  network,
      and  if  buffering  is insufficient, tackle the issues of throughput,
      and fairness  through  intelligent  buffering,  scheduling  and  drop
      policies.

      The natural question which arises out of this is how  one  can  carry
      the  benefits  of  ABR  end-to-end,  i.e.,  provide better control of
      delay, throughput and fairness  of  TCP  connections.  An  innovative
      approach   suggested   recently   [4,5]   involves   maintaining   an
      "acknowledgment bucket" (also called "ack-bucket") at the edge device
      (or router) and releasing the acknowledgments (which in turn controls
      the rate of TCP) based upon the  ACR  of  the  VC  carrying  the  TCP
      connection.

      In this contribution, we first  study  a  more  direct  mechanism  to
      control  TCP  sources,  which  is  to control its maximum window size
      based upon the explicit rate calculated by  the  ERICA$+$  algorithm.
      As  described  later in this contribution, this mechanism can be used
      independently or in conjunction with  the  ``acknowledgment  bucket''
      mechanism.   In  this  contribution  we  study  the  feasibility  and
      benefits of  this  mechanism  (in  terms  of  throughput,  delay  and
      fairness), and its impact on TCP traffic dynamics.

      We also identify a variant of the acknowledgement bucket idea whereby
      the  acks are simply clocked out at the rate calculated by our ERICA+
      algorithm. References [4,5] also  use  a  rate  to  control  the  ack
      bucket.  The  ack-bucket  algorithm used in [4] attempts to model the
      TCP source's behavior in the router, and use the model to decide when
      to release the acknowledgements. This method may be restricted by the
      accuracy and complexity of the TCP source model used.  Reference  [5]
      uses  the  VC's  ACR  at  an end-system, and a simple queue-threshold
      based scheme to decide when  to  release  the  acknowledgements.  The
      queue-threshold  based  scheme  is  needed to compensate for the end-
      system queues which may accumulate. The fairness of the scheme is not
      clear  and  the  choice of the thresholds should be done carefully to
      avoid buffer overflows. Our approach is different in that  it  simply
      clocks  out  the  acks  at the explicit rate calculated by the ERICA+
      scheme [3,16]. The ERICA+ scheme  controls  the  queueing  delay  and
      calculates  a  fair  allocation  for  the contending sources. The ack
      bucket simply enforces the  fair  allocations.   We  also  study  the
      combination  of this scheme with the one of the TCP explicit feedback
      mechanisms.

      In the public Internet where writing to TCP  headers  may  not  be  a
      viable  option,  it  may  be possible to use rate-based techniques to
      control mechanisms scheduling, buffering  and  drop  policies.  These
      combinations will be studied in the future.


      2. TCP congestion control evolution
      -----------------------------------

      The TCP congestion  control  algorithm  [6]  is  used  to  provide  a
      reliable  end-to-end  transport  service.  The TCP congestion control
      protocol is extremely  robust  and  has  been  shown  to  work  under
      pathological  conditions. However it has certain well-known drawbacks
      relating to its performance, some of which are enumerated below:
      - Unfairness (a connection which sends more gets a larger window)
      - Little control over end-to-end delay characteristics
      - Late congestion detection (due to the implicit packet-loss-based
        detection technique)
      - Degraded throughput due to bursty loss of packets (due to triggering
        of the long TCP timeout)
      - Traffic synchronization effects
      - Burstiness of traffic depending upon reverse-path congestion.
      - Long ramp up times in long-delay bandwidth paths.

      The  Internet  Engineering  Task   Force   (IETF)   and   independent
      researchers   have  proposed  several  improvements  to  TCP/IP-based
      control at the transport  and  network  layers.   Proposed  transport
      layer  enhancements  include  the  fast retransmit and recovery (FRR)
      algorithms [7], selective acknowledgments  (SACK)  [8]  and  Explicit
      Congestion  Notification  (ECN,  one bit explicit feedback) [9].  The
      former enhancements (FRR, SACK) aim at improved  throughput,  timeout
      avoidance  and  quick  recovery  from burst losses, while ECN aims to
      improve   fairness   and   delay   characteristics   experienced   by
      connections.  Recently,  proposals to improve the startup dynamics of
      TCP (esp. for satellite links) have been  seen  [10].  Network  layer
      enhancements  are  aimed  at improving fairness and throughput. These
      include mechanisms  like  scheduling  [11]  and  and  packet  discard
      policies [12,13].

      These enhancements have been found to improve TCP  throughput,  avoid
      the  expensive  TCP  timeout,  and  react  well  during burst loss of
      packets. However, providing some form of fairness and  delay  control
      still  remain fairly open problems, though the ECN [9] proposal seeks
      to partially address it. Delay and fairness are two  important  user-
      perceived  metrics  in  WWW  performance  which would justify current
      interest in these metrics. Since  the  release  of  the  ATM  Traffic
      management  specification  4.0 [14] it has been realized that some of
      the ideas from ABR flow control could be applied to improve the delay
      and   fairness  characteristics  of  TCP.  Specifically,  ABR  switch
      algorithms like ERICA+  have  demonstrated  fairness  and  controlled
      delay characteristics using explicit rate feedback.

      In this contribution, we study the feasibility of  achieving  similar
      goals  in TCP (high throughput, fairness and low delay) by using rate
      control.  The  first  technique  we  use  involves  calculating  rate
      feedback just as in an ABR switch algorithm, but translating the rate
      into a window value to give feedback. This  is  a  direct  method  of
      controlling  the  rate  of  the  TCP source. We also consider another
      method of controlling TCP rate which is a variant of recent proposals
      [4,5]  called  the "acknowledgment bucket" (ack-bucket) scheme.  This
      method exploits the fact that TCP rate is controlled by the  rate  of
      receipt  of  its  acknowledgments.  More  accurately,  when the TCP's
      window size is fixed or  slowly  varying  (eg:  congestion  avoidance
      phase),  the  rate  of  packets sent into the network is equal to the
      rate of receipt of acknowledgments. The ack bucket  method  can  slow
      down  but  not  prevent  the  increase  of TCP window values. The TCP
      window controls the sum of unacknowledged  segments  in  the  forward
      path  and  the acknowledgements in the reverse path. We note that our
      explicit method of controlling TCP window can be used in  conjunction
      with  the  ack  bucket method. We study this issue further in section
      4.1.1.


      3. Procedure to Control TCP rate:
      -------------------------------

      Given  a  rate  r  calculated  by  a  max-min  fair   scheme   at   a
      switch/router/atm  end  system,  our goal is to make TCP send data at
      rate r. Our approach is to convert r into to  a  window  value  W  by
      multiplying  it  by  a time quantity representative of the round trip
      time, T.

      W = r*T

      We will see later (section 4.2.1) that  there  are  several  possible
      ways  of  choosing  T.  The  rate  r can be calculated using a switch
      algorithm like ERICA+ [16], or can be gotten by simply using ACR  (of
      an  ABR  VC).   This  window  value  W  is then used to limit the TCP
      congestion window variable. The feedback is given  via  the  receiver
      window field (Wr) in the TCP header [15]:

      Wr_new = Min (Wr_old, W)

      Further, the TCP checksum needs to be adjusted as follows:
      Delta = Wr_new - Wr_old     (one's complement subtraction)
      TCP_checksum = TCP_checksum + Delta    (one's complement addition)

      Note that the header modification does not violate the  TCP  protocol
      and  as  a result requires no standardization. Also note that even if
      the TCP packet is fragmented across several IP datagrams, the  window
      update  and  checksum  adjustment  needs  to  be  made  only in those
      fragments where the TCP header is  available.  Also  this  adjustment
      must  be  applied to every acknowledgment seen (including piggybacked
      acknowledgments), though the rate, r (if it is computed at the  node)
      need  not  be computed on a per-packet basis. Though we maintain some
      per-flow state in our method of rate computation, one could expect to
      build algorithms in the future with very little or no per-flow state.
      It may also be possible for routers/edge devices to  control  several
      TCP  flows  as  a  single  unit  (for  example,  all  flows from some
      particular domain can be rate controlled as a single unit, albeit  at
      the  loss  of accuracy of control). However, note that there are also
      some situations where the technique cannot be directly applied, e.g.,
      if  the IP payload is encrypted (e.g. using IPSEC), or authenticated.
      Also, since every acknowledgement needs to  be  marked,  route  flaps
      (due  to instability in routing algorithms) can result in undesirable
      oscillations.


      4. TCP Dynamics with Rate Control:
      ----------------------------------

      TCP uses window control to send packets into the network. In  general
      if  X(t)  packets  are  sent into the network in time interval t, the
      rate of the connection is dX(t)/dt. In the  case  of  window  control
      with a fixed size window, W, the rate of the connection in the steady
      state is equal to the rate of receipt of acknowledgments  (acks).  In
      the  case  of  TCP,  the window size increases by either 1 (or 1/W(t)
      during congestion avoidance) for the receipt of  every  non-duplicate
      ack.  As  a  result,  the  rate of the TCP connection in every RTT is
      either 2*W/T (W/T + 1) in every round trip time T,  where  W  is  the
      window size at the end of the previous RTT. At the end of the current
      RTT, the new window size is 2*W (or W+1).

      In other words, the TCP rate in the worst case may grow exponentially
      in  successive  RTTs.  Also  note that, when the TCP source is in its
      startup phase, its window size is small, and as  a  result  does  not
      receive  acknowledgments regularly - they are received in bursts. The
      rate of TCP as calculated above is the average rate over a period  of
      RTT.  The startup TCP rate measured over small intervals (e.g. at the
      router) could be much higher (during the burst) or lower (during idle
      periods)  but  should equal the average rate once the acknowledgments
      are received regularly (the "pipe"  is  filled).  Once  the  pipe  is
      filled,  the  ack  rate  could  still be irregular depending upon the
      queuing dynamics inside the network, and use of delay  ack  timer  at
      the  destination.  However,  for our discussion, we shall not involve
      the effect of these latter factors.

      In the following sections we will considering two ways of controlling
      the  TCP  rate:  a)  by  limiting  the maximum window size of the TCP
      source, and b) by controlling the rate  of  acknowledgments  (without
      explicitly  writing  to  the  window  field).  We  will also consider
      combinations of these two methods.


      4.1 TCP Dynamics with Explicit Feedback Control
      -----------------------------------------------

      Firstly, we recall that our the window adjustment technique described
      in  section  3  controls only the maximum receiver window (Wr), which
      acts as an upper bound on the TCP congestion window  (CWND)  variable
      at  the source. The method would not have any effect if a) CWND < Wr,
      or b) Wr < W  (feedback  value  larger  than  receiver's  max  window
      itself).  Note  that while case b) is a permanent condition where the
      feedback calculated is ineffective, case a) can be assumed  to  be  a
      transient condition.

      Specifically,  in  case  a),  CWND  is  initially  small  (slow-start
      increase). But, once CWND > Wr, the window available to the source is
      Wr. In this state, sudden increases in Wr (due to fluctuations in the
      available  rate  r)  may  result  in a sudden burst of packets in the
      network. Similarly a sudden decrease in Wr may result in an  extended
      idle  period  (because  the source is waiting for the acknowledgments
      for excess packets sent when the window was higher).  Even if  Wr  is
      constant  (r  is  in the steady state), the rate of packets sent into
      the network depends upon the rate at which  the  acknowledgments  are
      received.  The  rate  of acknowledgments depends upon factors such as
      the TCP source rate, the queue lengths  (which  may  fluctuate),  and
      delay  ack timer at the TCP destination (we ignore ack bucket control
      for this discussion). As a result the effective rate of  TCP  depends
      both  upon the changes in the maximum receiver window Wr (affected by
      our scheme), and the rate of acknowledgments received (dependent upon
      queuing dynamics, also partially dependent upon our scheme).

      We describe two methods of controlling the receiver window  Wr  based
      upon a rate r.


      4.1.1 Two methods for Translating Rate Feedback into Window Feedback
      --------------------------------------------------------------------

      In this section, we consider two ways of choosing T, the  time  value
      used  to  translate  r  to W. One approach is to treat the advertised
      rate value as an average allocation over the  period  of  the  flow's
      round-trip  delay, Ti, and calculate the window as the product of the
      rate, r, and the flow's round trip delay, Ti. The second approach  is
      advertise a window value based upon a single time estimate T.


      The first approach, called "translation using individual source  RTT"
      requires  the  round  trip  delays  of individual sources. The window
      value given as feedback indicates the ideal  delay-bandwidth  product
      per  connection,  where the delay is Ti, and the bandwidth is r. Once
      the sources' congestion window values reach r*Ti,  and  the  flow  of
      acks  back  to  the  source  is  regular,  the allocation is fair and
      efficient. Note however that the rate fluctuations in r, would result
      in  magnified  window  changes  in  larger  RTT  connections.  So, as
      described in section 5 we use a conservative  increase  technique  to
      avoid  such  burstiness.  We show using simulations in section 6 that
      this scheme can achieve fair and efficient allocations, with  minimal
      buffer  requirements and queuing delays.  As a final point, note that
      one way of determining round trip time of a  connection  is  to  send
      ICMP echo message to both source and destination.

      The second approach, called "translation using a  fixed-RTT,"  is  to
      advertise  a  single  window  value based upon a time estimate T. The
      simplest approach is to choose  a  constant  T.  Our  simulations  in
      section  6  show  that  while  this method can control delay to small
      values, it does not eliminate  unfairness.  However  its  worst  case
      performance  in  terms of fairness is better than vanilla TCP in that
      it allows  every  connection  to  get  a  non-trivial  share  of  the
      bandwidth (though not necessarily the fair share).

      The reason  for  the  unfairness  in  the  fixed-RTT  translation  is
      explained  as  follows. We are choosing a single value T to translate
      the rate feedback for heterogeneous RTT connections. If T < Max  RTT,
      the  largest RTT source i is limited to a window of r*T which is less
      than r* (Max RTT). As a result, source i cannot  send  at  a  rate  r
      continuously  since  the  acks cannot be sustained at a rate r over a
      round trip time of Max RTT.  If T >= Max RTT, the above  problem  can
      be  overcome,  i.e.,  it  is  possible  to  achieve a state where the
      largest RTT connection i finally gets acks back at a rate r. However,
      the  distribution  can  still  be  unfair  because  of  the following
      situation. All sources now get a fixed window size r*T, but a smaller
      RTT  connection  j  could transmit its entire window (r*T) within its
      RTT (RTTj), i.e. it sends traffic at a rate  r*T/RTTj.  However,  the
      largest  RTT connection i in this state can achieve a maximum rate of
      only r*T/RTTi = r*T/(Max RTT), which is  smaller  than  the  rate  of
      connection i, an unfair situation.

      We note that for this latter case i.e., T >= Max RTT,  a  simple  ack
      bucket  scheme  can be used in conjunction with the explicit feedback
      scheme to achieve fairness.  Specifically,  consider  an  ack  bucket
      scheme  enforces the rate of acks to be equal to r which augments the
      feedback scheme and the window values of sources Wi, Wj  are  greater
      than  r*Max  RTT. We now have a condition where the ack bucket scheme
      is working with fixed  window  sources,  and  regular  ack  flow.  As
      explained  later in section 4.3, this can lead to a fair distribution
      of rates. As a final point, note that  the  buffer  requirements  for
      this  method  (Buffer  = Sum (r*Max Ti)) will be larger than that for
      the first method, where we use individual RTT values Ti for  feedback
      calculation  (Buffer  =  Sum  (r*Ti)).   Now the fairness achieved by
      using the acknowledgement bucket in  addition  to  explicit  feedback
      leads us to investigate the former in its own right.

      4.3 TCP Dynamics with Ack Bucket Control
      ----------------------------------------

      The TCP dynamics under ack-bucket control can be understood by  using
      a  simple  ack-bucket  which,  when given a rate r, releases ("clocks
      out") the acks at a rate r. Assume that  there  are  two  connections
      passing  through  the  router  which  each get a rate r, and that one
      connection has a small RTT (RTT1) and the other a large  RTT  (RTT2).
      Further,  assume that both connections currently have the same window
      size W such that r*RTT2 >> W >> r*RTT1.  In this state, connection  2
      is  idle  most  of  the  time, whereas connection 1 is receiving acks
      continuously. If both connections are  in  the  exponential  increase
      phase, connection 1 would increase its window at rate r continuously,
      whereas connection 2 would increase its window at  a  rate  r  for  a
      period W/r for every round trip (RTT2). Hence connection 1 sends more
      packets  through  than  connection  2  during  this  transient  phase
      (temporary unfairness).

      Now once the long RTT pipe is  filled  (i.e.  connection  2's  window
      grows  to r*RTT2), it gets acks regularly and increases its window at
      a rate r (like connection 1). Since both connections  are  increasing
      their rate of transmission exponentially, the queue increases at rate
      2r. An additional mechanism is necessary to quell the  queue  growth.
      In general, controlling the rate of acks to the ACR at the end system
      is insufficient, because it does not take  into  account  the  source
      queueing  point.  A separate algorithm is necessary to take that into
      account.

      Reference [5] proposes a queue threshold based algorithm  to  control
      the  ack  release  rate  (basically  a  simple  ABR switch congestion
      control  algorithm)  which  reacts  upon  queue  length  crossing   a
      threshold  by  cutting down the rate of acks by a constant factor, H.
      Observe that it is necessary that H >= 2 to guarantee stop  of  queue
      growth in the above scenario. In general, the correct rate under such
      conditions can  be  better  calculated  by  a  explicit  rate  switch
      algorithm  like  ERICA+  [16]. Hence, our proposal is to use a simple
      ack-bucket, where the acks are clocked out at the rate calculated  by
      an  explicit  rate algorithm like ERICA+. We demonstrate in section 6
      that this method can result in fair allocations.

      Observe, however, that the simple ack-bucket scheme (without  ERICA+)
      can  achieve fairness if the TCP source windows are kept constant and
      the window of source i, Wi, is no less  than  r*RTT(i).  Under  these
      conditions, acks return continuously to both sources (irrespective of
      RTT) at rate r, and since the window Wi is constant, each ack  simply
      clocks  a  new  packet  into the network. In other words, the rate of
      both connections is now r, a fair distribution. The ack bucket scheme
      would also be near-fair in the above situation if the TCP sources are
      in their  congestion  avoidance  phase  where  the  window  increases
      slowly.  One  of  our  proposals  here  involves showing how a simple
      explicit feedback scheme (the fixed-RTT method described  in  section
      4.1.1)  can  augment  the  ack-bucket  scheme  to maintain a constant
      window at TCP sources (and as a result achieve fairness).


      5. Simplified ERICA algorithm for Explicit Rate Calculation
      -----------------------------------------------------------

      We  use  the  ERICA+  algorithm  as  described  in  [16,3]  for   our
      simulations.  This  is  a  well-known  algorithm  in  the ATM Traffic
      Management community, which achieves  the  goals  of  fairness,  high
      utilization,  and  low delay using rate-based control. The key change
      we make to the algorithm is to limit rate increases  to  10%  of  its
      previous  value.  This step, which is similar to the limit imposed by
      the RIF parameter, is done to prevent burstiness which can result due
      to corresponding window changes at the source (see section 4.1.  Rate
      decreases are not limited. The source  rate  of  each  connection  is
      measured  at  the  end  of an averaging interval and the new feedback
      rate is calculated at that time.  All of the calculations are done at
      the  end  of  the averaging interval; no rate updates are made in the
      middle of averaging intervals. The resulting algorithm is:


      Initialization:
      ===============

      FairShare = 0;
      MaxAllocPrevious =  MaxAllocCurrent = FairShare;

      End of Interval:
      ===============

      Measure Total Available Capacity
      Measure Input Rate
      Target Capacity <- f(Q)* Total Available Capacity.
      z <- Input Rate/Target Capacity
      Measure number of active sources
      FairShare = Target Capacity/Number of active sources

      FOR each session i DO
              Rate[i] = Num Packets In Interval [i]/Interval Length
              VCShare <-  Rate[i]/z
              IF (z > 1+ delta) ER[i] = Min(Max(VCShare, FairShare), 1.1*ER[i])
              ELSE ER[i] = Min(Max (MaxAllocPrevious, VCShare), 1.1*ER[i])
              MaxAllocCurrent = Max (MaxAllocCurrent,ER[i])
              IF (ER[i] > FairShare AND Rate[i] < FairShare)
              THEN ER <- FairShare
      ENDFOR
      MaxAllocPrevious = MaxAllocCurrent
      MaxAllocCurrent = FairShare

      TCP Acknowledgment Processing:
      ==============================

      Wr_new <- Min(Wr_old, ER[i]*T[i])    // T[i] is either a fixed common value,
                                           // or a per-connection value
      Delta = Wr_new - Wr_old              // one's complement subtraction
      TCP_checksum = TCP_checksum + Delta   //one's complement addition

      Note that other simplifications to ERICA+  are  possible,  especially
      for  multicast  configurations  where  it is undesirable to measure N
      [17].

      6. Simulation Results
      ----------------------

      The simulations were run  using  the  following  parameters.  For  an
      explanation of ERICA+ parameter values, please refer [16]

      TABLE 1. Parameters for Simulation
      ==================================
      Parameter                               Value
      ========                                =====
      Simulation time                         5 s
      Switch Avging Interval (ERICA+)         5 ms
      Target Queueing Delay  (ERICA+: T0)      1.5 ms
      Queue Drain Limit Factor (ERICA+: QDLF)  0.5
      a (ERICA+)                              1.15
      b (ERICA+)                              1
      Switch RTT estimate                     30 ms
      Link speed                              155.52 Mbps
      Maximum Segment Size (TCP: MSS)         1024 bytes
      SSThresh (TCP)                          600000 bytes


      6.1 Simulation Results with a Heterogeneous RTT Configuration:
      -------------------------------------------------------------


      We use the configuration shown in Figure 1 for studying heterogeneous
      RTT  cases.  The  TCP  sources  named STCP #1..15 are the sources for
      packets and the TCP connection objects  names  DTCP  #1..15  are  the
      corresponding  sinks for the TCP sources. There are three groups of 5
      sources each. The first group has access links of 3 km  each  (or  an
      RTT  of 2012 km = 10 ms). The second group has access links of 300 km
      each (or an RTT of 3200km = 16ms). The third group has  access  links
      of 3000km (or an RTT of 14000 km = 70 ms).


       ----------                                             -----------
      |STCP 1-5  |                                          /| DTCP 1-5  |
       ---------- \3 km          (1000 km)            3 km/  -----------
                   \     --------  LINK1  -------- ______/         -------------
       ----------    ----|  R1    |-------|  R2    |________300km__|  DTCP 5..10 |
      |STCP 5-10 |-300km-|        |       |        |_______         -------------
       ----------  /-----  --------         --------        \      .
                  /3000km                                   \3000km
       ----------                                            \------------
      |STCP 10-15 |                                           | DTCP 10-15 |
       ----------                                              ------------


               Figure 1: The NxN Heterogeneous RTT Configuration
               -------------------------------------------------


      We conduct five experiments with this configuration:

      - In the first experiment (Figure 2) routers do not use explicit
        feedback (also called vanilla TCP).

      - In the second experiment (Figure 3), the routers use the fixed-RTT
        translation method to calculate the window feedback from the rate
        feedback for all sources. The fixed value chosen is 70 ms, which is
        the maximum RTT of any connection in this configuration.

      - In the third experiment (Figure 4) we use the translation using
        individual source RTTs. We assume that the routers  have  knowledge
      of
        the round trip delays, Ti for each of  the  sources  and  use  that
      value
        to translate rate feedback to window feedback.

      - In the fourth experiment (Figure 5) we use the ack-bucket scheme
        which simply clocks out the rate of acks based upon the rate
        calculated by ERICA+.

      - In the fifth experiment (Figure 6), we combine the fixed-RTT
        translation with the ack-bucket scheme.


      (figure 2: see postscript file)


      6.1.1 Performance of rate-to-window translation techniques
      -----------------------------------------------------------

      Figure 2(a) shows the TCP source's "actual window size" which is  the
      minimum  of the TCP congestion window (CWND), and the receiver window
      (Wr). Since the receiver window is constant at its initial  value  of
      600000 (no TCP rate control), this graph is no different from a graph
      of CWND vs  time.  The  window  sizes  increase  exponentially  (each
      increase  spurt  is  twice  the  size  of the previous one). However,
      between each increase spurt, there is a time gap which indicates that
      the  source  is  waiting  for  new  acknowledgments after sending the
      current spurt of traffic. This time the gap  increases  exponentially
      as  well  because  the  first packet of the new spurt needs to travel
      though a larger queue to reach its destination.

      Figure 2(b) and 2(c) show the  queue  length,  and  link  utilization
      respectively,  while figure 2(d) shows sender sequence numbers of the
      various TCP sources. The round trip times measured at the source (not
      shown)   increase   (due   to  queue  buildup  in  the  network)  and
      correspondingly slow the window increase  by  reducing  the  rate  of
      acknowledgments  returned  to  the  sender. The queue lengths reach a
      maximum of about 8000 packets and round trip times  measured  at  the
      source  (not shown) went up to 600 ms. These quantities stabilized at
      this point as the sources reached the SSTHRESH (and receiver  window)
      value of 600000 bytes and the congestion window increases no further.

      (figure 3: see postscript file)

      Figure 3 shows a simulation of the same configuration where  explicit
      feedback  control  is used.  Notably, figure 3(b) shows queue lengths
      which oscillate  between  about  600  and  750  packets  (an  earlier
      experiment  with  fixed  RTT  =  30 ms showed even lower queues). The
      actual window size shown  in  figure  3(a)  is  now  limited  by  the
      receiver window value (feedback) and not the actual congestion window
      value. Observe that the actual window size oscillates  between  60000
      and  80000  bytes  (with  RTT = 30ms, the values were 30000 and 40000
      bytes).

      The queue length oscillation and the actual window  size  oscillation
      are  related  due  to two reasons: a) ERICA+ is sensitive to queueing
      delays above the T0 parameter [16] (1.5  ms),  and  b)  the  rate  of
      acknowledgments  which  determines  rate of change of window size and
      introduction of new packets is controlled by  current  queue  levels.
      The  queue  oscillations indicate that the rate-to-window translation
      mechanism cannot fully control the rate of  the  TCP  sources  (other
      factors  like  the  rate of acks also play a role). However the delay
      performance (and buffer requirements) of the system display a  marked
      improvement  with  rate  feedback. Also note that the sequence number
      increase in figure 3(d) is much more smoother, which means  that  the
      burstiness of traffic is reduced.

      The graphs in Figure 3 show that the distribution of rate  is  unfair
      (see  slopes of curves in figure 3(d)). We show in a later experiment
      that fairness can be achieved by adding ack-bucket  control  to  this
      method.   Note  however,  that  if  a  smaller  RTT were used for the
      translation (eg fixed RTT = 30 ms), even the addition  of  ack-bucket
      control does not achieve fairness because the long RTT sources cannot
      sustain the desired rate in an interval of length max RTT $=$ 70 ms).


      However note that this distribution, though unfair is an  improvement
      over  the  TCP worse case where a connection can be almost completely
      starved. The overall utilization (figure 3(c)) is  high  because  the
      smaller  RTT  connection  grabs  the  available capacity. Queue sizes
      (figure 3(b)) are controlled due to  the  queue  control  feature  of
      ERICA+  [3,16]  and  the  fact  that  window  sizes  are limited. The
      oscillations in actual window sizes (figure 3(a)) and  queue  lengths
      (figure 3(b)) which are interrelated by the rate of ACKs returning to
      the sources. We will examine combinations of ack  bucket  and  single
      time  value based translation in future contributions, which can help
      smooth these oscillations and improve fairness.


      (figure 4: see postscript file)

      Figure  4  shows  results  with  the  translation  scheme  where  the
      individual RTT (Ti) is used instead of a single value T.  As expected
      the allocations are fair (see figure 4(d)),  and  the  actual  window
      sizes  (figure  4(a)) are proportional to their RTTs. Queue sizes are
      also small (figure 4(b)), and utilization  is  not  compromised  much
      (figure 4(c)).


      6.1.2 Performance of the Ack-bucket techniques
      ----------------------------------------------

      (figure 5: see postscript file)

      Figure 5 shows results where  the  ERICA+  algorithm  calculates  the
      explicit  rates,  but the acks are clocked out at the rate calculated
      by ERICA+, bypassing the rate-to-window translation.  The performance
      is  excellent,  as  expected  from the discussion in section 4.3. The
      ERICA+  algorithm  calculates  the  rate  based  upon  not  only  the
      available  capacity,  but  also the input rates and queue lengths. As
      discussed in section 4.3, this allows the  ack-bucket  scheme  to  be
      fair,  efficient  and control queues. Specifically, the queue lengths
      in figure 5(b) are controlled to a low value of 50 packets, while the
      sequence  number plot in figure 5(d) show the fair distribution. Note
      however that the window sizes are not controlled i.e., the congestion
      window  is  the same as the actual window, and they rise to a maximum
      of 600000 bytes.

      Observe that the congestion window size is the sum of the packets  in
      the  forward  direction (controlled to small values by the ack-bucket
      method  with  ERICA+),  and  the  acknowledgements  in  the   reverse
      direction   (which  is  large,  but  is  stored  efficiently  at  the
      intermediate router - only the ack numbers are  stored  in  an  array
      [4,5]).  Note  that  piggybacked  acknowledgements  cannot  be stored
      (because the entire packet needs to be stored).  In  this  case,  the
      latest  acknowledgement number is read from the packet into the array
      (the ack-bucket), and the last acknowledgement  number  sent  to  the
      source  is  overwrites the acknowledgement number field. The checksum
      needs to be recalculated in this case as described in section 4. Note
      that a duplicate piggybacked acknowledgement is not considered by TCP
      as  a  duplicate  acknowledgement  and  will  not  trigger  the  fast
      retransmit/fast  recovery  algorithms.  Note  this  this method would
      share some of the problems of explicitly writing  the  window  field,
      viz.,  cannot work with IPSEC or authentication, requires all acks to
      flow in the same path as the forward data flow.

      The problem of storing large number of acknowledgement numbers in the
      router   can   be  controlled  by  combining  the  two  schemes  i.e,
      translation of rate-to-window using a fixed RTT, and controlling  the
      acks using the acknowledgement bucket with ERICA+. Figure 6 shows the
      result  of  our  next  experiment  which  combines  the   fixed   RTT
      translation scheme and the ack-bucket control with ERICA+.

      (figure 6: see postscript file)

      Notably, the unfairness problem in  the  fixed-RTT  translation-alone
      scheme  is solved (figure 6(d)). Further, the drawback of ack-bucket,
      viz. the high congestion window values, is also taken care of (figure
      6(a)). The queue lengths oscillate (figure 6(b)), but are kept to low
      levels, while the average utilization  is  high  despite  some  drops
      (figure 6(c)).


      6.2 Performance of the Ack-bucket techniques
      ----------------------------------------------

      (figure 7: see postscript file)

      (figure 8: see postscript file)

      Simulations were performed on the GFC-2 configuration (Figure 7,  see
      also [16]) to test the fairness of the different translation methods.
      The results are shown in figure 8. Figure 8(a) shows the distribution
      of  sender sequence numbers for the case when no explicit feedback is
      used. While not much unfairness is seen, the queues (not shown)  grow
      to  large  levels.  Figure 8(b) shows the situation using a fixed-RTT
      translation (RTT = 130 ms, the largest RTT). The distribution is  not
      fair,  though the queue lengths are low (not shown). Utilizations are
      also somewhat lower in this case (not shown). However, the case where
      the  translation  is based on actual RTTs (figure 8(c)) shows a near-
      fair distribution. Queue lengths are low and  utilizations  are  high
      too   (not   shown).  Figure  8(d)  shows  the  case  for  using  the
      acknowledgement bucket, and figure 8(e) shows the combination of  the
      fixed-RTT  translation  scheme  and the ack-bucket. These achieve the
      fair rates too.

      7. Summary and Conclusions
      -------------------------


      We have proposed a new method  of  conveying  rate  feedback  to  TCP
      sources,  by  translating  it  into  a  window  value and writing the
      feedback in TCP header  of  acknowledgments.  This  method  does  not
      require any changes to TCP header formats or the protocol, and can be
      implemented in both ATM edge devices/switches, as well as in Internet
      routers. We have also considered a variant of a known method for rate
      control, called the acknowledgement bucket which controls the rate of
      TCP connections by controlling the rate of the acknowledgements.

      Our key contributions in this work  are  to  demonstrate  that  rate-
      controlled   TCP  sources  can  achieve  fairness  and  better  delay
      characteristics in addition to high throughput performance.  Further,
      we have studied the dynamics of rate-controlled TCP extensively using
      both our method of explicit feedback and  the  method  of  ack-bucket
      control.  Future  work  in this direction will include simpler switch
      algorithms, combination of simple rate-to-window translation  schemes
      with  ack  bucket  control schemes, and studies with limited buffer &
      drop policies such as RED, EPD  etc.  Translating  the  rate  into  a
      single bit feedback through ECN will also be considered.


      8. References:
      ---------------

      [1] Shiv Kalyanaraman, Raj Jain, Sonia Fahmy, Rohit Goyal,
         "Performance and Buffering Requirements of Internet Protocols over
         ATM ABR and UBR Services," to appear IEEE Communications Magazine,
         March/April 1998.

      [2] H. Li, K.Y. Siu, H.T. Tzeng, C.Ikeda and H. Suzuki "A simulation
          study of TCP performance over ABR and UBR service in ATM
          networks," Proceedings of INFOCOM'96.

      [3] S. Kalyanaraman, "Traffic Management for the Available Bit Rate
          (ABR) Service in Asynchronous Transfer Mode (ATM) networks"
          Ph.D. Dissertation, Dept. of Computer and Information Sciences,
          The Ohio State University, August 1997.

      [4] Paolo Narvaez and Kai-Yeung Siu, "An Acknowledgment Bucket Scheme
          for Regulating TCP Flow over ATM," to appear in Computer Networks
          and ISDN Systems Special issue on ATM Traffic Management, 1998.

      [5] Arata Koike, "TCP flow control with ACR information," ATM
          Forum/97-0998, December 1997.

      [6]  V. Jacobson, "Congestion Avoidance and Control," Proceedings  of
      the SIGCOMM'88 Symposium, pp. 314-32, August 1988.

      [7] Stevens, W. R., "TCP Slow Start, Congestion Avoidance, Fast
           Retransmit, and Fast Recovery Algorithms," January 1997.

      [8] Mathis, M., J. Mahdavi, S.  Floyd,  A.  Romanow,  "TCP  Selective
      Acknowledgement Options," Internet RFC 2018, October 1996.

      [9] Ramakrishnan, K.K., Floyd, S., "A proposal to add Explicit
          Congestion Notification (ECN) to IPv6 and to TCP," IETF Internet
          Draft, November 1997, Available as
          http://ds.internic.net/internet-drafts/draft-kksjf-ecn-00.txt

      [10] Floyd, S., Allman, M., and Partidge, C., "Increasing TCP's
          Initial Window," IETF Internet Draft, July 1997, Available as
          http://ds.internic.net/internet-drafts/draft-floyd-incr-init-
      win-00.txt

      [11] Demers, A., S. Keshav, and S. Shenker,
         "Analysis and Simulation of a Fair Queueing Algorithm,"
         Internetworking: Research and Experience,  Vol. 1, 1990, pp. 3-26.

      [12] Romanow, A., and S. Floyd, "Dynamics of TCP Traffic over ATM
         Networks," IEEE Journal on Selected Areas in Communications,
         Vol. 13, No. 4, May 1996.

      [13] Floyd, S., and V. Jacobson, "Random Early Detection Gateways for
         Congestion Avoidance," IEEE/ACM Transactions on  Networking,  Vol.
      1,
         No. 4, August 1993, pp.397-413.

      [14] ATM Forum Traffic Management, "The ATM Forum Traffic Management
         Specification Version 4.0," April 1996.  Available as
         ftp://ftp.atmforum.com/pub/approved-specs/af-tm-0056.000.ps

      [15] W.R. Stevens, "TCP/IP Illustrated, Volume 1," Addison-Wesley,
          1994.

      [16]  R.  Jain,  S.  Kalyanaraman,  R.  Goyal,  S.  Fahmy,   and   R.
      Viswanathan,
           "The ERICA Switch Algorithm for ABR Traffic Management in ATM
            Networks" IEEE Transactions on Networking, submitted,  November
      1997.

      [17] Sonia Fahmy, Raj Jain, Rohit Goyal, Bobby Vandalore, "A switch
           algorithm for ABR multipoint-to-point connections," ATM
           Forum/97-1085, December 1997,

      All our papers and ATM  Forum  contributions  are  available  through
      http://www.cis.ohio-state.edu/~jain/