Language
Home
About Us
IT Outsourcing
Network
Call Center
Mail Service
Cases
Contacts

Case
Network Project
IT Outsourcing

Problem


We received email with from Randy Sobie expressing concerns over the data transfer rates from U Victoria in British Columbia, Canada to SLAC

Hi Les
We spoke a while ago about connectivity between SLAC and Victoria.
In the new year UVictoria will be hosting one of the babar working data sets (tau working group).
This will likely require the transfer of ~10 TB of data in Feb-Mar.Currently we can transfer data to SLAC (we produce babar MC samples) and we copy analysis data
to UVic at rates around 30 mbits/s using a BbkImport command of babar.
This command uses
bbftp and 10 streams.
We are wondering how we can improve the transfer rate. Our routing to SLAC from
Victoria (mercury.uvic.ca) is unfortunately not direct and we go to Calgary before entering the US.I have cc'ed this email to Ian Gable who is the HEPNET/Canada technical manager,
Thomas Tam of CANARE and Ashok Agarwal of UVic (Ashok does most of the data transfer).
We recently set up a lightpath between UVic and TRIUMF. Some of our colleagues in
Vancouver have a lightpath to Fermilab. I am told that a lightpath solution may not
be possible in the short-term.

We would welcome your feedback and suggestions on how we improve our connectivity to SLAC.
Regards, Randy

The hosts involved at SLAC were noric.slac.stanford.edu and bbr-xfer06.slac.stanford.edu and at UVic according to Ashok Agarwal the host is called mercury.uvic.ca. Ian Gable provided an account on phys02.comp.uvic.ca. The Norics and phys02.comp.uvic.ca have 1Gbits/s NICs, have iperf 2.0.2 with pthreads, and have the out of the box TCP window configurations:
########################################
# TCP windows/buffer information #
########################################
cat /proc/sys/net/core/wmem_max = 131071
;cat /proc/sys/net/core/rmem_max = 131071
;cat /proc/sys/net/core/rmem_default = 65535
;cat /proc/sys/net/core/wmem_default = 65535
;cat /proc/sys/net/ipv4/tcp_rmem = 4096 87380 174760
;cat /proc/sys/net/ipv4/tcp_wmem = 4096 16384 131072
;cat /proc/sys/net/ipv4/tcp_mem = 195584 196096 196608

The Norics are a cluster of dual 1.8GHz AMD Opteron cpu host running Linux 2.4.21-47.0.1.ELsmp #1 SMP. They are targeted at supporting interactive usage. Phys02.comp.uvic.ca is a dual 3.2GHz Intel Xeon cpu host running Linux 2.4.21-32.ELsmp #1 SMP. Mercury is actually two hosts nercury2 and mercury5.uvic.ca. They are 3.2 GHz hosts running Linux 2.4.21-32.0.1.ELsmp #1 SMP and according to Brian Fitzgerald [bfitz@uvic.ca] they have 1 GE NICs.
Traceroutes
The traceroute from SLAC to UVic and pinging along the route appears as:
Matt's traceroute [v0.52]
iepm-bw Thu Nov 30 17:10:08 2006
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. rtr-iepm-test.slac.stanford.edu 0% 1336 1336 0 0 0 46
2. rtr-core1-p2p-iepm.slac.stanford.edu 0% 1336 1336 0 0 0 29
3. rtr-dmz1-ger.slac.stanford.edu 0% 1336 1336 0 0 1 199
4. slac-rt4.es.net 0% 1336 1336 0 0 1 41
5. slacmr1-slacrt4.es.net 0% 1335 1335 0 0 8 243
6. snv2mr1-slacmr1.es.net 0% 1335 1335 0 0 10 352
7. snv2sdn1-snv2mr1.es.net 0% 1335 1335 0 0 1 41
8. seattlesdn1-snv2sdn1.es.net 0% 1335 1335 18 18 18 38
9. canet-1-lo-jmb-706.sttlwa.pacificwave.net 0% 1335 1335 40 40 42 245
10. c4-bcnet.canet4.net 0% 1335 1335 61 61 62 224
11. UVICA-ORAN.BC.net 1% 1334 1335 63 63 64 286
12. csc1cled050.bb.uvic.ca 1% 1334 1335 63 63 64 120
13. dmc2ell005.bb.uvic.ca 1% 1334 1335 63 63 64 130
14. ???

Packet loss is better than 0.1% (1 packet lost in 1334).
The traceroute from UVic to SLAC is:

[cottrell@phys02 cottrell]$ traceroute noric.slac.stanford.edu
traceroute to noric05.slac.stanford.edu (134.79.86.55), 30 hops max, 38 byte packets
1 142.104.21.62 (142.104.21.62) 0.334 ms 0.175 ms 0.166 ms
2 csc1cled050.bb.uvic.ca (142.104.252.165) 0.251 ms 0.272 ms 0.219 ms
3 emc1cled050.bb.uvic.ca (142.104.252.246) 0.381 ms 0.396 ms 0.361 ms
4 gigapop-ORAN.hc.BC.net (207.23.240.177) 2.522 ms 2.537 ms 2.515 ms
5 c4-cal01.canet4.net (205.189.32.194) 23.250 ms 23.419 ms 23.047 ms
6 esnet-1-lo-jmb-706.sttlwa.pacificwave.net (207.231.240.13) 45.533 ms 45.358 ms 45.241 ms
7 snv2sdn1-seattlesdn1.es.net (134.55.218.26) 63.138 ms 63.169 ms 63.056 ms
8 snv2mr1-snv2sdn1.es.net (134.55.207.38) 63.440 ms 63.162 ms 62.998 ms
9 slacmr1-snv2mr1.es.net (134.55.217.1) 63.604 ms 63.513 ms 63.347 ms
10 slacrt4-slacmr1.es.net (134.55.209.94) 63.632 ms 63.688 ms 68.404 ms
11 rtr-dmz1-vlan400.slac.stanford.edu (192.68.191.149) 63.736 ms 63.531 ms 63.843 ms
12 * * *
13 noric05.slac.stanford.edu (134.79.86.55) 63.684 ms 63.520 ms 63.431 ms

Available Bandwidth
We used pathneck to look at the available bandwidth bottleneck from SLAC to UVic. Unfortunately as installed at SLAC pathneck requires root priviledges so I cannot execute it on the norics, and it is not compiled to run on a Solaris host. I therefore ran it from iepm-resp which is a two cpu Pentium 2 GHz with a 1GE NIC running Linux 2.4.21-47.0.1.EL #1. The results from iepm-resp to phys02.comp.uvic.ca are shown below:
1cottrell@iepm-resp:~>sudo /afs/slac.stanford.edu/package/netperf/bin/@sys/pathneck -x phys02.comp.uvic.ca
Kerberos Password:
1165080082.834362 142.104.21.13 500 60 0

00 0.305 134.79.243.1 844 rtr-iepm-test.slac.stanford.edu
01 0.279 134.79.252.5 823 rtr-core1-p2p-iepm.slac.stanford.edu
02 0.278 134.79.135.15 855 rtr-dmz1-ger.slac.stanford.edu
03 0.451 192.68.191.146 730 slac-rt4.es.net
04 0.639 134.55.209.93 534 slacmr1-slacrt4.es.net
05 0.925 134.55.217.2 622 snv2mr1-slacmr1.es.net
06 0.923 134.55.207.37 668 snv2sdn1-snv2mr1.es.net
07 18.375 134.55.218.25 751 seattlesdn1-snv2sdn1.es.net
08 40.681 207.231.240.3 699 canet-1-lo-jmb-706.sttlwa.pacificwave.net
09 61.308 205.189.32.193 643 c4-bcnet.canet4.net
10 63.354 207.23.240.189 584 UVICA-ORAN.BC.net
rtt = 63.696 ( 142.104.21.13 )

We also ran a pathneck to mercury.uvic.ca to see if it has a similar bottleneck (we were unsure what the NIC speed was on mercury) with following result:
5cottrell@iepm-resp:~>sudo /afs/slac.stanford.edu/package/netperf/bin/@sys/pathneck -x mercury.uvic.ca
1165080632.815491 142.104.193.217 500 60 0

00 0.336 134.79.243.1 789 rtr-iepm-test.slac.stanford.edu
01 0.246 134.79.252.5 818 rtr-core1-p2p-iepm.slac.stanford.edu
02 0.246 134.79.135.15 872 rtr-dmz1-ger.slac.stanford.edu
03 0.336 192.68.191.146 797 slac-rt4.es.net
04 0.502 134.55.209.93 615 slacmr1-slacrt4.es.net
05 0.887 134.55.217.2 634 snv2mr1-slacmr1.es.net
06 0.886 134.55.207.37 664 snv2sdn1-snv2mr1.es.net
07 18.330 134.55.218.25 760 seattlesdn1-snv2sdn1.es.net
08 40.638 207.231.240.3 731 canet-1-lo-jmb-706.sttlwa.pacificwave.net
09 61.247 205.189.32.193 816 c4-bcnet.canet4.net
10 63.165 207.23.240.189 788 UVICA-ORAN.BC.net
rtt = 63.663 ( 142.104.193.217 )

The columns are the hop number, the RTT, the IP address of the router and the time in microseconds in which the load packets are processed by that hop. The last column is the name of the node. By default the load packets are 33.3 KB. So, to calculate the dummy estimate of bandwidth, you need to divide 33300 by the longest time (844 microsecs for phys02.comp.uvic.ca). This yields a bottleneck available bandwidth of about 310 Mbits/s or 39MBytes/s.
Achievable Network Throughput
For a Round Trip Time (RTT) of about 65 msec and a 100Mbits/s bottleneck the bandwidth * delay product is about 800KBytes. Given that the hosts at SLAC and UVic restrict the windows to 131KBytes and Linux doubles the specified window size we need about 12 streams each with a 64KByte window to saturate a 100Mbits/s bottleneck. We therefore used iperf to send data (memory to memory) from SLAC to UVic varying the requested TCP window sizes as 8, 16, 32, 64, 128KBytes, and for each window size made 20 second iperf TCP measurements with @streams=(1,20,8,25,12,45,4,60,12,80,2,100,32). We recorded the window and stream sizes together with the min/avg/max ping RTTs (without iperf running) the aggregate throughput together with the minimum and maximum individual stream throughputs, the user and kernel cpu utilizations, and the actual time taken by the transfer.
The slow start time (see Measuring end-to-end Bandwidth with iperf using Web100 by Tirumala, Cottrell and Dunigan) for an 800KByte bandwidth delay product is about 1 second, so using an iperf duration of 20 seconds means most of the transfer (~95%) is in the more stable Additive Increase Multiplicative Decrease (AIMD) TCP state.

The results are shown in the figure below. It is seen that for smaller values of window * streams the growth in throughput is roughly linear with the number of streams. Not that the maximum window size is constrained by the TCP window size configurationm parameters to be about 130KB and the requested window is doubled, so the difference in the results for the 64KB and 128KB windows is mainly statistical. It can also be seen that the TCP throughput achievable with iperf saturates at between 250 and 300 Mbits/s.

Conclusions
The achievable throughput and available bandwidth estimates agree reasonably well. Thus we expect that we can achieve TCP throughputs of 250-300Mbits/s from SLAC to UVic. The 30Mbits/s reported by UVic (30Mbits/s) is an order of magnitude lower. This may be partly due to mercury having a 100Mbit/s NIC (I think it has a 1GE NIC so this is very speculative, I do not have logon capability to access mercury so I cannot verify this) in which case the bottleneck bandwidth would be closer to 70Mbits/s (estimated by running pathneck from pinger.slac.stanford.edu (a host with a 100Mbits/s NIC) to UVic..
With the default configuration TCP windows size setting (~130KB) this requires one to use a large number of parallel streams (e.g over 60). If the windows sizes were to be increased (see TCP Tuning Guide) then the number of streams could probably be reduced. The problem with using a large number of streams is that it is unfair to other network users who are typically only using one stream. If one uses 10 streams as the current application is configured with the default max TCP window of ~130KB, then one ought to be able to achieve about 70Mbits/s. If this is not possible then it may be due to the application or file system/disks etc.

The Norics are a cluster of hosts that are meant for interactive work and thus should not have their TCP window sizes incsreased. The other host mentioned at SLAC, bbr-xfer06, is a four 450MHz processor Sun Sparcv9 running SunOS 5.9. The maximum cpu utilization in the measurements above with Noric was about 0.1% so CPU should not be a problem (note that within a factor of 2 Xeon/Linux and Solaris/SunOS hosts use the same MHz / Mbits/s). The maximum TCP window size on bbr-xfer06 is about 1MBytes. Thus if the window size at UVic is increased we can use bbr-xfer06 to retry iperf throughputs with larger windows (note that the actual window size used is the miniumum of the sender and receiver window sizes).

Other possibilities would be to:
Use an advanced TCP stack (see for example A Survey of Transport Protocols other than Standard TCP by He, Vicat-Blanc and Welzl or Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks by Bullot, Cottrell and Hughes-Jones) which recovers from congestion events more quickly and backs off less when faced with congestion. The new stack is only needed on the sender. These stacks however currently do not exist for Solaris. The easiest way to experiment would be upgrade to Linux 2.6.16 (which contains options to choose different TCP stacks) or more recent, or to install a non standard TCP stack. The latter is likely to run afoul of security who have big concerns about non-standard TCP stacks.
Try another application such as bbcp or GridFTP. Besides supporting large windows and multiple parallel streams (as does bbftp) these also support memory to memory transfers and thus enable one to isolate whether the problem is with the file/disk subsytems or elsewhere.
Follow up
We installed the latest version of pathneck (1.3) that provides extra information:
45cottrell@iepm-resp:~>sudo ~mak/pathneck -v -o phys02.comp.uvic.ca
probe_once() with [60 0]
1165182286.865677 142.104.21.13 500 60 0

00 7.970 134.79.243.1 121 813 . 1981.561 ub
01 0.307 134.79.252.5 813 813 . 294.854 lb
02 0.285 134.79.135.15 851 851 . 281.970 lb
03 0.369 192.68.191.146 900 851 . 266.375 lb
04 0.582 134.55.209.93 686 686 3 0.000 uk
05 0.964 134.55.217.2 627 686 . 382.750 lb
06 0.963 134.55.207.37 772 736 . 310.498 lb
07 18.441 134.55.218.25 736 736 . 325.666 lb
08 40.847 207.231.240.3 704 736 . 340.885 lb
09 61.398 205.189.32.193 833 833 2 287.774 ub
10 63.466 207.23.240.189 963 963 1 248.982 ub
conf = 0.135 0.116 0.241
rtt = 63.686 ( 142.104.21.13 )

where:
The 1st line:
probing_start_timestamp (local time), dst_ip, load_pkt_size,
load_pkt_num, delay_num

The main body (the 7-column thing):
1st column: the index number
2nd column: the RTT to that hop
3rd column: hop IP address
4th column: raw gap measurement (in microsecond)
5th column: "smooth"ed gap values (in microsecond), they are also
the input for the dynamic algo
6th column: choke points, "." represents non-choke point, "1", "2",
"3" refers to the choke points (output at most 3), ordered
by the confidence value.
7th column: bandwidth bound estimation, unit: Mbps
8th column: ub->upper bound, lb->lower bound, uk->unknown (see the
pathneck paper section 2.3.3 for the detail.)
9th column: hop hostname

The "conf" line:
the confidence values for the choke points detected, in the order of
"1", "2", "3".

The "rtt" line:
the RTT to the destination, dst_ip, and the hop count *from* dst to
source (i.e., return path, not forward path).

Ian Gable of UVic set the TCP windows on phys02.comp.uvic.ca:
[root@phys02 init.d]# sysctl -p
net.ipv4.ip_forward = 0
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.conf.default.rp_filter = 1
kernel.sysrq = 0
kernel.core_uses_pid = 1

At the SLAC end we used iepm-resp, a two 2.4GHz Xeon cpu with a 1GE NIC and windows set:
########################################
# TCP windows/buffer information #
########################################
cat /proc/sys/net/core/wmem_max = 20971520
;cat /proc/sys/net/core/rmem_max = 20971520
;cat /proc/sys/net/core/rmem_default = 65536
;cat /proc/sys/net/core/wmem_default = 65536
;cat /proc/sys/net/ipv4/tcp_rmem = 4096 87380 20971520
;cat /proc/sys/net/ipv4/tcp_wmem = 4096 65536 20971520
;cat /proc/sys/net/ipv4/tcp_mem = 16384 16384 16384

On 12/3/06,we repeated the above iperf TCP throughput measurements but adding window sizes of 256kBytes, 512KBytes and 1024kBytes at the SLAC (sender) end and with phys02.comp.uvic.ca using a window size of 1MByte:
[cottrell@phys02 cottrell]$ iperf -s -w 1024k -p 5000


Shanghai dieying science & technology co., Ltd | Chance
Room 1305 No.341 ShangCheng Road,Shanghai+0086-21-58878998 11394019
dieying@541help.com +0086-21-58879030 HappyFreeAngel@hotmail.com

Dieying science&techology co.,Ltd All rights Resverved 2007-2010 Administrator Boot