For one full week, between 26 September 2018 and 3 October 2018, my UDP Tester ran on two computers, one in the UK and one in Uruguay (UY), sending and receiving UDP messages in both directions. On each side, the receiver ran continuously, logging every UDP message that it received during the whole interval. By contrast, the sender ran hourly on both sides but at different times, so that the communications did not overlap. I don’t expect that an overlap would have caused any trouble, but this was meant to be a test of UDP under best conditions and for this reason I set the times so that the sender on one end always finished a full run before the one on the other end started. For the same reason, the messages were sent at a rate of at most 1 message per second[1].
At each run, the sender sent exactly 2043 UDP messages with lengths between 6 and 2048 octets (inclusive), each message having a different length. The order of messages was pseudo-random, relying on the Mersenne Twister PRNG seeded with the local time at the start of the run (in Unix format). The sender kept a log of all messages it sent, including destination IP and port, the seed used for MT, the local time when the message was sent and the size of the message. The receiver also logged basic information about each message: source IP and port, local time and message size as observed by the receiver as well as those contained in the message’s own header, the number of observed incorrect bits in the message’s payload, and the expected and actual values of incorrect octets.
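As a rough illustration of the run described above, here is a minimal Python sketch, not the UDP Tester's actual code. It assumes that Python's `random.Random` (itself a Mersenne Twister) stands in for the tester's PRNG, so the orderings it yields will not match those in the logs. It also assumes 0-based octet positions for the `Pos mod 256` fill pattern and ignores the message header entirely. The names `run_lengths`, `fill_pattern` and `count_bad_bits` are made up for this example.

```python
import random
import time

MIN_LEN, MAX_LEN = 6, 2048  # inclusive bounds: 2043 distinct lengths


def run_lengths(seed):
    """Return every length in [MIN_LEN, MAX_LEN] exactly once, in seeded pseudo-random order."""
    lengths = list(range(MIN_LEN, MAX_LEN + 1))
    # random.Random is a Mersenne Twister; the shuffle is reproducible for a given seed.
    random.Random(seed).shuffle(lengths)
    return lengths


def fill_pattern(length):
    """Build a message body where the octet at position pos holds pos mod 256.

    Assumption: positions start at 0 and no header is modelled.
    """
    return bytes(pos % 256 for pos in range(length))


def count_bad_bits(received, expected):
    """Count the bits that differ between two equally long octet strings."""
    return sum(bin(a ^ b).count("1") for a, b in zip(received, expected))


seed = int(time.time())  # local time at the start of the run, in Unix format
order = run_lengths(seed)
print(f"seed={seed}, first lengths: {order[:5]}, messages in run: {len(order)}")
```

Logging the seed is what makes a run reconstructible afterwards: given the seed, the receiver's side of the analysis can regenerate the exact order in which the sizes were sent.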
A first look at the week-long data yielded a bit of a surprise in that the UY receiver had actually received *more* messages than were sent from the UK! At a closer look, it turned out that 4933 UDP messages arriving at the UY node were actually sent by its own local switch! Moreover, they were all, without exception, recorded as corrupted since neither size nor payload matched the expected values[2]. At the moment those switch-generated messages are a bit of a mystery - it’s unclear what they are exactly or why and how they appeared. Working hypotheses are that they are either local DHCP messages (although the port number would be a weird choice for those) or stray fragments of bigger UDP messages. My one single attempt to replicate this behaviour while simultaneously capturing everything with tcpdump has so far failed - there were no such unexpected messages at all over several hours of the UK sender at work. I might perhaps try again at a later date, after I’m done with the more pressing tests that I need for SMG comms, or simply process the existing error log and reconstitute the already observed weird messages from there. Anyway, for now I put those anomalous messages to the side and focus instead on the rest of the messages (which were sent as expected either by the node in the UK or by the one in UY). Here’s a summary of the data thus cleaned:
Data | UK node | UY node |
Total sent: | 345928 [3] | 345267 [4] |
Total received: | 344717 [5] | 345183 [6] |
% received: | 99.84% [7] | 99.78% [8] |
Errors received [9]: | 0 | 0 |
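The percentages in the table follow directly from the counts, since each node receives what the *other* node sent. A small Python sketch of that arithmetic, using only the figures given in the table and its footnotes (the function name `received_pct` is just for this example):

```python
def received_pct(sent_by_peer, received_here):
    """Percentage of the peer's messages that made it to this node."""
    return 100.0 * received_here / sent_by_peer


# Each entry: (sent by the other node, received at this node)
counts = {
    "UK node": (345267, 344717),  # receives what UY sent
    "UY node": (345928, 345183),  # receives what the UK sent
}

for node, (sent, received) in counts.items():
    lost = sent - received
    print(f"{node}: {received_pct(sent, received):.2f}% received, {lost} lost")
```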
Arguably the lost messages are of most interest in all the above: can one say perhaps that the largest messages[10] get lost more often? Not really, or at least not based on this little set of data. Compare the summary stats for three groups of messages: all messages sent from the UK (reflecting, as expected, the sizes sent and the fact that the same number of messages of each size are sent), all messages lost at UY (i.e. those that did not make it on the way from UK to UY) and all messages lost at UK (i.e. those that did not make it on the way from UY to UK):
Data | Min | 1st Q | Median | Mean | 3rd Q | Max |
All sent from UK | 6 | 516 | 1027 | 1027 | 1538 | 2048 |
Lost at UY (UK->UY) | 13 | 513 | 1049 | 1051 | 1602 | 2045 |
Lost at UK (UY->UK) | 16 | 553.5 | 1072 | 1061 | 1576 | 2047 |
While the data set of lost messages is quite small (550 messages lost at UK and 745 at UY), note that this is mainly due to the fact that there are relatively few losses overall: less than 0.4% of messages sent got lost on the way (about 0.16% towards the UK and 0.22% towards UY). So it would seem that at least under the conditions and on the routes considered[11], UDP is not all that unreliable anyway. In any case, those summaries above seem to me remarkably close to one another - meaning that there isn’t any visible evidence that some sizes would get lost more than others, at least not for the set of sizes considered. Arguably sizes of up to 2048 octets of message are quite fine for communications over UDP - or at any rate, just as fine as smaller sizes.
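For anyone wanting to redo this comparison on the published data, a six-number summary of message sizes can be sketched in Python as below. This is an illustration only: the helper `summary` is made up here, and the quartile convention (`method="inclusive"`) is my assumption, so quartiles may differ by a fraction from the ones in the table depending on the tool used. The lost-message sizes themselves have to be extracted from the logs and are not reproduced here.

```python
import statistics


def summary(sizes):
    """Return min, quartiles, mean and max for a list of message sizes."""
    q1, median, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
    return {
        "min": min(sizes),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(sizes),
        "q3": q3,
        "max": max(sizes),
    }


# Baseline: one message of every size from 6 to 2048 octets, as in each run.
all_sizes = list(range(6, 2049))
print(summary(all_sizes))
```

If losses were biased towards large messages, the median and quartiles of the lost sizes would sit visibly above this baseline rather than close to it.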
In terms of order of received messages, the UY node received ALL messages precisely in the order in which they were sent but the UK node reported 66 messages in total that arrived out of order. Although this is a tiny number, it is perhaps reasonable to assume that it might increase in worse conditions (e.g. significantly less than 1 second between sending messages).
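One possible way to count out-of-order arrivals is sketched below in Python. It rests on assumptions of mine rather than on how the UDP Tester actually counts: each received message is mapped back to its index in the sender's log, and a message counts as out of order when it arrives after a message that was sent later. Lost messages leave gaps in the indices but do not count as reordering. The function name `count_out_of_order` is hypothetical.

```python
def count_out_of_order(send_indices):
    """Count messages that arrive after a later-sent message has already arrived.

    send_indices: the sender-side index of each message, listed in arrival order.
    """
    highest_seen = -1
    late = 0
    for idx in send_indices:
        if idx < highest_seen:
            late += 1  # something sent after this one got here first
        else:
            highest_seen = idx
    return late


print(count_out_of_order([0, 1, 3, 2, 4]))  # message 2 arrived after message 3
```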
The actual timings are a bit iffier to investigate since the precision of UDP Tester turns out to be less than what would be needed for such a task. Moreover, there is something weird going on with the way I recorded the time, because the difference between the two nodes should be ~34 seconds (UY node local time = UK node local time + 34) but this doesn't quite square with all the data, especially at the UY receiver end[12]. On the more positive side though, at least the measurement bias there is constant for all the data and it doesn't introduce any weird effects, so I can still attempt to infer something, considering also that observed behaviour suggests that most UDP messages really make it to the other end within 1 second. Consequently, I calculated the delta on both sides as TR - TS at first and then I added (on the UK side) or subtracted (on the UY side) the quantity needed to make the lowest delta 0. So at the UY receiver, delta = TR - TS - 11 while at the UK receiver, delta = TR - TS + 32. With this correction, the summary stats for the delta on both sides turn out to be remarkably similar:
Data | Min | 1st Q | Median | Mean | 3rd Q | Max |
Deltas at UK node: | 0 | 5 | 11 | 10.60 | 16 | 21 |
Deltas at UY node: | 0 | 5 | 11 | 10.62 | 16 | 21 |
Note that I do *not* recommend taking the above delta values for anything really, as the tester's precision in recording time is just not enough for this.
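The correction applied to the deltas amounts to shifting all raw TR - TS values so that the smallest one becomes 0. A minimal Python sketch of that step, with made-up timestamps chosen only so that the resulting offsets match the -11 (UY) and +32 (UK) mentioned above:

```python
def normalised_deltas(pairs):
    """Shift raw TR - TS values so that the lowest delta is 0.

    pairs: iterable of (ts, tr) tuples, i.e. (time sent, time received) in seconds.
    Returns the shifted deltas and the raw minimum that was subtracted.
    """
    raw = [tr - ts for ts, tr in pairs]
    offset = min(raw)
    return [d - offset for d in raw], offset


# Illustrative values only, not taken from the logs.
uy_deltas, uy_offset = normalised_deltas([(100, 111), (200, 216), (300, 332)])
uk_deltas, uk_offset = normalised_deltas([(100, 68), (200, 179)])
print(uy_deltas, uy_offset)  # raw minimum 11, so delta = TR - TS - 11
print(uk_deltas, uk_offset)  # raw minimum -32, so delta = TR - TS + 32
```

Since the same constant is removed from every entry on a given side, this only moves the distribution; it does nothing about the limited precision of the recorded times.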
You are of course warmly invited to run your own tests and to play with this dataset in any way you see fit. So here's the data from both nodes, including the additional 4933 messages that the UY node received from its own switch:
- udp_test_take1_data.zip (~10MB)
- SHA512SUM: 963b8a1467630eea35532122ab7c2d25cb8741001808841f7cf02b34abb6ad5300adcb1d667dd902b4278dd2b373dc46427b0b0bbc918ee52f326456535a4114 udp_test_take1_data.zip
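To check a downloaded copy against the SHA512SUM above without relying on any particular command-line tool, something like the following Python sketch should do. It assumes the archive sits in the current directory under the name given above; the helper names `sha512_of` and `matches_published_sum` are made up for this example.

```python
import hashlib

# Published checksum for udp_test_take1_data.zip, copied from the list above.
EXPECTED_SHA512 = (
    "963b8a1467630eea35532122ab7c2d25cb8741001808841f7cf02b34abb6ad53"
    "00adcb1d667dd902b4278dd2b373dc46427b0b0bbc918ee52f326456535a4114"
)


def sha512_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that the whole archive need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sum(path):
    """True if the file at path hashes to the published SHA-512 value."""
    return sha512_of(path) == EXPECTED_SHA512


# Usage, once the archive has been downloaded:
#   matches_published_sum("udp_test_take1_data.zip")
```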
Have fun!
[1] Specifically: the sender had a delay of 1 second between any two consecutive messages.
[2] The UDP tester simply fills the message up to any length with values calculated as Pos mod 256 where Pos is the position of the respective octet in the full message.
[3] This is precisely 345928/2043=169 runs.
[4] This includes a partial 170th run since I stopped the whole test while the UY sender was running already its 170th run.
[5] 550 messages lost in total.
[6] 745 messages lost in total.
[7] 344717 / 345267 * 100
[8] 345183 / 345928 * 100
[9] This refers to messages received but with payloads that don’t match the expected values.
[10] Note that this test capped the messages at 2048 so “largest” here means strictly < 2049 octets.
[11] The UK node is a “consumer” node i.e. behind a router and on a residential connection; the UY node is S.MG’s test server with Pizarro.
[12] Considering TR as TimeReceived and TS as TimeSent, at UY receiver the delta should be calculated as TR - (TS + 34) = TR - TS - 34; however, there are entries with TR-TS as low as 11 so basically it would seem that messages arrived before they were even sent.