Saturday, September 30, 2023
HomeSoftware DevelopmentThe code labored in a different way when the moon was full

# The code labored in a different way when the moon was full

I love a good bug, especially ones that are initially hard to explain but then turn into forehead slapping moments – of course!

There’s a bug over on Github called Hysteresis effect on threadpool hill-climbing that is a super interesting read.

Hill climbing is an algorithmic technique when you have a problem (a hill) and then you improve and improve (climb) until you have reached some maximum acceptable solution (reach the top).

Sébastien, the OP of the bug, says there’s a “Hysteresis effect” on the threadpool. “Hysteresis is the dependence of the state of a system on its history.” Something weird is happening now because something happened before…but what was it?

Sawtooth up and down graphs aren’t THAT interesting…but look at the x-axis. This isn’t showing minute by minute or even millisecond by millisecond ups and downs like you may have seen before. This x-axis uses months as its unit of measure. Read that again and drink it in.

Things are cool in February until they are bad for weeks and then cool in March and it moves along a cycle. Not strictly the cycle of the moon but close.

He’s seeing the number of cores being used changing from month to month when using PortableThreadPool

We have noticed a periodic pattern on the threadpool hill-climbing logic, which uses either n-cores or n-cores + 20 with an hysteresis effect that switches every 3-4 weeks.

Did you know (I know because I’m old) that Windows 95, for a time, was unable to run longer than 49.7 days of runtime? If you ran it that long it would eventually crash! That’s because there’s 86M ms in a day, i.e. 1000 * 60 * 60 * 24 = 86,400,000 and 32 bits is 4,294,967,296 so 4,294,967,296 / 86,400,000 = 49.7102696 days!

Kevin in the Github issues notes this as well:

The whole period of square wave sounds an awful lot like it is around 49.7 days. That is how long it takes GetTickCount() to wrap around. On POSIX platforms the platform abstraction layer implements this, and the value returned for that is based not on uptime but on wall clock time, which matches with all machines changing on the same day.

This 49.7 days number is well known as it’s how long it take before GetTickCount() overflows/wraps. Kevin goes on to actually give the changeover dates which correspond to the graph!

• Thu Jan 14 2021
• Sun Feb 07 2021
• Thu Mar 04 2021
• Mon Mar 29 2021
• Fri Apr 23 2021

He then finds the code in PortableThreadPool.cs that explains the issue:

{
// We need to subtract by prior time because Environment.TickCount can wrap around, making a comparison of absolute times unreliable.
int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
int elapsedInterval = currentTimeMs - priorTime;
if (elapsedInterval >= requiredInterval)
{
...

He says, and this is all Kevin:

currentTimeMs is Environment.TickCount, which in this case happens to be negative.

The if clause controls if the hill climbing is even run.

_separated.priorCompletedWorkRequestsTime and _separated.nextCompletedWorkRequestsTime start out as zero on process start, and only get updated if the hill climbing code is run.

Therefore, requiredInterval = 0 - 0 and elapsedInterval = negativeNumber - 0. This causes the if statement to become
if (negativeNumber - 0 >= 0 - 0) which returns false, so the hill climbing code is not run, and therefore the variables never get updated, and remain zero. The native version of the thread pool code does all math with unsigned numbers which would avoid such a bug, and it’s equivalent part is not even quite the same math in the first place.

The easy fix here is probably to use unsigned arithmetic, but alternatively having the two fields get initialized to Environment.TickCount probably also work

Back to me. Fabulous. The fix is to cast the results to unsigned ints via (uint).

Before:

int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
int elapsedInterval = currentTimeMs - priorTime;

After:

uint requiredInterval = (uint)(_separated.nextCompletedWorkRequestsTime - priorTime);
uint elapsedInterval = (uint)(currentTimeMs - priorTime);

What an interesting and insidious bug! Bugs based on a time calculations can often show themselves later when view through a longer lens and scope of time…sometimes WAY longer than you’d expect.

Sponsor: YugabyteDB is a distributed SQL database designed for resilience and scale. It is 100% open source, PostgreSQL-compatible, enterprise-grade, and runs across all clouds. Sign up and get a free t-shirt.

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Hosting By

RELATED ARTICLES