Matching Wildcards: An Improved Algorithm for Big Data

By Kirk J Krauss

Over the past ten years, ideas about how to improve on the existing algorithms for matching wildcards have come to me now and then. Meanwhile, there have been requests for permission to include the two routines that I’ve published in various projects, occasional suggestions for improved correctness or completeness, and various thoughts that have arrived out of the blue. One thought was this: though the approach I took in 2014 — tuning my first non-recursive routine of 2008 using a performance profiler — turned out to be successful, tuning code is no substitute for designing it as well as can be from the start, for the use case at hand.

A profiler has nothing to say about the ways in and out of string matching that might be handled more efficiently than how previous routines have gone about it. Something I’ve observed from profiling these routines is that the more special cases can be handled in special logic, off the paths more commonly taken through the code, the faster those commonly taken paths can work. An empty input string represents an increasingly popular case of a commonly taken path. Another thing I’ve realized is that while handling the entire string matching job in a single while loop may seem elegant, it pushes the logic for matching text subsequent to a ‘*’ wildcard to be at, or near the top of, another commonly taken path.

Wildcards and Big Data

Some important use cases for any general purpose routine for matching wildcards come up in the course of processing sparse data. Since big data use cases are often just as well described as sparse data use cases that scale, the routine can be handed lots of empty strings just when it needs to be at its most efficient. That is, either the “tame’ or “wild” input strings may be empty, over and over, across a large data set. The routine also may be called for long strings that don’t contain any wildcards, given the off-chance that a wildcard may be in there now and then. In cases like these, the routine ideally should act as much as possible like one of its non-wildcard peers, except for the additional checks for the wildcard characters.

There need not be more than one check for each wildcard character (‘*’ and ‘?’) in just the “wild” input string. Also as long as the “wild” and “tame” strings compare exactly, they can share a single index. In fact, no index variables are altogether necessary, other than in portable versions of the routine that avoid direct pointer manipulation. For optimal performance, pointers can serve in lieu of variables that must be set and checked throughout the routine, and no index variables need be initialized at all unless a ‘*’ wildcard is found. This can make empty string handling for those sparse data sets about as speedy as it could be with any string comparison routines, even those that don’t handle wildcards.

The code for the fast pointer-based implementation I’ve just described appears in the FastWildCompare() function of Listing 1. A more portable version, which relies on index variables rather than pointers, appears in the FastWildComparePortable() function of Listing 2. Both routines expect null-terminated strings as input, so they aren’t usable with certain text representations, such as objects of the String class in Java. But the portable version can be used with wide characters or other internationalized text. It can easily be ported to any languages and platforms that permit access to a terminating character for each input string.

Listing One

// Copyright 2018 IBM Corporation
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Compares two text strings.  Accepts '?' as a single-character wildcard.  
// For each '*' wildcard, seeks out a matching sequence of any characters 
// beyond it.  Otherwise compares the strings a character at a time. 
//
bool FastWildCompare(char *pWild, char *pTame)
{
	char *pWildSequence;  // Points to prospective wild string match after '*'
	char *pTameSequence;  // Points to prospective tame string match

	// Find a first wildcard, if one exists, and the beginning of any  
	// prospectively matching sequence after it.
	do
	{
		// Check for the end from the start.  Get out fast, if possible.
		if (!*pTame)
		{
			if (*pWild)
			{
				while (*(pWild++) == '*')
				{
					if (!(*pWild))
					{
						return true;   // "ab" matches "ab*".
					}
				}

			    return false;          // "abcd" doesn't match "abc".
			}
			else
			{
				return true;           // "abc" matches "abc".
			}
		}
		else if (*pWild == '*')
		{
			// Got wild: set up for the second loop and skip on down there.
			while (*(++pWild) == '*')
			{
				continue;
			}

			if (!*pWild)
			{
				return true;           // "abc*" matches "abcd".
			}

			// Search for the next prospective match.
			if (*pWild != '?')
			{
				while (*pWild != *pTame)
				{
					if (!*(++pTame))
					{
						return false;  // "a*bc" doesn't match "ab".
					}
				}
			}

			// Keep fallback positions for retry in case of incomplete match.
			pWildSequence = pWild;
			pTameSequence = pTame;
			break;
		}
		else if (*pWild != *pTame && *pWild != '?')
		{
			return false;              // "abc" doesn't match "abd".
		}

		++pWild;                       // Everything's a match, so far.
		++pTame;
	} while (true);

	// Find any further wildcards and any further matching sequences.
	do
	{
		if (*pWild == '*')
		{
			// Got wild again.
			while (*(++pWild) == '*')
			{
				continue;
			}

			if (!*pWild)
			{
				return true;           // "ab*c*" matches "abcd".
			}

			if (!*pTame)
			{
				return false;          // "*bcd*" doesn't match "abc".
			}

			// Search for the next prospective match.
			if (*pWild != '?')
			{
				while (*pWild != *pTame)
				{
					if (!*(++pTame))
					{
						return false;  // "a*b*c" doesn't match "ab".
					}
				}
			}

			// Keep the new fallback positions.
			pWildSequence = pWild;
			pTameSequence = pTame;
		}
		else if (*pWild != *pTame && *pWild != '?')
		{
			// The equivalent portion of the upper loop is really simple.
			if (!*pTame)
			{
				return false;          // "*bcd" doesn't match "abc".
			}

			// A fine time for questions.
			while (*pWildSequence == '?')
			{
				++pWildSequence;
				++pTameSequence;
			}

			pWild = pWildSequence;

			// Fall back, but never so far again.
			while (*pWild != *(++pTameSequence))
			{
				if (!*pTameSequence)
				{
					return false;      // "*a*b" doesn't match "ac".
				}
			}

			pTame = pTameSequence;
		}

		// Another check for the end, at the end.
		if (!*pTame)
		{
			if (!*pWild)
			{
				return true;           // "*bc" matches "abc".
			}
			else
			{
				return false;          // "*bc" doesn't match "abcd".
			}
		}

		++pWild;                       // Everything's still a match.
		++pTame;
	} while (true);
}

Listing Two

// Copyright 2018 IBM Corporation
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Slower but portable version of FastWildCompare().  Performs no direct 
// pointer manipulation.  Can work with wide-character text strings.  Use 
// only with null-terminated strings.
//
// Compares two text strings.  Accepts '?' as a single-character wildcard.  
// For each '*' wildcard, seeks out a matching sequence of any characters 
// beyond it.  Otherwise compares the strings a character at a time.
//
bool FastWildComparePortable(char *strWild, char *strTame)
{
	int  iWild = 0;     // Index for both tame and wild strings in upper loop
	int  iTame;         // Index for tame string, set going into lower loop
	int  iWildSequence; // Index for prospective match after '*' (wild string)
	int  iTameSequence; // Index for prospective match (tame string)

	// Find a first wildcard, if one exists, and the beginning of any  
	// prospectively matching sequence after it.
	do
	{
		// Check for the end from the start.  Get out fast, if possible.
		if (!strTame[iWild])
		{
			if (strWild[iWild])
			{
				while (strWild[iWild++] == '*')
				{
					if (!strWild[iWild])
					{
						return true;   // "ab" matches "ab*".
					}
				}

			    return false;          // "abcd" doesn't match "abc".
			}
			else
			{
				return true;           // "abc" matches "abc".
			}
		}
		else if (strWild[iWild] == '*')
		{
			// Got wild: set up for the second loop and skip on down there.
			iTame = iWild;

			while (strWild[++iWild] == '*')
			{
				continue;
			}

			if (!strWild[iWild])
			{
				return true;           // "abc*" matches "abcd".
			}

			// Search for the next prospective match.
			if (strWild[iWild] != '?')
			{
				while (strWild[iWild] != strTame[iTame])
				{
					if (!strTame[++iTame])
					{
						return false;  // "a*bc" doesn't match "ab".
					}
				}
			}

			// Keep fallback positions for retry in case of incomplete match.
			iWildSequence = iWild;
			iTameSequence = iTame;
			break;
		}
		else if (strWild[iWild] != strTame[iWild] && strWild[iWild] != '?')
		{
			return false;              // "abc" doesn't match "abd".
		}

		++iWild;                       // Everything's a match, so far.
	} while (true);

	// Find any further wildcards and any further matching sequences.
	do
	{
		if (strWild[iWild] == '*')
		{
			// Got wild again.
			while (strWild[++iWild] == '*')
			{
				continue;
			}

			if (!strWild[iWild])
			{
				return true;           // "ab*c*" matches "abcd".
			}

			if (!strTame[iTame])
			{
				return false;          // "*bcd*" doesn't match "abc".
			}

			// Search for the next prospective match.
			if (strWild[iWild] != '?')
			{
				while (strWild[iWild] != strTame[iTame])
				{
					if (!strTame[++iTame])
					{
						return false;  // "a*b*c" doesn't match "ab".
					}
				}
			}

			// Keep the new fallback positions.
			iWildSequence = iWild;
			iTameSequence = iTame;
		}
		else if (strWild[iWild] != strTame[iTame] && strWild[iWild] != '?')
		{
			// The equivalent portion of the upper loop is really simple.
			if (!strTame[iTame])
			{
				return false;          // "*bcd" doesn't match "abc".
			}

			// A fine time for questions.
			while (strWild[iWildSequence] == '?')
			{
				++iWildSequence;
				++iTameSequence;
			}

			iWild = iWildSequence;

			// Fall back, but never so far again.
			while (strWild[iWild] != strTame[++iTameSequence])
			{
				if (!strTame[iTameSequence])
				{
					return false;      // "*a*b" doesn't match "ac".
				}
			}

			iTame = iTameSequence;
		}

		// Another check for the end, at the end.
		if (!strTame[iTame])
		{
			if (!strWild[iWild])
			{
				return true;           // "*bc" matches "abc".
			}
			else
			{
				return false;          // "*bc" doesn't match "abcd".
			}
		}

		++iWild;                       // Everything's still a match.
		++iTame;
	} while (true);
}

Design and Testing Strategies

In the routines of both Listing 1 and Listing 2, the overall algorithm is identical. A check for input that’s reached its end happens at the top of the first loop. This picks off empty input strings and returns right away. A check for a ‘*’ wildcard happens next; only if one of those is found do any variables used for subsequent matching get initialized. Some up-front steps are performed in those wildcard scenarios, and the index variables used in the second loop are set, before control passes to the second loop. But if no wildcard is found, then the second loop isn’t invoked, and the remainder of the first loop serves merely to compare the characters of the input strings. This simplifies the handling of all-tame comparisons.

The overall design of the second loop is like a rearrangement of the first loop, beginning with a check for any further ‘*’ wildcard, followed by another round of input character comparison, and ending with a check for the end of the input. That last check is useful when a sequence of the tame string begins to match with a sequence of the wild string following the ‘*’. If one of the strings ends, and the other does not, then a match isn’t possible unless some special cases apply to the wild string. These are checked in logic designated with comments such as “ab*c*” matches “abcd”. That case, where a wild string ends in ‘*’, is picked off so the routine returns without comparing more of the tame string than necessary. Other special cases involve various conditions where characters in the wild string don’t appear in the tame string, and vice versa. How to know whether all the special cases that might arise are properly handled? That’s where thorough testing comes in.

Any routine is as reliable as the most comprehensive available tests show it to be. Validation test cases for matching wildcards have been coming to me over the years, and though the collection grew rapidly at first, the incoming rate has gradually slowed to a trickle. The code of Listings 1 and 2 passes every test I’ve collected, as well as some new ones I’ve added mainly for performance comparison. Listings 3 through 5 provide both correctness and performance testing for, respectively, input containing wildcards in various arrangements, input containing plain text strings, and input containing empty strings.

Listing Three

// Copyright 2018 IBM Corporation
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
#include ‹stdio.h›

#define COMPARE_PERFORMANCE  1
#define COMPARE_WILD         1
#define COMPARE_TAME         1
#define COMPARE_EMPTY        1

bool test(char *pTame, char *pWild, bool bExpectedResult)
{
	bool bPassed = true;

	if (bExpectedResult != FastWildCompare(pWild, pTame))
	{
		bPassed = false;
	}

#if defined(COMPARE_PERFORMANCE)
	if (bExpectedResult != WildTextCompare(pTame, pWild))
	{
		bPassed = false;
	}
#endif

	return bPassed;
}

int testwild(void)
{
    int  nReps;
    bool bAllPassed = true;

#if defined(COMPARE_PERFORMANCE)
    // Can choose as many repetitions as you're expecting in the real world.
    nReps = 1000000;
#else
    nReps = 1;
#endif

    while (nReps--)
    {
		// Case with first wildcard after total match.
        bAllPassed &= test("Hi", "Hi*", true);
		
		// Case with mismatch after '*'
        bAllPassed &= test("abc", "ab*d", false);

        // Cases with repeating character sequences.
        bAllPassed &= test("abcccd", "*ccd", true);
        bAllPassed &= test("mississipissippi", "*issip*ss*", true);
        bAllPassed &= test("xxxx*zzzzzzzzy*f", "xxxx*zzy*fffff", false);
        bAllPassed &= test("xxxx*zzzzzzzzy*f", "xxx*zzy*f", true);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxx*zzy*fffff", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxx*zzy*f", true);
        bAllPassed &= test("xyxyxyzyxyz", "xy*z*xyz", true);
        bAllPassed &= test("mississippi", "*sip*", true);
        bAllPassed &= test("xyxyxyxyz", "xy*xyz", true);
        bAllPassed &= test("mississippi", "mi*sip*", true);
        bAllPassed &= test("ababac", "*abac*", true);
        bAllPassed &= test("ababac", "*abac*", true);
        bAllPassed &= test("aaazz", "a*zz*", true);
        bAllPassed &= test("a12b12", "*12*23", false);
        bAllPassed &= test("a12b12", "a12b", false);
        bAllPassed &= test("a12b12", "*12*12*", true);

#if !defined(COMPARE_PERFORMANCE)
		// From DDJ reader Andy Belf
        bAllPassed &= test("caaab", "*a?b", true);
#endif

        // Additional cases where the '*' char appears in the tame string.
        bAllPassed &= test("*", "*", true);
        bAllPassed &= test("a*abab", "a*b", true);
        bAllPassed &= test("a*r", "a*", true);
        bAllPassed &= test("a*ar", "a*aar", false);

        // More double wildcard scenarios.
        bAllPassed &= test("XYXYXYZYXYz", "XY*Z*XYz", true);
        bAllPassed &= test("missisSIPpi", "*SIP*", true);
        bAllPassed &= test("mississipPI", "*issip*PI", true);
        bAllPassed &= test("xyxyxyxyz", "xy*xyz", true);
        bAllPassed &= test("miSsissippi", "mi*sip*", true);
        bAllPassed &= test("miSsissippi", "mi*Sip*", false);
        bAllPassed &= test("abAbac", "*Abac*", true);
        bAllPassed &= test("abAbac", "*Abac*", true);
        bAllPassed &= test("aAazz", "a*zz*", true);
        bAllPassed &= test("A12b12", "*12*23", false);
        bAllPassed &= test("a12B12", "*12*12*", true);
        bAllPassed &= test("oWn", "*oWn*", true);

        // Completely tame (no wildcards) cases.
        bAllPassed &= test("bLah", "bLah", true);
        bAllPassed &= test("bLah", "bLaH", false);

        // Simple mixed wildcard tests suggested by Marlin Deckert.
        bAllPassed &= test("a", "*?", true);
        bAllPassed &= test("ab", "*?", true);
        bAllPassed &= test("abc", "*?", true);

        // More mixed wildcard tests including coverage for false positives.
        bAllPassed &= test("a", "??", false);
        bAllPassed &= test("ab", "?*?", true);
        bAllPassed &= test("ab", "*?*?*", true);
        bAllPassed &= test("abc", "?**?*?", true);
        bAllPassed &= test("abc", "?**?*&?", false);
        bAllPassed &= test("abcd", "?b*??", true);
        bAllPassed &= test("abcd", "?a*??", false);
        bAllPassed &= test("abcd", "?**?c?", true);
        bAllPassed &= test("abcd", "?**?d?", false);
        bAllPassed &= test("abcde", "?*b*?*d*?", true);

        // Single-character-match cases.
        bAllPassed &= test("bLah", "bL?h", true);
        bAllPassed &= test("bLaaa", "bLa?", false);
        bAllPassed &= test("bLah", "bLa?", true);
        bAllPassed &= test("bLaH", "?Lah", false);
        bAllPassed &= test("bLaH", "?LaH", true);

        // Many-wildcard scenarios.
        bAllPassed &= test("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab", 
            "a*a*a*a*a*a*aa*aaa*a*a*b", true);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*a*aa*aaa*fa*ga*b*", true);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*a*x*aaa*fa*ga*b*", false);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*aaaa*fa*ga*gggg*b*", false);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*aaaa*fa*ga*ggg*b*", true);
        bAllPassed &= test("aaabbaabbaab", "*aabbaa*a*", true);
        bAllPassed &= test("a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", 
            "a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaaa", 
            "*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaa", 
            "*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", false);
        bAllPassed &= test("abc*abcd*abcde*abcdef*abcdefg*abcdefgh*abcdefghi*a\
bcdefghij*abcdefghijk*abcdefghijkl*abcdefghijklm*abcdefghijklmn", 
            "abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*a\
            bc*", false);
        bAllPassed &= test("abc*abcd*abcde*abcdef*abcdefg*abcdefgh*abcdefghi*a\
bcdefghij*abcdefghijk*abcdefghijkl*abcdefghijklm*abcdefghijklmn", 
            "abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*", true);
        bAllPassed &= test("abc*abcd*abcd*abc*abcd", "abc*abc*abc*abc*abc", 
            false);
        bAllPassed &= test(
            "abc*abcd*abcd*abc*abcd*abcd*abc*abcd*abc*abc*abcd", 
            "abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abcd", true);
        bAllPassed &= test("abc", "********a********b********c********", 
            true);
        bAllPassed &= test("********a********b********c********", "abc", 
            false);
        bAllPassed &= test("abc", "********a********b********b********", 
            false);
        bAllPassed &= test("*abc*", "***a*b*c***", true);

        // A case-insensitive algorithm test.
        // bAllPassed &= test("mississippi", "*issip*PI", true);

        // Tests suggested by other DDJ readers
        bAllPassed &= test("", "?", false);
        bAllPassed &= test("", "*?", false);
        bAllPassed &= test("", "", true);
        bAllPassed &= test("a", "", false);
    }

    if (bAllPassed)
    {
        printf("Passed\n");
    }
    else
    {
        printf("Failed\n");
    }

    return 0;
}

int main(void)
{
#if defined(COMPARE_TAME)
	testtame();
#endif

#if defined(COMPARE_EMPTY)
	testempty();
#endif

#if defined(COMPARE_WILD)
	testwild();
#endif

	return 0;
}

Listing Four

// Copyright 2018 IBM Corporation
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
int testtame(void)
{
    int  nReps;
    bool bAllPassed = true;

#if defined(COMPARE_PERFORMANCE)
    // Can choose as many repetitions as you're expecting in the real world.
    nReps = 1000000;
#else
    nReps = 1;
#endif

    while (nReps--)
    {
		// Case with last character mismatch.
        bAllPassed &= test("abc", "abd", false);

        // Cases with repeating character sequences.
        bAllPassed &= test("abcccd", "abcccd", true);
        bAllPassed &= test("mississipissippi", "mississipissippi", true);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxxzzzzzzzzyfffff", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxxzzzzzzzzyf", true);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxxzzy.fffff", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxxzzzzzzzzyf", true);
        bAllPassed &= test("xyxyxyzyxyz", "xyxyxyzyxyz", true);
        bAllPassed &= test("mississippi", "mississippi", true);
        bAllPassed &= test("xyxyxyxyz", "xyxyxyxyz", true);
        bAllPassed &= test("m ississippi", "m ississippi", true);
        bAllPassed &= test("ababac", "ababac?", false);
        bAllPassed &= test("dababac", "ababac", false);
        bAllPassed &= test("aaazz", "aaazz", true);
        bAllPassed &= test("a12b12", "1212", false);
        bAllPassed &= test("a12b12", "a12b", false);
        bAllPassed &= test("a12b12", "a12b12", true);

        // A mix of cases
        bAllPassed &= test("n", "n", true);
        bAllPassed &= test("aabab", "aabab", true);
        bAllPassed &= test("ar", "ar", true);
        bAllPassed &= test("aar", "aaar", false);
        bAllPassed &= test("XYXYXYZYXYz", "XYXYXYZYXYz", true);
        bAllPassed &= test("missisSIPpi", "missisSIPpi", true);
        bAllPassed &= test("mississipPI", "mississipPI", true);
        bAllPassed &= test("xyxyxyxyz", "xyxyxyxyz", true);
        bAllPassed &= test("miSsissippi", "miSsissippi", true);
        bAllPassed &= test("miSsissippi", "miSsisSippi", false);
        bAllPassed &= test("abAbac", "abAbac", true);
        bAllPassed &= test("abAbac", "abAbac", true);
        bAllPassed &= test("aAazz", "aAazz", true);
        bAllPassed &= test("A12b12", "A12b123", false);
        bAllPassed &= test("a12B12", "a12B12", true);
        bAllPassed &= test("oWn", "oWn", true);
        bAllPassed &= test("bLah", "bLah", true);
        bAllPassed &= test("bLah", "bLaH", false);

        // Single '?' cases.
        bAllPassed &= test("a", "a", true);
        bAllPassed &= test("ab", "a?", true);
        bAllPassed &= test("abc", "ab?", true);

        // Mixed '?' cases.
        bAllPassed &= test("a", "??", false);
        bAllPassed &= test("ab", "??", true);
        bAllPassed &= test("abc", "???", true);
        bAllPassed &= test("abcd", "????", true);
        bAllPassed &= test("abc", "????", false);
        bAllPassed &= test("abcd", "?b??", true);
        bAllPassed &= test("abcd", "?a??", false);
        bAllPassed &= test("abcd", "??c?", true);
        bAllPassed &= test("abcd", "??d?", false);
        bAllPassed &= test("abcde", "?b?d*?", true);

        // Longer string scenarios.
        bAllPassed &= test("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab", 
            "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab", true);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", true);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajaxalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", false);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaggggagaaaaaaaab", false);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", true);
        bAllPassed &= test("aaabbaabbaab", "aaabbaabbaab", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 
            "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaaa", 
            "aaaaaaaaaaaaaaaaa", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaa", 
            "aaaaaaaaaaaaaaaaa", false);
        bAllPassed &= test("abcabcdabcdeabcdefabcdefgabcdefghabcdefghia\
bcdefghijabcdefghijkabcdefghijklabcdefghijklmabcdefghijklmn", 
            "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc", 
			false);
        bAllPassed &= test("abcabcdabcdeabcdefabcdefgabcdefghabcdefghia\
bcdefghijabcdefghijkabcdefghijklabcdefghijklmabcdefghijklmn", 
            "abcabcdabcdeabcdefabcdefgabcdefghabcdefghia\
bcdefghijabcdefghijkabcdefghijklabcdefghijklmabcdefghijklmn", 
			true);
        bAllPassed &= test("abcabcdabcdabcabcd", "abcabc?abcabcabc", 
            false);
        bAllPassed &= test(
            "abcabcdabcdabcabcdabcdabcabcdabcabcabcd", 
            "abcabc?abc?abcabc?abc?abc?bc?abc?bc?bcd", true);
        bAllPassed &= test("?abc?", "?abc?", true);
    }

    if (bAllPassed)
    {
        printf("Passed\n");
    }
    else
    {
        printf("Failed\n");
    }

    return 0;
}

Listing Five

// Copyright 2018 IBM Corporation
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
int testempty(void)
{
    int  nReps;
    bool bAllPassed = true;

#if defined(COMPARE_PERFORMANCE)
    // Can choose as many repetitions as you're expecting in the real world.
    nReps = 1000000;
#else
    nReps = 1;
#endif

    while (nReps--)
    {
		// A simple case
        bAllPassed &= test("", "abd", false);

        // Cases with repeating character sequences
        bAllPassed &= test("", "abcccd", false);
        bAllPassed &= test("", "mississipissippi", false);
        bAllPassed &= test("", "xxxxzzzzzzzzyfffff", false);
        bAllPassed &= test("", "xxxxzzzzzzzzyf", false);
        bAllPassed &= test("", "xxxxzzy.fffff", false);
        bAllPassed &= test("", "xxxxzzzzzzzzyf", false);
        bAllPassed &= test("", "xyxyxyzyxyz", false);
        bAllPassed &= test("", "mississippi", false);
        bAllPassed &= test("", "xyxyxyxyz", false);
        bAllPassed &= test("", "m ississippi", false);
        bAllPassed &= test("", "ababac*", false);
        bAllPassed &= test("", "ababac", false);
        bAllPassed &= test("", "aaazz", false);
        bAllPassed &= test("", "1212", false);
        bAllPassed &= test("", "a12b", false);
        bAllPassed &= test("", "a12b12", false);

        // A mix of cases
        bAllPassed &= test("", "n", false);
        bAllPassed &= test("", "aabab", false);
        bAllPassed &= test("", "ar", false);
        bAllPassed &= test("", "aaar", false);
        bAllPassed &= test("", "XYXYXYZYXYz", false);
        bAllPassed &= test("", "missisSIPpi", false);
        bAllPassed &= test("", "mississipPI", false);
        bAllPassed &= test("", "xyxyxyxyz", false);
        bAllPassed &= test("", "miSsissippi", false);
        bAllPassed &= test("", "miSsisSippi", false);
        bAllPassed &= test("", "abAbac", false);
        bAllPassed &= test("", "abAbac", false);
        bAllPassed &= test("", "aAazz", false);
        bAllPassed &= test("", "A12b123", false);
        bAllPassed &= test("", "a12B12", false);
        bAllPassed &= test("", "oWn", false);
        bAllPassed &= test("", "bLah", false);
        bAllPassed &= test("", "bLaH", false);

		// Both strings empty
        bAllPassed &= test("", "", true);

		// Another simple case
        bAllPassed &= test("abc", "", false);

        // Cases with repeating character sequences.
        bAllPassed &= test("abcccd", "", false);
        bAllPassed &= test("mississipissippi", "", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "", false);
        bAllPassed &= test("xyxyxyzyxyz", "", false);
        bAllPassed &= test("mississippi", "", false);
        bAllPassed &= test("xyxyxyxyz", "", false);
        bAllPassed &= test("m ississippi", "", false);
        bAllPassed &= test("ababac", "", false);
        bAllPassed &= test("dababac", "", false);
        bAllPassed &= test("aaazz", "", false);
        bAllPassed &= test("a12b12", "", false);
        bAllPassed &= test("a12b12", "", false);
        bAllPassed &= test("a12b12", "", false);

        // A mix of cases
        bAllPassed &= test("n", "", false);
        bAllPassed &= test("aabab", "", false);
        bAllPassed &= test("ar", "", false);
        bAllPassed &= test("aar", "", false);
        bAllPassed &= test("XYXYXYZYXYz", "", false);
        bAllPassed &= test("missisSIPpi", "", false);
        bAllPassed &= test("mississipPI", "", false);
        bAllPassed &= test("xyxyxyxyz", "", false);
        bAllPassed &= test("miSsissippi", "", false);
        bAllPassed &= test("miSsissippi", "", false);
        bAllPassed &= test("abAbac", "", false);
        bAllPassed &= test("abAbac", "", false);
        bAllPassed &= test("aAazz", "", false);
        bAllPassed &= test("A12b12", "", false);
        bAllPassed &= test("a12B12", "", false);
        bAllPassed &= test("oWn", "", false);
        bAllPassed &= test("bLah", "", false);
        bAllPassed &= test("bLah", "", false);
    }

    if (bAllPassed)
    {
        printf("Passed\n");
    }
    else
    {
        printf("Failed\n");
    }

    return 0;
}

Improved Performance Across the Board

Figures 1 through 6 depict screenshots of three call graphs, with two figures per call graph. Each pair of figures reflects a performance improvement over my WildTextCompare() routine of 2014. In the first call graph, the testwild() code of Listing 3 is run in COMPARE_PERFORMANCE mode. This gets the code of Listing 1 to crunch all the wildcard test input for a million repetitions, along with the code of the 2014 routine, to compare the routines’ performance. Though the 2014 routine was optimized for wildcard comparison, the new FastWildCompare() routine comes in somewhat faster, thanks to the improved arrangement of special case logic.

In the second pair of screen shots, the testtame() input of Listing 4 likewise gets processed a million times. This input has no wildcards and instead comprises only tame strings of various lengths. As with the testwild() performance, the new routine is faster than the 2014 routine. But the biggest improvement is seen with the testempty() input of Listing 5.

It is with empty strings that the new routine saves significant time. A program that does a million passes through the testempty() input turns out to spend roughly as much time in testempty() itself as in either of the matching wildcard routines it invokes. Still, the new FastWildCompare() routine consumes about 6% less of the overall run time than the 2014 routine. The overall performance gain for empty strings is greater than 15%.

These are not exponential gains, but they improve on the already speedy performance of the existing WildTextCompare() code that was optimized with the assistance of a profiler and a pile of tests. Is this new algorithm of FastWildCompare() the definitive one to use? It’s easy to envision a different, perhaps more specific routine being a better fit in many situations. For example, there may be cases where every input string is known to contain a ‘*’ wildcard, or where strings of a certain length are expected, or where full regular expression handling is needed, or where strings may not be null-terminated. Those situations and others could each call for techniques of their own. A performance comparison check, such as the test() routine of Listing 3 in combination with a performance profiling tool, can help with selection of the right routine for a particular job.

A routine specialized for one situation may not readily adapt to others, or it may lack performance beyond its intended scope of use cases. For example, you probably wouldn’t want to use even the fastest routine for matching wildcards on strings that you don’t expect to contain wildcards. An ordinary string comparison algorithm would probably be a better choice, for that case. At the opposite extreme, when all the input is expected to involve wildcards, this new algorithm is just one of several fine candidates. But for a broad range of scenarios in the middle, particularly involving large amounts of sparse data that may contain an occasional wildcard, this new algorithm may be the one to pick.


The code listings above are the property of IBM Corporation, which makes them available for use by the software development community under the Apache v2 license. All other content is copyright © 2018 developforperformance.com.

Develop for Performance