Sorting algorithms are generally old hat to most programmers. They've either analyzed them to death in class, already written many of them, or work in a language where one is provided and they don't need to think about it. Or all three.
For PHP developers, we have a suite of sorting tools available to us: sort()
, usort()
, ksort()
, uasort()
, and various other letter combinations. All use Quick Sort internally, which is generally the best performing single-threaded option. Most importantly, many of them let us provide a custom comparison function to use when determining which of two values is larger or smaller.
However, that doesn't mean the last word has been spoken. Most sorting algorithms are all about picking what to compare to minimize the number of comparisons, not doing the actual comparison. Additionally, we don't always want to sort data by itself. Quite often, we need to sort records by some extrinsic value, such as a priority or before/after flags.
What are the performance characteristics of those options? Since my PSR-14 Event Dispatcher implementation, Tukio, supports both priorty and before/after logic, I wanted to see which was faster, and just how fast. The results were quite interesting!
Types of sorting
There are two main kinds of extrinsic sorting we're going to consider: Priority Sort and Topological Sort. Then we'll try combining them in various ways and see what happens.
If you want to follow along at home, the benchmarking code is available on GitHub. I've inlined the relevant highlights below. An important point is that for each sorting style, we maintain an internal array of Item
objects. The Item
object is an internal class (so just public properties, no methods) that holds the item value and its priority, or its before/after rules, or both. You'll see them at work in the code samples below.
Priority Sort
In a Priority Sort setup, each item to be sorted is given a priority integer value. When sorting, values are sorted by priority, with larger numbers coming first. That means a priority 5
item will come before a priority 3
item, but after a priority 8
item. If two items have the same priority, their relative order doesn't matter although getting them back in the same order they were originally is preferable.
There's another version of Priority Sort that calls the extra integer a "weight," and is the exact same thing but the lower numbers come first. Some systems use one, some use the other, but for our purposes they're interchangeable.
If you've ever used the Symfony EventDispatcher component, its "priority" values are exactly what we're talking about. (Tukio does the same thing.)
The basic algorithm for priority sort is the same as any other sort, but we compare the priority number associated with each item rather than the item itself.
The full code is here, but the important bits are simply this:
class PrioritySortNaive
{
// ...
/** @var array */
protected array $items;
public function sorted(): iterable
{
if (!$this->sorted) {
$this->sort();
$this->sorted = true;
}
return array_map(static fn (PriorityItem $item): mixed => $item->item, $this->items);
}
protected function sort(): array
{
// We want higher numbers to come first, hence the * -1 to invert it.
$sort = static fn (PriorityItem $a, PriorityItem $b): int
=> -1 * ($a->priority <=> $b->priority);
usort($this->items, $sort);
return $this->items;
}
}
It's pretty straightforward. The sort()
method just defers everything to PHP's own usort()
function, which sorts by the priority value on each item. (The -1
in there is to reverse the sort order so larger numbers come first.) The sorted()
method sorts if necessary, then uses array_map()
to easily get just the value out of each entry to return.
Topological Sorting
Topological Sorting is sorting a series of nodes that have a before/after relationship with other nodes. So node A
comes "before" node B
, node B
comes "after" node C
, etc. Topological sorting is somewhat more robust than Priority sorting, but has two additional costs:
- In order to specify nodes relative to each other, they all must have a unique ID that you can reference.
- There is the potential for "cycles," where
A
comes beforeB
,B
comes beforeC
, andC
comes beforeA
. That's impossible and becomes a failure condition we have to handle.
There are a couple of topological sort algorithms available. The most straightforward goes roughly like so:
- Normalize all before/after rules to be just "before" rules.
- Build a list of how many nodes point to (must be before) each other node.
- Find any nodes that have zero incoming "before" arrows, meaning those are the first (in any order). Save those into a result list.
- Decrement the inbound-before counters for every node pointed at by the nodes we just found.
- Repeat steps 3 and 4 until we run out of nodes, in which case the result list is now sorted, or no nodes have zero inbound-before counters but there are still some to process, in which case we've found a cycle and there is no possible sorting order.
I mostly cribbed my implementation from a Python example, but the logic is reasonably straightforward. Here's the full version and the relevant bits:
class TopSortBasic
{
// ...
/** @var array */
protected array $items;
protected ?array $sorted = null;
public function sorted(): iterable
{
return $this->sorted ??= array_map(fn (string $id): mixed
=> $this->items[$id]->item, $this->sort());
}
protected function sort(): array
{
$this->normalizeDirection();
// Compute the initial indegrees for all items.
$indegrees = array_fill_keys(array_keys($this->items), 0);
foreach ($this->items as $id => $node) {
foreach ($node->before as $neighbor) {
if (isset($this->items[$neighbor])) {
$indegrees[$neighbor]++;
}
}
}
// Find items with nothing that comes before it.
$usableItems = [];
foreach ($this->items as $id => $item) {
if ($indegrees[$id] === 0) {
$usableItems[] = $id;
}
}
// Because the items were pushed onto the usable list, we need
// to reverse it to get them back in the order they were added.
$usableItems = array_reverse($usableItems);
// Keep removing usable items until there are none left.
$sorted = [];
while (count($usableItems)) {
// Grab an available item. We know it's sorted.
$id = array_pop($usableItems);
$sorted[] = $id;
// Decrement the neighbor count of everything that item was before.
foreach ($this->items[$id]->before as $neighbor) {
if (isset($indegrees[$neighbor])) {
$indegrees[$neighbor]--;
if ($indegrees[$neighbor] === 0) {
$usableItems[] = $neighbor;
}
}
}
}
// We've run out of nodes with no incoming edges.
// Did we add all the nodes or find a cycle?
if (count($sorted) === count($this->items)) {
return $sorted;
}
throw new CycleFound();
}
protected function normalizeDirection(): void
{
foreach ($this->items as $node) {
foreach ($node->after ?? [] as $afterId) {
// If this item should come after something that doesn't exist,
// that's the same as no restrictions.
if ($this->items[$afterId]) {
$this->items[$afterId]->before[] = $node->id;
}
}
}
}
}
Again, sorted()
is just calling sort()
and then mapping over the result to get back just the values. The sort()
method is the algorithm we described a moment ago. normalizeDirection()
just converts all "after" rules into "before" rules so that we have only one direction to check.
Grouped Priority Sort
The way priority sort works, if two items have the same priority we don't care what their order is, or at least prefer them to be left alone (so first added remains first). That suggests a different implementation.
Instead of a flat list of Item
s with priorities, we have an associative array of Item
s where the key is the priority and the value is an array of items in the order they were added. Then we need only sort the arrays by their key, which should (we hope) be a far smaller number of items, and know that the whole list is now sorted enough. In theory, it should be a huge time savings.
Here's the full code and the fun parts:
class PrioritySortGrouped
{
// ...
/** @var array> */
protected array $items;
public function sorted(): iterable
{
$this->sort();
$it = (function () {
foreach ($this->items as $itemList) {
yield from array_map(static fn (PriorityItem $item) => $item->item, $itemList);
}
})();
return iterator_to_array($it, false);
}
protected function sort(): void
{
if (!$this->sorted) {
krsort($this->items);
$this->sorted = true;
}
}
}
The sorting part is now dead-simple. We just defer to PHP's provided krsort()
function, which sorts an array by its key values, in descending order (so that larger numbers come first, as desired). The sorted()
method gets slightly more complex, as we now need to iterate over the arrays within the arrays and pluck out the values at each point. I found this most convenient to do with a generator, but there are other options.
Internalized Topological Sort
On a lark, I wanted to see what would happen if I used the before/after rules as the comparison routine given to PHP's internal functions. Would that be any faster? Would it even work?
The code goeth thusly:
class TopSortInternal
{
protected function sort(): array
{
$this->normalizeDirection();
uasort($this->items, static function (TopologicalItem $left, TopologicalItem $right): int {
if (in_array($right->id, $left->before, true)) {
return -1;
}
if (in_array($left->id, $right->before, true)) {
return 1;
}
return 0;
});
return array_keys($this->items);
}
}
(sorted()
is the same as for TopSortBasic
.)
Do you think that would work?
Turns out, it doesn't. PHP... never actually compares every item to every other item in QuickSort, so some rules simply end up unchecked. That leads to invalid results. Even if it had worked, we would have lost cycle detection. We'll exclude this from our benchmarks, but it's still good to try new things like this and see if, and why, they fail.
Combined Sort (Priority)
What if we want to allow users of a library to sort either by priority, or by before/after rules? Can we combine these two sorting mechanisms?
We can, to an extent. Specifically, we can allow users to provide either a priority or a topological ordering rule, and then before sorting convert one into the other. We can tell just by thinking about it that there will be some added cost to doing so, since in either case we're doing the same sort as before, plus some conversion logic. But for some use cases maybe that's OK.
If we want to use priority sorting internally, that does mean we lose cycle detection. That may also be OK for many situations.
Here's what I came up with for a priority-centric combined sort:
class CombinedSortPriority
{
/** @var array> */
protected array $items;
/** @var CombinedItem[] */
protected array $toPrioritize = [];
/** @var array */
protected array $itemIndex = [];
protected function sort(): void
{
$this->normalizeDirection();
// Convert any before/after items to priorities.
$this->prioritizePendingItems();
if (!$this->sorted) {
krsort($this->items);
$this->sorted = true;
}
}
protected function prioritizePendingItems(): void
{
// If there are no prioritized items at all yet, pick one
// and declare it priority 0. (The last one is the fastest
// to access and remove from the list, so do that.)
if (empty($this->items)) {
$item = array_pop($this->toPrioritize);
$this->items[$item->priority][$item->id] = $item;
}
foreach ($this->toPrioritize as $item) {
// Get the priority of all of the items this one must come before.
$otherPriorities = array_map(fn(string $before): int
=> $this->itemIndex[$before]->priority, $item->before);
// This seems backwards, but that's because we want higher numbers
// to come first. So something that comes "before" another item
// needs to have a higher priority than it.
$priority = max($otherPriorities) + 1;
$item->priority = $priority;
$this->items[$priority][] = $item;
}
// We never need to reprioritize these again.
$this->toPrioritize = [];
}
}
There's more bookkeeping going on, although because the objects are never duplicated the overall cost shouldn't too high. sorted()
is the same as in Grouped Priority Sort, which is what I built this example off of, so I didn't repeat it. The key is that we keep a separate list of items that have a priority and items that have a topological order. Then before sorting, we go through the topological list and convert all the before/after rules into priorities. To do so, we get the priority of every item it comes before, get the maximum priority of those, and add one to it. That guarantees that the item will sort before the others. We can then add it to the priority list and sort everything exactly as before.
Combined Sort (Topological)
The general concept for a Topological-centric combined sort is basically the same, but the other way around. We keep two separate lists, then convert the priority-based items into before-based items right before sorting.
Here's my first attempt at that:
protected function sort(): array
{
$this->normalizeDirection();
$this->topologizePendingItems();
// The same as before.
}
protected function topologizePendingItems(): void
{
foreach ($this->toTopologize as $priority => $items) {
foreach ($items as $item) {
foreach ($this->toTopologize as $otherPriority => $otherItems) {
// For every other pending item, if this item's priority is
// greater than it, that means it must come before it. Mark
// the item as such.
if ($priority > $otherPriority) {
$add = array_map(static fn(CombinedItem $i) => $i->id, $otherItems);
$item->before = [...$item->before, ...array_values($add)];
}
$this->items[$item->id] = $item;
}
}
}
// We don't need to reprioritize these again.
$this->toTopologize = [];
}
}
In this case I am again storing the prioritized items grouped by priority. Then we iterate over those (which thus takes two foreach()
loops, but is still only an O(n) operation) and for each one, add everything with a lower priority to its before list. The result is guaranteed to be a valid representation of the result we want, which can then be topologically sorted as before.
I did not actually benchmark this version. Or rather, I tried to and it never finished. Can you see why?
Because this is an O(n^2) algorithm! We're comparing every priority-based item against every other priority-based item. That's very bad on performance. In addition, we're adding way more information than we need to. If there are 5000 items, do we really need to say that a priority 100 item comes before all other 4999 items? No, we don't, but that would likely add more runtime to the topological sort routine.
So let's toss out that version, and instead do this:
class CombinedSortTop
{
protected function topologizePendingItems(): void
{
// First, put the priorities in order, low numbers first.
ksort($this->toTopologize);
while (count($this->toTopologize)) {
// Get the highest priority set. That's the last item in the
// list, which is fastest to access.
$items = array_pop($this->toTopologize);
// We don't actually care what the next priority is, but need it
// as a lookup value to get the items in that priority.
$otherPriority = array_key_last($this->toTopologize);
/** @var CombinedItem $item */
foreach ($items as $item) {
// If $otherPriority is null, it means this is the last priority set
// so there is nothing else it comes before.
if ($otherPriority) {
$item->before = array_map(static fn(CombinedItem $i)
=> $i->id, $this->toTopologize[$otherPriority]);
}
$this->items[$item->id] = $item;
}
}
}
}
In this version, we take advantage of the fact that we have the priorities already grouped. We first sort by priority, with low numbers first. We then pop items off the end (so starting with the highest priority) and say that all of those items must come before every item in the next priority after that, but we don't care about anything in priorities further down. That becomes implicit.
This approach was far faster, fast enough that I could benchmark it effectively. However, there's still a problem. Can you spot it?
The Results
Before we look at the data, look back at the code above. What do you expect the results to be? Which algorithm do you expect to perform best, and in what situations? Why? Ponder that a moment before continuing.
Tests were run on my local laptop, a 7th Gen Lenovo Thinkpad X1 Carbon with 16 GB of RAM running Kubuntu 22.04. Obviously if you have a different system your absolute results will be different but the relative values should be about the same. All tests were built with PHPBench. I used a list size of 100,000, because if I set it higher the benchmarks just took too damned long.
Priority Sort Naive
I ran four test on the naive priority sort.
- PrioritySortNaiveUniquePreSortedBench - All items have a unique priority, and they com