I'd like to hear that story, when you get the chance. ;-)

Karrot_Kream · on April 6, 2017

Sure and thanks for the wait!

To give a bit of background: I run a service for some folks that allows them to get status messages. Some of my users wanted stats on the kinds of messages they received. My first implementation was quick and dirty: shell scripts which would run filters and aggregations through combinations of grep, sort, and uniq. Eventually as more demands came in with different types of functionality, this became unscalable, so I wrote an analysis engine in Prolog.

Problem: Find status lines that have valid "http" links in them.

First attempt: SWI Prolog contains a goal sub_string/5 (http://www.swi-prolog.org/pldoc/man?predicate=sub_string/5). The goal is invoked as: sub_string(+String, ?Before, ?Length, ?After, ?SubString). I was storing status messages in the Prolog database wrapped in the "status" dynamic goal, so a status message looked like: `status("hello world")`. At a Prolog REPL, to query for all status messages, I could match along the goal `status(X)` and all bound instances of X would be the messages.

I created a DCG and wrapped it in a goal called `parse_http_string(String)` which would evaluate to true if `String` was a valid HTTP string. My first attempt was a goal like:

`status(String), sub_string(String, _, _, _, Needle), parse_http_string(Needle) `

This is an O(n^2) query. We first bind `String` to a status message, and the `sub_string` predicate will match _all valid substrings_ of `String` and stuff it into `Needle`. `parse_http_string/1` will then run on `Needle` and return true when the goal succeeds on a substring.

This was slow, but the query wasn't used very often and it worked for my customers.

Second Attempt: I realized that all HTTP links started with the text "http". By searching the status message for the string "http" and then starting the substring search at the starting index at the beginning of this match, I'd only be doing an O(n) search of substrings (because I'm only varying the end index of `sub_string/1` rather than both the start and end index). The modified goal looked like the following:

`status(String), sub_string(String, HttpStart, _, _, "http"), sub_string(String, HttpStart, _, Needle), parse_http_string(Needle) `

This produced the expected O(n) algorithm, despite only adding a single line and modifying the next line. This also made the query a lot faster, so I could rest easier when my customers began to look for links en masse in their status messages! (Which they did, as surely as any product you hand to users ever scales).

gugagore · on April 6, 2017

Just to clarify, it was O(n^2) for n the length of the string, not the number of statuses. Right?

Karrot_Kream · on April 6, 2017

Correct.

carapace · on April 7, 2017

Cheers. :-)

drunkpotato · on April 6, 2017

Me too, please post!