I’d love to hear other people’s opinions on my posts. So I’m adding the ability
to post and view comments on this blog.
There are services that streamline the integration of comments, Disqus being
the oldest one I know of. However, I haven’t had a good experience with Disqus
as a reader on other blogs: it always took a long time to load the comments,
sometimes failing altogether, and its UI is not quite to my taste.
Alternatives like giscus or utterances use GitHub to authenticate users and
store the underlying comment data. But I don’t feel comfortable letting my
website depend on a makeshift solution like that. I want my own website to last.
Without a clear migration plan for existing comments, the longevity of these
solutions is uncertain. Since GitHub didn’t create GitHub Discussions to be
used like that, they have no obligation to keep things the way you want
forever. Additionally, I want to keep the comment creation process on my blog
as easy as possible, requiring no login: just enter your name.
Given my existing API written in Rust (yes, I’m on the hype train too) and a
PostgreSQL database already used to track post views, I’d better make more use
of them.
So I decided to create my own comment platform. This means I have to address
various aspects (not meant to be an exhaustive list):

- Data modelling and APIs to GET/POST comments
- UI for displaying and submitting comments
- Markdown rendering support and content sanitization
- Content moderation and spam prevention
This post will delve into the first half of the data modeling process, covering
the data schema, data fetching, and the implementation to rank and fetch
comments for display.
Data modelling
Modelling involves designing database schemas after researching a suitable model
for storage and easy querying. To sum up, I went for the adjacency list model.
Each comment is a record in the database containing comment data and a
parent_id pointing to the comment it’s replying to. If parent_id is NULL,
it is a root comment. To keep things simple, there’s no depth limit, and it’s
easy to enforce one later at the application level. Below is the schema in
Prisma, along with a corresponding SQL migration script.
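The original Prisma schema and migration script aren’t reproduced here, but a minimal SQL sketch of the adjacency list model could look like the following. The table and column names (comments, parent_id, upvote, created_at) follow the query results shown later in this post; the exact types and the index are my assumptions.

```sql
-- Sketch of the comments table using the adjacency list model.
-- Exact types are assumptions; column names follow the rest of the post.
CREATE TABLE comments (
    id          BIGSERIAL PRIMARY KEY,
    -- NULL parent_id marks a root (top-level) comment.
    parent_id   BIGINT REFERENCES comments (id),
    author_name TEXT NOT NULL,
    content     TEXT NOT NULL,
    upvote      INTEGER NOT NULL DEFAULT 0,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fetching replies filters on parent_id, so it's worth indexing.
CREATE INDEX comments_parent_id_idx ON comments (parent_id);
```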
To simplify things again, we’re only going to sort the comments by upvotes and
timestamp, but the nested replies have to be sorted too. That means we have to
sort every level of every comment’s replies. It’s too complex (for me) to get
the final result using database queries alone. So what we do in SQL is sort
only the top-level comments for pagination, and then fetch all of their child
comments. This is Hacker News-like thread fetching. Sorting the nested comments
is deferred to the Rust API.
This SQL snippet draws significant inspiration from a Stack Overflow answer
that I regrettably couldn’t find again. Executing the query gives us a list of
sorted comments, where every comment is guaranteed to appear after its parent.
The query uses a recursive common table expression (CTE) for this purpose,
ensuring comments are ordered based on their hierarchical structure.
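The snippet itself isn’t shown here, but a recursive CTE of this shape could look roughly like the sketch below, assuming a comments table with id, parent_id, upvote, and created_at columns. The root path array and level columns match the result table that follows.

```sql
WITH RECURSIVE thread AS (
    -- Anchor: rank and paginate the top-level comments only.
    (SELECT c.*, ARRAY[c.id] AS root, 0 AS level
     FROM comments c
     WHERE c.parent_id IS NULL
     ORDER BY c.upvote DESC, c.created_at
     LIMIT 20)
    UNION ALL
    -- Recursive step: pull in all replies, extending the path array.
    SELECT c.*, t.root || c.id, t.level + 1
    FROM comments c
    JOIN thread t ON c.parent_id = t.id
)
-- Sorting by the path array places every comment right after its parent.
SELECT * FROM thread ORDER BY root;
```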
| parent_id | id | author_name | content  | root      | level | created_at | upvote |
|-----------|----|-------------|----------|-----------|-------|------------|--------|
| NULL      | 1  | wonrax      | 1.       | {1}       | 0     | 2023-10-25 | 0      |
| 1         | 2  | wonrax      | 1.1.     | {1,2}     | 1     | 2023-10-25 | 0      |
| 2         | 3  | wonrax      | 1.1.1.   | {1,2,3}   | 2     | 2023-10-26 | 0      |
| 3         | 6  | wonrax      | 1.1.1.1. | {1,2,3,6} | 3     | 2023-10-26 | 0      |
| 1         | 4  | wonrax      | 1.2.     | {1,4}     | 1     | 2023-10-26 | 1      |
| 1         | 5  | wonrax      | 1.3.     | {1,5}     | 1     | 2023-10-26 | 2      |
| NULL      | 7  | wonrax      | 2.       | {7}       | 0     | 2023-10-26 | 0      |
| 7         | 8  | wonrax      | 2.1.     | {7,8}     | 1     | 2023-10-26 | 0      |
Now we need to rank the children of the top-level comments at the application
level. The end result we want is similar to the one above, except that all
comments at the same level are also sorted by upvote and timestamp.
Algorithms
Method 1: Recursively iterate through the list
Our recursive iterative ranking function receives a list of comments as input
and outputs a sorted list. First, it loops over the list:

1. Find all children of the current comment. This takes O(n), where n is the
number of children of the current comment. We do this by incrementing the index
until we reach a node whose level is equal to or shallower than the current
node’s. Call this node next and set it aside for now.
2. Call our recursive function on all of those children. It returns the sorted
children.
3. Skip ahead to the next node and repeat the process, or stop if we reach the
end of the list.

After sorting all children of the top-level comments, we need to sort the
top-level comments themselves. While finding the next (top-level) nodes, we
also store their indices in a separate list. Now we sort the top-level nodes
and emplace their children into this new list. This is O(m⋅log(m)), where m is
the number of top-level comments, per Rust’s built-in sort_unstable_by_key
function. We finish the function by returning this new list.
The overall time complexity is O(n^a + n^a⋅log(n)), with a being the depth of
the reply tree. If a is unbounded and we assume the worst case where the
branching factor is a constant n, the complexity simplifies to:
O(n^n + n^n⋅log(n))
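The actual implementation isn’t reproduced here, but the procedure above can be sketched in Rust roughly as follows. The Comment struct and its fields are illustrative stand-ins (the real one presumably carries author, content, and so on), and the sample data mirrors the example thread from the table.

```rust
use std::cmp::Reverse;

#[derive(Clone, Debug)]
struct Comment {
    id: u64,
    level: usize,    // depth in the thread; 0 for top-level comments
    upvote: i64,
    created_at: u64, // stand-in for a real timestamp
}

/// Rank a run of sibling comments and, recursively, their replies.
/// `input` must be ordered parent-before-child, as the recursive CTE returns it.
fn rank(input: &[Comment]) -> Vec<Comment> {
    let Some(first) = input.first() else { return Vec::new() };
    let current_level = first.level;

    // Each entry pairs a comment at `current_level` with its ranked replies.
    let mut groups: Vec<(Comment, Vec<Comment>)> = Vec::new();
    let mut i = 0;
    while i < input.len() {
        // Children of input[i] run until the next node at the same or a
        // shallower level; that node is the `next` sibling (or the end).
        let mut next = i + 1;
        while next < input.len() && input[next].level > current_level {
            next += 1;
        }
        groups.push((input[i].clone(), rank(&input[i + 1..next])));
        i = next;
    }

    // Sort this level by upvotes (descending), then timestamp (ascending).
    groups.sort_unstable_by_key(|(c, _)| (Reverse(c.upvote), c.created_at));

    // Emplace each comment followed by its already-ranked replies.
    let mut out = Vec::with_capacity(input.len());
    for (head, children) in groups {
        out.push(head);
        out.extend(children);
    }
    out
}

fn main() {
    // The example thread from the table above.
    let input = vec![
        Comment { id: 1, level: 0, upvote: 0, created_at: 25 },
        Comment { id: 2, level: 1, upvote: 0, created_at: 25 },
        Comment { id: 3, level: 2, upvote: 0, created_at: 26 },
        Comment { id: 6, level: 3, upvote: 0, created_at: 26 },
        Comment { id: 4, level: 1, upvote: 1, created_at: 26 },
        Comment { id: 5, level: 1, upvote: 2, created_at: 26 },
        Comment { id: 7, level: 0, upvote: 0, created_at: 26 },
        Comment { id: 8, level: 1, upvote: 0, created_at: 26 },
    ];
    let ids: Vec<u64> = rank(&input).iter().map(|c| c.id).collect();
    // Replies of comment 1 are now ranked by upvote: 5, then 4, then 2.
    assert_eq!(ids, vec![1, 5, 4, 2, 3, 6, 7, 8]);
}
```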
Method 2: Use an intermediate tree
For this method, first we iterate through the list and construct a comment tree.
This is O(n) with the help of a hash map.
Having the tree, we can simply sort the children recursively. This is also
O(n^n) if the reply depth is unlimited, assuming the branching factor is a
constant n.
Finally, we turn the tree back into a list with a depth-first search, which
has a complexity of O(n).
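Again as a sketch rather than the actual code: the three steps above (build the tree with a hash map, sort each sibling group, flatten with a depth-first search) could look like this in Rust. The Comment struct is the same illustrative stand-in, this time carrying parent_id instead of level.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct Comment {
    id: u64,
    parent_id: Option<u64>, // None marks a top-level comment
    upvote: i64,
    created_at: u64,        // stand-in for a real timestamp
}

/// Rank comments by building an intermediate tree keyed by parent id.
fn rank_via_tree(input: &[Comment]) -> Vec<Comment> {
    // O(n) pass: group each comment's index under its parent's id.
    let mut children: HashMap<u64, Vec<usize>> = HashMap::new();
    let mut roots: Vec<usize> = Vec::new();
    for (i, c) in input.iter().enumerate() {
        match c.parent_id {
            Some(parent) => children.entry(parent).or_default().push(i),
            None => roots.push(i),
        }
    }

    // Sort every sibling group by upvotes (descending), then timestamp.
    let key = |&i: &usize| (Reverse(input[i].upvote), input[i].created_at);
    roots.sort_unstable_by_key(key);
    for siblings in children.values_mut() {
        siblings.sort_unstable_by_key(key);
    }

    // Depth-first traversal flattens the tree back into a display list,
    // visiting each comment before its (already sorted) replies.
    let mut out = Vec::with_capacity(input.len());
    let mut stack: Vec<usize> = roots.into_iter().rev().collect();
    while let Some(i) = stack.pop() {
        out.push(input[i].clone());
        if let Some(replies) = children.get(&input[i].id) {
            stack.extend(replies.iter().rev());
        }
    }
    out
}

fn main() {
    // The example thread once more, now expressed via parent_id.
    let input = vec![
        Comment { id: 1, parent_id: None, upvote: 0, created_at: 25 },
        Comment { id: 2, parent_id: Some(1), upvote: 0, created_at: 25 },
        Comment { id: 3, parent_id: Some(2), upvote: 0, created_at: 26 },
        Comment { id: 6, parent_id: Some(3), upvote: 0, created_at: 26 },
        Comment { id: 4, parent_id: Some(1), upvote: 1, created_at: 26 },
        Comment { id: 5, parent_id: Some(1), upvote: 2, created_at: 26 },
        Comment { id: 7, parent_id: None, upvote: 0, created_at: 26 },
        Comment { id: 8, parent_id: Some(7), upvote: 0, created_at: 26 },
    ];
    let ids: Vec<u64> = rank_via_tree(&input).iter().map(|c| c.id).collect();
    assert_eq!(ids, vec![1, 5, 4, 2, 3, 6, 7, 8]);
}
```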
It seems the second method could be a bit faster despite both being O(n^n),
and it’s easier to reason about and implement too. But for the sake of learning
Rust, and to verify my calculations and reasoning, I implemented both solutions
and benchmarked them. Below is a table showing the average time each
implementation took for an input of n comments.
| Number of comments | Iterative recursive | Intermediate tree |
|--------------------|---------------------|-------------------|
| 10                 | 1.3687 µs           | 1.6782 µs         |
| 1000               | 252.51 µs           | 201.47 µs         |
| 10000              | 3.3283 ms           | 1.9920 ms         |
| 100000             | 61.469 ms           | 27.919 ms         |
For a relatively small number of comments, the iterative recursive
implementation is faster, probably because it doesn’t need to construct a new
tree and convert it back into a list, which has more overhead than the
algorithm itself. But as n increases, the intermediate tree implementation is
the clear winner.
Does that mean my reasoning and calculations are correct? No, far from it, I’m
afraid. But I feel more confident about my ability to approximate (or in
academic terms, asymptotically analyze) the worst-case complexity of an
algorithm. One thing I wish I could add to make this post more complete is a
space complexity analysis, i.e. the memory usage of these two algorithms. But
I’m too lazy.
The code
I use the criterion crate for benchmarking. I think both implementations could
be further optimized if I really tried. For example, iterative_recursive could
use slices instead of Vecs so it doesn’t need to allocate new ones. But until
this becomes an actual problem, I’ll hold off on optimizing.