Post Ranking Algorithm

[[TOC]]

Current Design

Rank = $ScaleFactor * log(Max(1, 3 + Score)) / (Time + 2)^{Gravity}$

Score = $Upvotes - Downvotes$

Time = time since submission (in hours)

Gravity = decay gravity; 1.8 is the default

• For posts, to bring up active threads, the latest comment time is used as the published time (limited to posts created within the last month).
• Max(1, 3 + Score) ensures that all comments are affected by time decay.
• 3 is added to the score so that anything with fewer than 3 net downvotes still ranks as new; otherwise, all new comments would sit at zero, near the bottom.
• Clamping with Max also avoids taking the log of a zero or negative score.
• A scale factor of 10k gets the rank in integer form.

A plot of rank over 24 hours, for scores of 1, 5, 10, 100, and 1000, with a scale factor of 10k.
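As a quick illustration of the formula (not the production code, which is the SQL function below), here is a Python sketch; the function name and defaults mirror the definitions above:

```python
from math import log10, floor

def hot_rank(score: int, hours_since_published: float,
             gravity: float = 1.8, scale: int = 10_000) -> int:
    """Sketch of: Rank = ScaleFactor * log(max(1, 3 + Score)) / (Time + 2)^Gravity.

    log10 is used to match Postgres's log(), which is base 10.
    """
    return floor(scale * log10(max(1, score + 3))
                 / (hours_since_published + 2) ** gravity)

# A heavily downvoted comment clamps to log10(1) == 0, so its rank is 0:
print(hot_rank(-10, 0))  # → 0
```

Note how a fresh post with a modest score outranks an older post with a much higher score, which is the intended effect of the time-decay denominator.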

Implementation

This is implemented as a hot_rank function in the database, which is called in the post_view and comment_view.

hot_rank Function

create or replace function hot_rank(
  score numeric,
  published timestamp without time zone)
returns integer as $$
begin
  -- hours_diff := EXTRACT(EPOCH FROM (timezone('utc', now()) - published)) / 3600
  return floor(
    10000 * log(greatest(1, score + 3))
    / power(((EXTRACT(EPOCH FROM (timezone('utc', now()) - published)) / 3600) + 2), 1.8)
  )::integer;
end;
$$
LANGUAGE plpgsql;


This function is called in the post_aggregates_view:

post_aggregates_view Query

create view post_aggregates_view as
select
p.*,
-- creator details
u.actor_id as creator_actor_id,
u."local" as creator_local,
u."name" as creator_name,
u.published as creator_published,
u.avatar as creator_avatar,
u.banned as banned,
coalesce(cb.id, 0)::bool as banned_from_community,
-- community details
c.actor_id as community_actor_id,
c."local" as community_local,
c."name" as community_name,
c.removed as community_removed,
c.deleted as community_deleted,
c.nsfw as community_nsfw,
-- post score data/comment count
coalesce(pl.score, 0) as score,
hot_rank(
coalesce(pl.score , 0), (
case
when (p.published < ('now'::timestamp - '1 month'::interval))
then p.published
else greatest(ct.recent_comment_time, p.published)
end
)
) as hot_rank,
(
case
when (p.published < ('now'::timestamp - '1 month'::interval))
then p.published
else greatest(ct.recent_comment_time, p.published)
end
) as newest_activity_time
from post p
left join user_ u on p.creator_id = u.id
left join community_user_ban cb on p.creator_id = cb.user_id and p.community_id = cb.community_id
left join community c on p.community_id = c.id
left join (
select
post_id,
max(published) as recent_comment_time
from comment
group by post_id
) ct on ct.post_id = p.id
left join (
select
post_id,
sum(score) as score,
sum(score) filter (where score = 1) as upvotes,
-sum(score) filter (where score = -1) as downvotes
from post_like
group by post_id
) pl on pl.post_id = p.id
order by p.id;


and the comment_aggregates_view:

comment_aggregates_view Query

create view comment_aggregates_view as
select
ct.*,
-- post details
p."name" as post_name,
p.community_id,
-- community details
c.actor_id as community_actor_id,
c."local" as community_local,
c."name" as community_name,
-- creator details
u.banned as banned,
coalesce(cb.id, 0)::bool as banned_from_community,
u.actor_id as creator_actor_id,
u.local as creator_local,
u.name as creator_name,
u.published as creator_published,
u.avatar as creator_avatar,
-- score details
coalesce(cl.total, 0) as score,
hot_rank(coalesce(cl.total, 0), ct.published) as hot_rank
from comment ct
left join post p on ct.post_id = p.id
left join community c on p.community_id = c.id
left join user_ u on ct.creator_id = u.id
left join community_user_ban cb on ct.creator_id = cb.user_id and p.community_id = cb.community_id
left join (
select
l.comment_id as id,
sum(l.score) as total,
count(case when l.score = 1 then 1 else null end) as up,
count(case when l.score = -1 then 1 else null end) as down
from comment_like l
group by comment_id
) as cl on cl.id = ct.id;


The Important Points

• In the comment_aggregates_view, the published time is the comment's published date.

• In the post_aggregates_view, the published time is treated as the most recent comment's published time, unless the thread is older than 1 month.
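The published-time rule for posts can be sketched as follows; the helper name is hypothetical, and `'1 month'::interval` is approximated here as 30 days:

```python
from datetime import datetime, timedelta
from typing import Optional

def effective_published(post_published: datetime,
                        recent_comment_time: Optional[datetime],
                        now: datetime) -> datetime:
    """Mirror the case expression in post_aggregates_view: posts older
    than a month keep their original published time; newer posts use
    the latest comment time (if any), so active threads rise."""
    if post_published < now - timedelta(days=30):
        return post_published
    if recent_comment_time is None:
        return post_published
    return max(recent_comment_time, post_published)
```

All timestamps are assumed to be timezone-naive UTC, matching the `timestamp without time zone` column in the function above.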

There is one more point: the post_aggregates_view is not used directly. An intermediate table is updated from this view whenever a comment is made, via the refresh_comment trigger. The trigger stops updating posts older than 1 week, and forces their rank to $0$.
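A minimal sketch of that cutoff, assuming the rank has already been computed; the function name is hypothetical:

```python
from datetime import datetime, timedelta

ONE_WEEK = timedelta(weeks=1)

def refreshed_rank(post_published: datetime, computed_rank: int,
                   now: datetime) -> int:
    """Posts older than one week stop being re-ranked by the trigger
    and have their rank forced to 0."""
    if now - post_published > ONE_WEEK:
        return 0
    return computed_rank
```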

Proposals to Modify

Using Average Comment Age

This would use the average of a post's comment times, rather than the most recent one, as its effective published time. This is what that would look like:
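One plausible reading of the average-comment-age proposal, sketched in Python; the helper name and the fallback for comment-less posts are assumptions:

```python
from datetime import datetime, timedelta
from typing import List

def average_comment_time(post_published: datetime,
                         comment_times: List[datetime]) -> datetime:
    """Effective published time as the mean of the comment published
    times, falling back to the post's own published time when there
    are no comments."""
    times = comment_times or [post_published]
    epoch = min(times)
    mean_offset = sum((t - epoch).total_seconds() for t in times) / len(times)
    return epoch + timedelta(seconds=mean_offset)
```

Compared to the current "latest comment wins" rule, a single new comment on an old thread moves the average only slightly, so it cannot fully reset a post's age.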

Using An Exponential Decay

Modeling an exponential decay for the effect new comments have on a post's ranking.

The decay factor is calculated as:

$$\text{published\_adjustment} = \text{decay\_hours} \times \left(1 - e^{\frac{\ln(0.8)}{3600 \times \text{factor\_hours}} \times \text{time\_since\_first\_published}}\right)$$


This results in a much smoother decay of post rank as comments accumulate.

Note that the two lines labeled just "Score: 100" and "Score: 1000" are drawn assuming that no new comments are received; under the current algorithm, any new comment would reset the published age.

A target decay_hours of 24 with a factor_hours of 6 prolongs rank nicely, without the flat bar produced by the current algorithm.

Future Development

Now that the immediate problem is solved, we can take some time to fully explore possible solutions for future changes. Part of that is creating models for predicted patterns of behavior, in particular patterns which fall outside of typical norms, and testing how different algorithms act under these conditions. Among other goals, this page aims to catalogue possibilities for these models.

Activity Patterns

'Novelty' communities

Small communities which occasionally see a massive influx of traffic when they are linked in a larger community. This overlaps heavily with, and is perhaps even indistinguishable from, the standard activity pattern of any niche community; goals here should be to avoid disrupting typical community activity while still ensuring that new visitors see the content they're expecting.

POSSIBLE SOLUTIONS: Standard log sorting should handle this fairly well. This is likely to be a major factor in selecting constants for log decay and vote gravity, as it will be more sensitive to changes than more normal behavior.

'Fandom' communities

This includes any community centered on an external source rather than its own content. An archetypal example is a community discussing a TV show: weekly episode threads receive heavy activity throughout the week, alongside smaller posts that follow typical activity patterns. This could also apply to daily/weekly discussion threads or similar, which would see less of an initial spike.

POSSIBLE SOLUTIONS: Before anything else, the pre-existing solution of simply stickying these posts should be mentioned. It may be foolish to even attempt to account for these posts in a generalized algorithm.

If we do try to, though, one way of approaching the problem could be to boost posts based on activity compared to a dynamic community norm. Posts which receive a disproportionate amount of sustained activity are granted an increased amount of staying power in the ranking system, perhaps separately rather than in combination with post votes.

'News' communities

While most activity will follow site norms, occasional very-prominent stories will draw near-constant traffic for a sustained amount of time and dominate discussion for this duration. This also matches some sports forum activity, analogous to live-threads. Another thing to consider in this category is posts which receive sporadic or singular updates adding new facts and bringing a new surge in activity; we ideally want to allow these updates to return a post to prominence without overwhelming the usual ranking pattern.

POSSIBLE SOLUTIONS: The above solution of comparing activity to community norms would also help here, although as this strategy becomes generalized it becomes more necessary to determine a specific approach.

Statistical modeling & testing

Code is available in snippets.