Post Ranking Algorithm

Current Design

Rank = $\text{ScaleFactor} \times \log(\max(1, 3 + \text{Score})) \ / \ (\text{Time} + 2)^{\text{Gravity}}$

Score = $Upvotes - Downvotes$

Time = time since submission (in hours)

Gravity = decay gravity, 1.8 is the default

  • For posts, in order to bring up active posts, the latest comment time is used in place of the submission time (limited to a maximum creation age of a month ago).
  • Use Max(1, 3 + Score) to keep the log argument positive, so all posts and comments are affected by time decay.
  • Add 3 to the score, so that anything with fewer than 3 net downvotes still seems new. Otherwise, all new comments would stay at zero, near the bottom.
  • Earlier variants used the sign and absolute value of the score to deal with the log of negative scores; clamping with Max(1, …) makes that unnecessary.
  • A scale factor of 10k brings the rank into integer range.
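The formula above can be sketched in Python (a modeling sketch, not the production implementation; note that Postgres' log() is base 10, so math.log10 is used to match):

```python
import math

def hot_rank(score: int, hours_since_publish: float,
             scale_factor: int = 10_000, gravity: float = 1.8) -> int:
    # Rank = ScaleFactor * log(max(1, 3 + Score)) / (Time + 2)^Gravity
    numerator = math.log10(max(1, 3 + score))
    denominator = (hours_since_publish + 2) ** gravity
    return math.floor(scale_factor * numerator / denominator)

# A brand-new post with a single upvote outranks a day-old post with a
# score of 100, which is the intended "new posts get seen" effect.
new_post = hot_rank(1, 0)
old_post = hot_rank(100, 24)
```

Because the score is clamped before the log, any post at a net score of −2 or below ranks at exactly 0 regardless of age.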

A plot of rank over 24 hours, of scores of 1, 5, 10, 100, 1000, with a scale factor of 10k.


This is implemented as a hot_rank function in the database, which is called in the post_aggregates_view and comment_aggregates_view.

hot_rank Function

create or replace function hot_rank(
  score numeric,
  published timestamp without time zone)
returns integer as $$
begin
  -- hours_diff := extract(epoch from (timezone('utc', now()) - published)) / 3600
  return floor(10000 * log(greatest(1, score + 3)) / power(((extract(epoch from (timezone('utc', now()) - published)) / 3600) + 2), 1.8))::integer;
end;
$$ language plpgsql;

This function is called in the post_aggregates_view:

post_aggregates_view Query

create view post_aggregates_view as
select
	p.*,
	-- creator details
	u.actor_id as creator_actor_id,
	u."local" as creator_local,
	u."name" as creator_name,
  u.published as creator_published,
	u.avatar as creator_avatar,
  u.banned as banned,
  cb.id::bool as banned_from_community,
	-- community details
	c.actor_id as community_actor_id,
	c."local" as community_local,
	c."name" as community_name,
	c.removed as community_removed,
	c.deleted as community_deleted,
	c.nsfw as community_nsfw,
	-- post score data/comment count
	coalesce(ct.comments, 0) as number_of_comments,
	coalesce(pl.score, 0) as score,
	coalesce(pl.upvotes, 0) as upvotes,
	coalesce(pl.downvotes, 0) as downvotes,
	hot_rank(
		coalesce(pl.score, 0), (
			case
				when (p.published < ('now'::timestamp - '1 month'::interval))
				then p.published
				else greatest(ct.recent_comment_time, p.published)
			end
		)
	) as hot_rank,
	(
		case
			when (p.published < ('now'::timestamp - '1 month'::interval))
			then p.published
			else greatest(ct.recent_comment_time, p.published)
		end
	) as newest_activity_time
from post p
left join user_ u on p.creator_id = u.id
left join community_user_ban cb on p.creator_id = cb.user_id and p.community_id = cb.community_id
left join community c on p.community_id = c.id
left join (
	select
		post_id,
		count(*) as comments,
		max(published) as recent_comment_time
	from comment
	group by post_id
) ct on ct.post_id = p.id
left join (
	select
		post_id,
		sum(score) as score,
		sum(score) filter (where score = 1) as upvotes,
		-sum(score) filter (where score = -1) as downvotes
	from post_like
	group by post_id
) pl on pl.post_id = p.id
order by p.id;

and the comment_aggregates_view:

comment_aggregates_view Query

create view comment_aggregates_view as
select
	ct.*,
	-- post details
	p."name" as post_name,
	-- community details
	c.actor_id as community_actor_id,
	c."local" as community_local,
	c."name" as community_name,
	-- creator details
	u.banned as banned,
  coalesce(cb.id, 0)::bool as banned_from_community,
	u.actor_id as creator_actor_id,
	u.local as creator_local,
	u.name as creator_name,
  u.published as creator_published,
	u.avatar as creator_avatar,
	-- score details
	coalesce(cl.total, 0) as score,
	coalesce(cl.up, 0) as upvotes,
	coalesce(cl.down, 0) as downvotes,
	hot_rank(coalesce(cl.total, 0), ct.published) as hot_rank
from comment ct
left join post p on ct.post_id = p.id
left join community c on p.community_id = c.id
left join user_ u on ct.creator_id = u.id
left join community_user_ban cb on ct.creator_id = cb.user_id and p.id = ct.post_id and p.community_id = cb.community_id
left join (
	select
		l.comment_id as id,
		sum(l.score) as total,
		count(case when l.score = 1 then 1 else null end) as up,
		count(case when l.score = -1 then 1 else null end) as down
	from comment_like l
	group by comment_id
) as cl on cl.id = ct.id;

The Important Points

In the comment_aggregates_view, the published time passed to hot_rank is the comment's own published date.

In the post_aggregates_view, the published time is treated as the most recent comment time, unless the thread is older than 1 month.
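The post rule can be sketched in Python (a modeling sketch; effective_published is a hypothetical helper name, and the '1 month' interval is approximated as 30 days):

```python
from datetime import datetime, timedelta
from typing import Optional

def effective_published(post_published: datetime,
                        recent_comment_time: Optional[datetime],
                        now: datetime) -> datetime:
    # Posts older than a month keep their own published time, so very old
    # threads cannot be bumped back up by a stray new comment.
    if post_published < now - timedelta(days=30):
        return post_published
    # Otherwise the most recent comment (if any) acts as the published time.
    if recent_comment_time is None:
        return post_published
    return max(recent_comment_time, post_published)
```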

There is one more point: the post_aggregates_view is not used directly. An intermediate table is updated from this view whenever a comment is made, via the refresh_comment trigger. That trigger stops updating posts older than 1 week and forces their rank to $0$.

Proposals to modify

Using Average Comment Age

This is what that would look like:

(plot: rank over time using the average comment age)

Using An Exponential Decay

Modeling an exponential decay for the effect new comments have on a post's ranking, the adjustment to the published time is calculated as

  $\text{published-adjustment} = \text{decay-hours} \times \left(1 - e^{\frac{\ln(0.8)}{3600 \times \text{factor-hours}} \times \text{time-since-first-published}}\right)$

This results in a much nicer decay of post rank as comments accumulate

(plot: rank over time with exponential decay)

Note that the two lines labeled just "Score: 100" and "Score: 1000" are drawn assuming that no new comments are received; under the current algorithm, any new comment would reset the published age.

A target decay_hours of 24 with factor_hours of 6 leads to a nice prolonging of rank that isn't the flat bar of the current algorithm.

Future Development

Now that the immediate problem is solved, we can take some time to fully explore possible solutions for future changes. Part of that is creating models for predicted patterns of behavior, in particular patterns which fall outside of typical norms, and testing how different algorithms act under these conditions. Among other goals, this page aims to catalogue possibilities for these models.

Activity Patterns

'Novelty' communities

Small communities which occasionally see a massive influx of traffic when they are linked in a larger community. This overlaps heavily and is perhaps even indistinguishable from the standard activity pattern of any niche community—goals here should be to avoid disrupting typical community activity while still ensuring that new visitors see the content they're expecting.

POSSIBLE SOLUTIONS: Standard log sorting should handle this fairly well. This is likely to be a major factor in selecting constants for log decay and vote gravity, as it will be more sensitive to changes than more normal behavior.

'Fandom' communities

This includes any community centered around an external source rather than its own content. An archetypal example is a community discussing a TV show: weekly episode threads receive heavy activity all week, alongside smaller posts that follow typical activity patterns. This could also apply to daily/weekly discussion threads or similar, which would see less of an initial spike.

POSSIBLE SOLUTIONS: Before anything else, the pre-existing solution of simply stickying these posts should be mentioned. It may be foolish to even attempt to account for these posts in a generalized algorithm.

If we do try to, though, one way of approaching the problem could be to boost posts based on activity compared to a dynamic community norm. Posts which receive a disproportionate amount of sustained activity are granted an increased amount of staying power in the ranking system, perhaps separately rather than in combination with post votes.
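One way such a boost could be sketched (everything here is hypothetical: the function name, the cap, and using the community's mean per-post comment rate as the norm):

```python
from statistics import mean

def activity_boost(post_hourly_comments: float,
                   community_hourly_rates: list[float],
                   cap: float = 3.0) -> float:
    # Compare a post's recent comment rate to the community's typical
    # per-post rate; cap the multiplier so a single viral post cannot
    # pin itself to the top of the listing indefinitely.
    norm = max(mean(community_hourly_rates), 1e-9)
    return min(post_hourly_comments / norm, cap)
```

The cap is the key design choice: it grants sustained-activity posts extra staying power without letting them fully override vote-based ranking.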

'News' communities

While most activity will follow site norms, occasional very-prominent stories will draw near-constant traffic for a sustained amount of time and dominate discussion for this duration. This also matches some sports forum activity, analogous to live-threads. Another thing to consider in this category is posts which receive sporadic or singular updates adding new facts and bringing a new surge in activity; we ideally want to allow these updates to return a post to prominence without overwhelming the usual ranking pattern.

POSSIBLE SOLUTIONS: The above solution of comparing activity to community norms would also help here, although as this strategy becomes generalized it becomes more necessary to determine a specific approach.

Statistical modeling & testing

code available in snippets