feature/keyword-filtering-main #296

Open
zongor wants to merge 11 commits from zongor/lemmy-hexbear:feature/keyword-filtering-main into main
zongor commented 1 year ago
There is no content yet.
zongor added 3 commits 1 year ago
Poster

ready for review
Poster
Owner

LGTM. We should hold off on the merge until we can get an endpoint to change this + some frontend functionality, though.
zongor added 1 commit 1 year ago
zongor added 1 commit 1 year ago
zongor added 1 commit 1 year ago
ryexandrite requested changes 1 year ago
Dismissed
server/lemmy_api_structs/src/user.rs Outdated
pub has_2fa: bool,
pub auth: String,
pub inbox_disabled: bool,
pub filter_keywords: Vec<String>,
Poster
Owner

Should this perhaps be an `Option<Vec<String>>`?

This would let setting an empty array clear the keywords, while setting null/None would leave the keywords unmodified. I believe this would make working with this API slightly easier, since you wouldn't need to know all the keywords to edit the settings non-destructively.
zongor commented 1 year ago
Poster

The original spec called for a `Vec<String>`. I think this would be a good change; will do.
ryexandrite marked this conversation as resolved
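For illustration, a minimal sketch of the field with the suggested `Option` wrapper, using the fields visible in the snippet above (the struct name and derives are assumptions, not taken from this PR):

```rs
use serde::{Deserialize, Serialize};

// Hypothetical settings payload; only `filter_keywords` changes here.
#[derive(Serialize, Deserialize)]
pub struct SaveUserSettings {
  pub has_2fa: bool,
  pub auth: String,
  pub inbox_disabled: bool,
  // None     => leave the stored keywords untouched
  // Some([]) => clear all keywords
  // Some(ks) => replace the stored keywords with `ks`
  pub filter_keywords: Option<Vec<String>>,
}
```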
server/lemmy_db/src/schema.rs Outdated
table! {
hexbear.user_filter (user_id) {
user_id -> Int4,
keywords -> Array<Text>,
Poster
Owner

I'm not seeing a place where this type is defined. Do we know for sure that this is the Diesel PG `Array` type and will get mapped correctly?

https://docs.diesel.rs/diesel/pg/types/sql_types/struct.Array.html
zongor commented 1 year ago
Poster

This was auto-generated by Diesel's migration tooling, so I would assume it is correct?
Poster
Owner

OK, I think we can trust it in that respect if things compile.
ryexandrite marked this conversation as resolved
server/lemmy_db/src/user_view.rs Outdated
RegexBuilder::new(i)
.case_insensitive(true)
.build()
.expect("Invalid Regex")
Poster
Owner

I don't like that this can panic if the user enters something that breaks the regex. Can you map the error so bad regexes are ignored instead?
zongor commented 1 year ago
Poster

Dug around in the Rust docs and found `filter_map`, which should do this.
ryexandrite marked this conversation as resolved
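For reference, a minimal sketch of the `filter_map` approach (function and variable names are illustrative assumptions, not the exact code from the commit):

```rs
use regex::{Regex, RegexBuilder};

// Build a case-insensitive regex per keyword, silently skipping any keyword
// that fails to compile instead of panicking.
fn build_filters(keywords: &[String]) -> Vec<Regex> {
  keywords
    .iter()
    .filter_map(|k| {
      RegexBuilder::new(k)
        .case_insensitive(true)
        .build()
        .ok() // Err -> None, so invalid patterns are dropped
    })
    .collect()
}
```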
-- This file should undo anything in `up.sql`
Poster
Owner

Is this migration generated to run after all current migrations? I know dash added some fields to the user view recently, and we need to be sure this migration takes those into account.
zongor commented 1 year ago
Poster

No, this is very old. I will try to create a new one.
zongor commented 1 year ago
Poster

I double-checked it, and from what I can tell the only newer migration has changes in unrelated tables, so this one should be fine. If not, I'll try to figure something out.
ryexandrite marked this conversation as resolved
server/src/api/user.rs Outdated
use lemmy_api_structs::{user::*, APIError};
use lemmy_db::{
comment::*,
Poster
Owner

These formatting changes can really mess up merges and blame; can you run `cargo fmt` and fix these?
zongor commented 1 year ago
Poster

I had to pass `+nightly` for my rustfmt to work; will push with the fixes.
ryexandrite marked this conversation as resolved
server/src/api/user.rs Outdated
None => read_user.password_encrypted,
};
let keywords = data.filter_keywords.clone();
Poster
Owner

Can there be a check here to ensure that the current keywords will compile to a valid regex? It could be done naively, I suppose, by checking for special characters that may affect how the regex builds and either removing them or reporting an error.

Alternatively, the special characters need to be escaped before they are entered into the database / before they are used (never blindly trust user data). This means they may need to be unescaped before presenting them to the user, or they might get re-escaped on submit.
zongor commented 1 year ago
Poster

I'm going to check and filter against the regex; if it returns `Ok` then I'll send it to the DB.
Poster
Owner

We will need to do some testing to make sure this can't be abused; we don't want someone constructing an absurdly expensive regex.

If you can, please take a look at https://docs.rs/regex/1.4.5/regex/#untrusted-input and make sure there is a mitigation in place for simple exploits like `a{100}{100}{100}`.
ryexandrite marked this conversation as resolved
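A hedged sketch of the kind of pre-save validation discussed above, combining the `Ok` check with the size caps from the linked docs (function name and limit values are assumptions):

```rs
use regex::RegexBuilder;

// Keep only keywords that compile into a bounded regex; anything that fails
// to build, or would exceed the compiled-size caps, is dropped before it
// ever reaches the database.
fn validate_keywords(keywords: Vec<String>) -> Vec<String> {
  keywords
    .into_iter()
    .filter(|k| {
      RegexBuilder::new(k)
        .case_insensitive(true)
        .size_limit(1 << 16)     // cap the compiled program size
        .dfa_size_limit(1 << 16) // cap the lazy-DFA cache size
        .build()
        .is_ok()
    })
    .collect()
}
```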
zongor added 1 commit 1 year ago
ryexandrite requested changes 1 year ago
ryexandrite left a comment

Looking good. I think there need to be some final changes to how the comment filter is done for efficiency, but otherwise this is looking good.

server/src/api/post.rs Outdated
// filter out bad ids and their children by the filter keywords
let mut bad_comment_ids = Vec::new();
for c in sorted_comments.iter() {
Poster
Owner

This loop seems inefficient to me.

It loops over every comment even if there are no regex filters to apply. There should be a check for whether the regex vector is empty before the loop that applies the filter.

Additionally, every regex filter is applied even if a matching one has already been encountered, which can append a comment to the bad IDs multiple times.

Also, the loop over every bad comment ID to check for bad parents runs even if the comment may already be excluded on filter grounds. The order here, I think, is reversed: if we check for a bad parent first, we may not need to run the regex filter at all.

I believe we can intelligently apply Rust's `std::iter::Iterator::any` here and do some short-circuiting:

```rs
let mut bad_comment_ids = Vec::new();
if !regex.is_empty() {
  for c in sorted_comments.iter() {
    // Check for an excluded parent first so we don't run unnecessary regexes;
    // `any` short-circuits as well.
    if (c.parent_id.is_some() && bad_comment_ids.contains(&c.parent_id.unwrap()))
      || regex.iter().any(|f| f.is_match(c.content.as_ref()))
    {
      bad_comment_ids.push(c.id);
    }
  }
}
```
ryexandrite marked this conversation as resolved
server/src/api/post.rs Outdated
}
}
let filtered_comments: Vec<CommentView> = comments
Poster
Owner

This iterator, too, may run when it is not needed; we only need to iterate if there are comments to filter. A check for `bad_comment_ids.is_empty()` would be best.

```rs
let filtered_comments: Vec<CommentView> = if bad_comment_ids.is_empty() {
  comments.clone()
} else {
  comments
    .into_iter()
    .filter(|cv| !bad_comment_ids.contains(&cv.id))
    .collect()
};
```

There may or may not need to be a `clone` in the first branch.
ryexandrite marked this conversation as resolved
zongor added 1 commit 1 year ago
zongor requested review from ryexandrite 1 year ago
zongor added 1 commit 1 year ago
ryexandrite approved these changes 1 year ago
ryexandrite left a comment

LGTM

Poster
Owner

Looks good to me, the only thing I feel iffy about is allowing a (somewhat) arbitrary regex. Seems too ACE-y for me, especially since I doubt it will be legitimately utilized often. Want to get @ryexandrite's opinion on this but that's how I feel about it.
DashEightMate requested changes 1 year ago
DashEightMate left a comment

We need a couple changes here.

I just remembered - we also need to cover comments/posts coming in through Websockets as well.

server/lemmy_db/src/user_filter.rs Outdated
}
pub fn update(conn: &PgConnection, user_filter_id: i32, t: Vec<String>) -> Result<Self, Error> {
diesel::update(user_filter.find(user_filter_id))
Poster
Owner

It's impossible to update this way if the user doesn't already have a userfilters entry.
zongor commented 1 year ago
Poster

Added an `on_conflict` case to it.
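For illustration, a hedged sketch of what a Diesel `on_conflict` upsert could look like for this table (assumes the `hexbear.user_filter` schema shown earlier and a `UserFilter` model defined in this file; not the exact code that was pushed):

```rs
use diesel::prelude::*;
use diesel::result::Error;

use crate::schema::user_filter::dsl::*;

// Insert a filter row for the user, or overwrite the keywords if one already
// exists. `UserFilter` is assumed to be the Queryable model in user_filter.rs.
pub fn upsert(conn: &PgConnection, for_user_id: i32, new_keywords: Vec<String>) -> Result<UserFilter, Error> {
  diesel::insert_into(user_filter)
    .values((user_id.eq(for_user_id), keywords.eq(new_keywords.clone())))
    .on_conflict(user_id)
    .do_update()
    .set(keywords.eq(new_keywords))
    .get_result(conn)
}
```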
-- This file should undo anything in `up.sql`
DROP TABLE hexbear.user_filter
Poster
Owner

This is out of date with the base branch; it seems like it removes a field.
Poster
Owner

This is what I was worried about, good catch.
zongor commented 1 year ago
Poster

I think it's correct now?
server/src/api/post.rs Outdated
sorted_comments.sort_by(|a, b| a.parent_id.cmp(&b.parent_id));
// filter out bad ids and their children by the filter keywords
let mut bad_comment_ids = Vec::new();
Poster
Owner

Querying for comments is done in other places besides posts. I feel we should incorporate this functionality into `CommentQueryBuilder` instead so it's more universal.
server/src/api/post.rs Outdated
Err(_e) => return Err(APIError::err("couldnt_get_posts").into()),
};
let user_view = blocking(context.pool(), move |conn| {
Poster
Owner

Again - we should incorporate this into `PostQueryBuilder` instead.
server/src/api/post.rs Outdated
let filtered_posts: Vec<PostView> = posts
.into_iter()
.filter(|pv| {
regex.iter().all(|i| {
Poster
Owner

It would be more performant to do an `any()` check instead (i.e. any match).
Poster
Owner

Yes, same suggestion I made for the comments, good catch.
zongor commented 1 year ago
Poster

Missed this, updated now.
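A hedged sketch of the `any()` version for posts, dropping a post as soon as one filter matches; the exact `PostView` fields checked here are assumptions based on the snippet above:

```rs
// Drop a post as soon as any filter regex matches its title or body.
let filtered_posts: Vec<PostView> = posts
  .into_iter()
  .filter(|pv| {
    !regex.iter().any(|f| {
      f.is_match(&pv.name) || pv.body.as_ref().map_or(false, |b| f.is_match(b))
    })
  })
  .collect();
```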
server/src/api/user.rs Outdated
}
blocking(context.pool(), move |conn| {
UserFilter::update(conn, user_id, ok_keywords)
Poster
Owner

This goes with my previous comment about how `update()` doesn't work on a user without a userfilters entry.
Poster
Owner

I'm as of yet unconvinced on the regex functionality. I'd be a lot more comfortable with stripping grouping regex characters and only allowing `[]\.*+?`. Rust has the memory safety to prevent large regexes from being compiled and eating CPU and memory, but if we allow `{}` and `()` those can quickly go exponential.

Extra safe would be to not allow any of those characters, but that would greatly limit the power of the filter.

Thoughts @DashEightMate?
Poster
Owner

I say we don't do any regex. Just do simple plaintext `contains` checks.
Poster
Owner

@DashEightMate by a plaintext contains check, do you mean in the SQL query itself?

If so, I'm not sure how I feel about that. A SQL `ILIKE` is a heavy performance hit for the database; it makes more sense to do it with regex on the Rust side.

And if Rust is building the regex anyway... I think limiting groups, and thus recursive regexes, can be sufficiently safe and performant.
Poster
Owner

@zongor @DashEightMate I think the one main improvement is that instead of compiling multiple regexes, they should be concatenated into one large regex and compiled once.

This means the DFA only checks each character of the text once instead of N times.

Something like this:

```rs
pub fn filter_keywords_regex(self) -> Option<Regex> {
  // Strip grouping characters from each keyword, then OR the keywords
  // together into a single alternation: (kw1)|(kw2)|...
  let pattern = format!(
    "({})",
    self
      .filter_keywords
      .iter()
      .map(|w| w.chars().filter(|c| !r#"(){}\"#.contains(*c)).collect::<String>())
      .collect::<Vec<String>>()
      .join(")|(")
  );
  RegexBuilder::new(&pattern)
    .size_limit(2048)
    .dfa_size_limit(1024)
    .case_insensitive(true)
    .build()
    .ok()
}
```

The keywords should be stripped of their grouping characters and escapes, only allowing `*+?.`
Poster
Owner

@ryexandrite sorry, I meant that instead of arbitrary regexes we should do a simpler check such as String::contains(). Although if there's a way to easily constrain regexes to simple operators that might be a better option.

@ryexandrite sorry, I meant that instead of arbitrary regexes we should do a simpler check such as `String::contains()`. Although if there's a way to easily constrain regexes to simple operators that might be a better option.
Poster
Owner

Per the Rust documentation:

https://docs.rs/regex/1.4.5/regex/#untrusted-input

> Untrusted regular expressions are handled by capping the size of a compiled regular expression. (See RegexBuilder::size_limit.) Without this, it would be trivial for an attacker to exhaust your system’s memory with expressions like a{100}{100}{100}.
>
> Untrusted search text is allowed because the matching engine(s) in this crate have time complexity O(mn) (with m ~ regex and n ~ search text), which means there’s no way to cause exponential blow-up like with some other regular expression engines. (We pay for this by disallowing features like arbitrary look-ahead and backreferences.)
>
> When a DFA is used, pathological cases with exponential state blow-up are avoided by constructing the DFA lazily or in an “online” manner. Therefore, at most one new state can be created for each byte of input. This satisfies our time complexity guarantees, but can lead to memory growth proportional to the size of the input. As a stopgap, the DFA is only allowed to store a fixed number of states. When the limit is reached, its states are wiped and continues on, possibly duplicating previous work. If the limit is reached too frequently, it gives up and hands control off to another matching engine with fixed memory requirements. (The DFA size limit can also be tweaked. See RegexBuilder::dfa_size_limit.)

`O(mn)` is an extraordinary complexity guarantee for untrusted input, and as long as we constrain the memory usage it's effectively impossible to make a regex that can cause too much harm.

I think so long as we remove the ability to use capture groups, we are more than safe. My somewhat naive attempt above to remove the ability to use groups is a good start.

The exact rules should be: no `()` or `{}`, and the last character in a word cannot be `\`. Also, each word should compile as a valid regex on its own. In that case they can be combined into one larger regex and should be quite fast.
Poster

@DashEightMate @ryexandrite
I'm leaning towards getting rid of the regex as well.

The main reason I chose regex in the first place was to handle "similar" Unicode characters: "SS", "ss", and "ß" should all be treated the same, but it turns out Rust's regex crate does not handle this.

I think we could use something like this: https://stackoverflow.com/questions/40250988/how-can-i-case-fold-a-string-in-rust
and then use a case-folded string `contains` to do the check.

Anyone have any objections to this method?
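A minimal sketch of the case-folded `contains` idea; `to_lowercase()` is used here for simplicity, and full Unicode case folding (the "ß" vs "ss" case) would need a crate like the one in the linked answer:

```rs
// Keyword check without regex: lowercase both sides, then use `contains`.
// Note: `to_lowercase()` does not fold "ß" to "ss"; a proper case-folding
// crate would be needed for that, as discussed in the linked answer.
fn matches_keyword(content: &str, keywords: &[String]) -> bool {
  let folded = content.to_lowercase();
  keywords.iter().any(|k| folded.contains(&k.to_lowercase()))
}
```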
zongor force-pushed feature/keyword-filtering-main from 33a6322fee to 9e0c1bec40 1 year ago
zongor added 1 commit 1 year ago
zongor added 1 commit 1 year ago
zongor added 1 commit 1 year ago
Poster
Owner

Seems like the comments I left have been mostly resolved. Really the only thing left in the backend is doing this filtering through websocket messages as well.
zongor added 1 commit 12 months ago

Reviewers

ryexandrite approved these changes 1 year ago
DashEightMate requested changes 1 year ago
This pull request has changes conflicting with the target branch.
server/src/routes/feeds.rs
server/lemmy_api_structs/src/site.rs
server/lemmy_db/src/post_view.rs
server/lemmy_db/src/schema.rs
server/lemmy_db/src/user_view.rs
server/src/api/comment.rs
server/src/api/site.rs
server/src/api/user.rs
server/lemmy_db/src/comment_view.rs
No Milestone
No project
No Assignees
3 Participants

Blocks
#172 Add keyword filtering UI
hexbear-collective/hexbear-frontend