I haven't played around with GPT o1; I just checked, and I don't have access. I'm not saying it's necessarily bad without having experienced it. But OpenAI has been getting steadily worse for a while, so I'm assuming that the stuff I've interacted with is indicative of the quality of the new stuff. It's all of a piece.
I've done something like this, with RSS feeds. Read !meta@rss.ponder.cat to see the existing communities, and how to add a feed to an existing community.
The concern about spam is real. A lot of these already exist (for example, one for Hacker News and a whole instance mirroring Reddit), and a lot of people, myself included, don't like them. I agree with you that it's a good idea, but it needs care so that it remains a useful seed of content and not an overwhelming spew.
You can use Creative Commons. You'll still hold the copyright to the work, so you can relicense it or do whatever you like with it, but they'll have a particular and prescribed set of things they are guaranteed to be able to do with it in perpetuity.
Choose whichever license suits what you'd like to be able to grant them, in terms of whether they have to credit you for it, whether they're allowed to modify it, and so on. CC BY lets them do whatever they want, as long as they credit you, which is a common permissive option.
What are you talking about? I just tried two test queries on DDG, and neither one had LLM-generated nonsense, and the one that was in double-quotes returned only five results, all of which had the double-quoted phrase and one of which was the thing I was challenging it to find.
Can you give an example of a query where DDG returns LLM results or doesn't respect your double-quotes?
Claude.ai is quite a bit superior to GPT in my experience. That one, I pay for, and it seems like it's worth it.
Sounds good. If you redid the import, I think you’ll want to make some manual fixes to the .json. Off the top of my head, I think you just need to add bbc.co.uk and aljazeera.com to the URL lists for those sources.
I already sent it. It's here:
https://ponder.cat/wp/wp-sources.zip
Edit: You don't need to do the import initially, since there's already a sources file with some small modifications. The import is the only complicated part. Use categorize.py to categorize a source, or lookup.py to run a quick command-line test.
On a different topic: It sounds like jordanlund is saying that if he tried to remove the MBFC bot from the politics sub, he might be removed as a moderator, and replaced with someone else, and the bot would come back.
https://lemmy.world/comment/12825768
Is that true? Is the admin team mandating the use of this bot, and if so, why?
Here you go:
https://ponder.cat/wp/wp-sources.zip
It's in Python, suitable for sticking directly into the bot if the bot is in Python. There are docs. It's a first cut. How did you envision this working? I can make a real API, if for some reason that makes things easier, but it's not immediately obvious how it would get integrated into things.
Running it on the last 50 articles posted to /c/politics, we see:
- https://lemmy.world/post/20739836: Source is unreliable since ownership change
- https://lemmy.world/post/20736298: Source is unreliable for political topics since 2011
- https://lemmy.world/post/20724155: Reliability consensus is mixed
- https://lemmy.world/post/20723675: Source is unreliable
- https://lemmy.world/post/20722912: Source is unreliable
- https://lemmy.world/post/20722910: Reliability consensus is mixed
- https://lemmy.world/post/20716118: Reliability consensus is mixed
- https://slrpnk.net/post/14127964: Reliability consensus is mixed
It's more complex to use this than MBFC, because there's a lot more depth to the rankings, and sometimes human judgement is needed to assign scores. There's a category "needinfo," meaning it's necessary to know what topic is being discussed or when an article was written, because of an ownership change or similar factor. I've applied that judgement above. That, to me, is a good thing. It means the bot is grounded in something, rather than blithely spitting out arbitrary scores with no connection to reality.
In practice, I think it would be realistic to assign a single reliability ranking to most of the "needinfo" sources. You can manually edit the .json data to do so. Almost all of the posts are going to fit into one of Wikipedia's categorizations or another. Newsweek is unreliable, The Guardian is reliable, and so on.
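As a sketch of what that manual edit could look like (the field names here are my assumptions for illustration, not the actual wp-sources.json schema):

```python
import json

# Hypothetical shape of the sources data; the real wp-sources.json schema
# may differ. Normally you'd load it with json.load() from the file.
sources = {
    "newsweek.com": {"rating": "needinfo",
                     "note": "unreliable since ownership change"},
    "theguardian.com": {"rating": "reliable"},
}

# Collapse the "needinfo" entry to a single fixed rating, on the judgement
# that post-ownership-change articles are what the bot will mostly see.
sources["newsweek.com"] = {"rating": "unreliable"}

print(json.dumps(sources, indent=2))
```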
I think most of the mixed-consensus sources can be used without a second thought. Mostly, the questions about them boil down to open partisanship of the source, which for a political community is perfectly fine as long as they're factually trustworthy.
If you want me to boil this down further, so that it gives a single "yes" or "no" score to each source, I can do that and probably keep almost all of the accuracy of the rankings, now that I've looked at it for a little while.
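A minimal sketch of what that boiled-down version might look like; the category names are illustrative, not the actual ones in the data:

```python
# Collapse the multi-level rankings into a single allow/deny decision.
# Mixed-consensus sources pass, per the reasoning above.
ALLOW = {"reliable", "mixed"}
DENY = {"unreliable", "deprecated"}

def allowed(rating: str) -> bool:
    """Return True if a source with this rating should be permitted."""
    if rating in ALLOW:
        return True
    if rating in DENY:
        return False
    # Anything else (e.g. "needinfo") still needs human judgement.
    raise ValueError(f"needs human judgement: {rating}")

print(allowed("mixed"))        # True
print(allowed("unreliable"))   # False
```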
When you talk about "adding" this to the bot, are you proposing to still have MBFC be the main source, with this as a footnote? A lot of the criticism of the bot is on the grounds that MBFC is a very bad source for judging reliability, so I would question the idea of keeping it on as the primary source.
Why is it admin level? Are there admins that tell you what you can and can't do with the politics community, in this case? Or does the politics moderation team have the ability to ditch the bot if they decide to?
This is such a strange situation. If you're stuck in that former position, though, it would make a lot of your responses in this comments section make a whole lot more sense.
You don't have to go back 20 years. They also committed a fairly big oopsie, not that long ago.
The Guardian: I don't think this one article about renters from 2020 proves its case very well. Personally, I'm not convinced. MIXED
New York Times: You really think someone would do that? Just go on the internet and tell lies? I don't think so.