LyricWiki:Bot Portal
Overview
This is a page where those with the knowledge can share regex, code, etc.
Please leave related messages and requests for help on the Bot Portal talk page.
Process To Make a New Bot
Getting a bot account
The only difference a bot 'flag' makes to your account is that your changes won't show up on the Recent Changes list by default. This reduces clutter, since bots tend to make a ton of edits.
Once you've finished writing your bot, the general process we follow (a la Wikipedia) is to run the bot for about a day while it is still a "normal" user account at a relatively slow pace (maybe one request per minute) to allow the whole community to see the changes in the Recent Changes list. This gives a chance to have a lot of eyeballs looking at the changes to catch the occasional bug (we've all been known to write those from time to time ;)). After it looks good, just leave a message for Sean or another Bureaucrat and we'll give you a bot flag.
Please limit queries of the server to once every 1-2 seconds, and page changes to once every 10-20 seconds, depending on server load. With replace.py, which is part of PyWikipediaBot, this can be done by adding the arguments -sleep:2 -pt:20. In custom scripts based on PyWikipediaBot, you need to call wikipedia.handleArgs() to allow the use of -pt:, or set the delay with wikipedia.put_throttle.setDelay(20, absolute = True). Alternatively, you can edit config.py and change the put_throttle option to alter the behavior globally.
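For scripts that are not built on PyWikipediaBot, the same kind of pacing can be approximated by hand. The following is only a minimal sketch in Python 3 and is not PyWikipediaBot's own throttle; the Throttle class and its clock and sleep parameters are hypothetical names introduced for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay    # minimum number of seconds between actions
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous action, None before the first

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


read_throttle = Throttle(2)    # queries: at most one every 2 seconds
write_throttle = Throttle(20)  # page changes: at most one every 20 seconds
```

A script would call read_throttle.wait() before each query and write_throttle.wait() before each save. Keeping two separate throttles mirrors the split between -sleep and -pt.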
Code
Please put suggestions, snippets and code (that has been tried and tested) in this section. Especially welcome would be examples of code commonly used on wikis, particularly here at LyricWiki, that others may find of use. Do not take anything for granted - what might be obvious to one, may not be so to others. Thanks!
Pywikipedia settings
Settings in user-config.py:
family = 'lyricwiki'
mylang = 'en'
usernames['lyricwiki']['en'] = u'Username'
Changes to families/lyricwiki_family.py:
class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'lyricwiki'
        self.langs = {
            'en': 'lyrics.wikia.com',
        }
        self.namespaces[4] = {
            '_default': [u'LyricWiki', self.namespaces[4]['_default']],
        }
        self.namespaces[5] = {
            '_default': [u'LyricWiki talk', self.namespaces[5]['_default']],
        }

    def version(self, code):
        return "1.14.0"

    def scriptpath(self, code):
        return

    def path(self, code):
        return '/index.php'

    def apipath(self, code):
        return '/api.php'

    def disambcategory(self, code):
        return "Category:Disambiguation_Page"
Fixing Broken NOTOC and NOEDITSECTION
Originally posted by Redxx @22:39, 12 October 2008 (UTC):
python replace.py -start:Frank_Sinatra -sleep:1 -pt:10 " NOTOC " "__NOTOC__"
python replace.py -start:Frank_Sinatra -sleep:1 -pt:10 " NOEDITSECTION " "__NOEDITSECTION__"
I successfully used the above to replace the underscores that had been accidentally removed from either side of the NOTOC and NOEDITSECTION on the Frank Sinatra album pages. Although I decided not to do it this way (and therefore did not test this), I believe the regex equivalent is:
python replace.py -start:Frank_Sinatra -regex "\s{2}NO(TOC|EDITSECTION)\s{2}" "__NO\1__"
- And it might be safer to just replace: ([A-Z]{3,11}) by (TOC|SECTION). --Mischko 07:56, 18 October 2008 (UTC)
- Thanks Mischko, I have updated example. ♫♫Яєdxx ♪♫♪♫♪ Actions Words 00:55, 7 August 2009 (UTC)
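The regex variant can be tried out locally before running it against the wiki. The following is a minimal sketch in Python 3, assuming the two magic words are NOTOC and NOEDITSECTION and that exactly two whitespace characters sit where each pair of underscores used to be; the name restore_magic_words is made up for this example.

```python
import re

# Two whitespace characters on each side stand in for the lost underscores.
BROKEN_MAGIC_WORD = re.compile(r"\s{2}NO(TOC|EDITSECTION)\s{2}")


def restore_magic_words(text):
    """Put the double underscores back around NOTOC and NOEDITSECTION."""
    return BROKEN_MAGIC_WORD.sub(r"__NO\1__", text)
```

Note that \s also matches line breaks, so a magic word surrounded by blank lines would have those line breaks replaced as well. Restricting the pattern to literal spaces may be safer on pages where that matters.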
Fixing Song Page Ranking
Originally posted by team a @07:15, 25 October 2008 (UTC):
This command is designed to convert all song pages in Category:Review Me to pages with Green stars, as per LyricWiki:Page ranking. The pages are removed from the category, and "star=Green" is added to the page's {{Song}} template.
Basic Form
This is the stripped-down version, to explain the basics. It assumes that Category:Review Me appears at the top of the page above {{Song}}, and that the page does not already have a star of any kind. It captures everything starting with {{Song up until, but not including, the }} at the end of the template, adds "|star=Green}}" to the end, then puts it back into the page.
replace.py -regex "\[\[Category:Review[\s_]Me\]\]\s*(\{\{Song\|[^|]*\|[^}|]*)\}\}" "\1|star=Green}}"
Limitations: Doesn't work with songs with featured artists, or that aren't in Category:Review Me but are missing a star, or are in Category:Review Me despite the fact that they have a star.
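To see what the basic form does to a page, the same substitution can be run on a sample string. This is a sketch in Python 3 using the re module rather than replace.py. It assumes the pattern opens with an escaped "[[" and that the second template parameter may be any length; the page text and the name add_green_star are invented for the example.

```python
import re

BASIC = re.compile(
    r"\[\[Category:Review[\s_]Me\]\]\s*"   # the category link and any whitespace after it
    r"(\{\{Song\|[^|]*\|[^}|]*)"           # {{Song|album|artist, without the closing braces
    r"\}\}"
)


def add_green_star(text):
    """Drop the Review Me category and append |star=Green to {{Song}}."""
    return BASIC.sub(r"\1|star=Green}}", text)


page = "[[Category:Review Me]]\n{{Song|Some Album (2000)|Some Artist}}\nLyrics..."
print(add_green_star(page))
```

A page whose template carries a featured artist, such as {{Song|A|B|fa=C}}, is left untouched, which matches the limitation noted above.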
Advanced Form
This is the form that I actually use. It can deal with featured artists, Green and Black (if there are any) stars, and includes some redundancies in case the {{Song}} is broken, i.e. missing end brackets ([^|{}]). It also deals with songs that aren't in Category:Review Me (and I've found some). This should help fix some user errors, or confusions with the Page Ranking Policy.
replace.py -regex "(?:\[\[Category:Review[\s_]Me\]\])?\s*(\{\{Song(?:\|[^|{}]*){2}(?:\|fa\d?=[^{}|]*)*)(\|star=(Black|Green))?\}\}" "\1|star=Green}}"
Limitations: Can't deal with Category:Review Me if it occurs after the song template (which it shouldn't, as it was added to the top automatically), and can't remove pages with other star ratings from Category:Review Me.
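The advanced form can be checked the same way. The sketch below is Python 3 with the re module, not replace.py itself, and assumes the template name is followed directly by a pipe (no space after Song). The sample strings and the name normalise_star are invented for illustration.

```python
import re

ADVANCED = re.compile(
    r"(?:\[\[Category:Review[\s_]Me\]\])?\s*"               # optional category link
    r"(\{\{Song(?:\|[^|{}]*){2}(?:\|fa\d?=[^{}|]*)*)"       # album, artist, featured artists
    r"(\|star=(Black|Green))?\}\}"                          # optional existing Black/Green star
)


def normalise_star(text):
    """Give the {{Song}} template a Green star, replacing Black or Green."""
    return ADVANCED.sub(r"\1|star=Green}}", text)


print(normalise_star("[[Category:Review Me]]\n{{Song|Album|Artist|fa=Guest}}"))
print(normalise_star("{{Song|Album|Artist|star=Black}}"))
```

Because the pattern begins with an optional group followed by \s*, it also swallows any whitespace that directly precedes the template, so it is worth previewing the diff on a page where {{Song}} is not at the very top.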
Thanks very much to Aquatiki and Senvaikis for their help with regex.
See the development of this topic in Notorious' Archive, but please leave comments/corrections/suggestions here, on the Bot Portal talk page. Thanks. -team a
Move Genre
Originally posted by team a @07:48, 3 November 2008 (UTC):
This pywikipedia command edits all pages in one genre, removing them from that genre and adding them to a second genre instead. I'm posting it here as an example of how an unescaped pipe (| rather than \|) can combine what would otherwise be multiple regex statements. It can deal with both artist and album pages.
General Form
replace.py -sleep:2 -pt:20 -regex -cat:Genre/Hip-Hop "(\|\s*[Gg]enre\s*=\s*|\|\s*genre2\s*=\s*|\[\[Category:Genre/)FIRST GENRE" "\1SECOND GENRE"
Example
Move all pages in Category:Genre/Hip-Hop to Category:Genre/Hip Hop:
replace.py -sleep:2 -pt:20 -regex -cat:Genre/Hip-Hop "(\|\s*[Gg]enre\s*=\s*|\|\s*genre2\s*=\s*|\[\[Category:Genre/)Hip-Hop" "\1Hip Hop"
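The effect of the alternation can be previewed on sample wikitext. This is a small sketch in Python 3 using the re module instead of replace.py; the function name move_genre and the sample text are made up for the example.

```python
import re


def move_genre(text, old, new):
    """Rename a genre in genre=, genre2= and Category:Genre/ in one pass."""
    pattern = re.compile(
        # Three alternatives joined by unescaped pipes, captured as one group.
        r"(\|\s*[Gg]enre\s*=\s*|\|\s*genre2\s*=\s*|\[\[Category:Genre/)"
        + re.escape(old)
    )
    # A function replacement avoids having to escape backslashes in `new`.
    return pattern.sub(lambda m: m.group(1) + new, text)


sample = "{{Artist|genre = Hip-Hop}}\n[[Category:Genre/Hip-Hop]]"
print(move_genre(sample, "Hip-Hop", "Hip Hop"))
```

Whichever alternative matched is kept through the captured group, so only the genre name itself changes. The pattern is not anchored after the genre name, so a longer genre that merely starts with the old name would be altered too.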
Originally posted by Hs @16:33, 10 January 2009 (UTC):
Regular expressions (in Python) for three cases: The language key exists and is empty, the language key exists and is filled, or no language key exists.
_RE_LANG_EMPTY = re.compile(r'(\{\{\s*SongFooter(?:\s+\|.*?)*?\|\s*language\s*=\s*?)((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)
_RE_LANG_EXISTENT = re.compile(r'(\{\{\s*SongFooter\s+(?:\|.*?)*?\|\s*language\s*=\s*).*?((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)
_RE_NO_LANG = re.compile(r'(\{\{\s*SongFooter\s+(?:\|.*?)*?)((?:[\r\n]+\s*)?\}\})', re.DOTALL)
And the corresponding function to set the language. Set force to True to overwrite an existing language value.
def _setLanguage(self, text, language, force=False):
    """Set the language in the SongFooter template. Set force to True to
    override an already set language. Returns None on error.
    """
    count = 0
    if force:
        (text, count) = _RE_LANG_EXISTENT.subn("\\1%s\\2" % language, text)
    else:
        (text, count) = _RE_LANG_EMPTY.subn("\\1%s\\2" % language, text)
    # Neither pattern matched: assume there is no language key and add one.
    if not count:
        (text, count) = _RE_NO_LANG.subn("\\1\n|language = %s\\2" % language, text)
    if count:
        return text
    else:
        return None
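The three cases can be exercised on sample templates. The sketch below is a standalone Python 3 adaptation of the code above: the regexes are copied unchanged, while the free function set_language (without self) and the sample wikitext are assumptions made so the example runs on its own.

```python
import re

_RE_LANG_EMPTY = re.compile(r'(\{\{\s*SongFooter(?:\s+\|.*?)*?\|\s*language\s*=\s*?)((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)
_RE_LANG_EXISTENT = re.compile(r'(\{\{\s*SongFooter\s+(?:\|.*?)*?\|\s*language\s*=\s*).*?((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)
_RE_NO_LANG = re.compile(r'(\{\{\s*SongFooter\s+(?:\|.*?)*?)((?:[\r\n]+\s*)?\}\})', re.DOTALL)


def set_language(text, language, force=False):
    """Return text with the SongFooter language set, or None if nothing matched."""
    if force:
        text, count = _RE_LANG_EXISTENT.subn("\\1%s\\2" % language, text)
    else:
        text, count = _RE_LANG_EMPTY.subn("\\1%s\\2" % language, text)
    if not count:
        text, count = _RE_NO_LANG.subn("\\1\n|language = %s\\2" % language, text)
    return text if count else None


empty = "{{SongFooter\n|artist = X\n|language = \n|song = Y\n}}"
missing = "{{SongFooter\n|artist = X\n|song = Y\n}}"
filled = "{{SongFooter\n|artist = X\n|language = French\n|song = Y\n}}"

print(set_language(empty, "English"))               # fills the empty key
print(set_language(missing, "English"))             # appends a new key
print(set_language(filled, "English", force=True))  # overwrites French
```

Text without a SongFooter template yields None, so callers can tell "nothing changed" apart from a successful edit.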
TitleCase function
Originally posted by Hs @16:38, 10 January 2009 (UTC):
Regex to match the beginning of a word
_RE_TOUPPER = re.compile(r'(^|\s|[\"\(\[])(\w)', re.UNICODE)
And the corresponding function. name must be a Unicode object so that upper() also works for non-ASCII characters.
def TitleCase(name):
    return _RE_TOUPPER.sub(lambda match: match.group(1) + match.group(2).upper(), name)
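A few sample titles show where the regex does and does not capitalise. This is a Python 3 sketch, where every str is already Unicode; the regex is copied from above and title_case is just a lower-case alias chosen for the example.

```python
import re

_RE_TOUPPER = re.compile(r'(^|\s|[\"\(\[])(\w)', re.UNICODE)


def title_case(name):
    """Upper-case the first letter after start of string, whitespace, or one of " ( [."""
    return _RE_TOUPPER.sub(lambda m: m.group(1) + m.group(2).upper(), name)


for title in ("the sound of silence", "ça ira (live)", "don't stop"):
    print(title_case(title))
```

Unlike Python's built-in str.title(), this leaves the letter after an apostrophe alone, so "don't stop" becomes "Don't Stop" rather than "Don'T Stop". Letters after a hyphen are also left as they are.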
Asynchronous writes in pywikipedia
Originally posted by Hs @21:20, 11 February 2009 (UTC):
pywikipedia has the method Page.put_async(text), which lets you write a page in the background. To limit the size of the queue of pages waiting to be written, put something like this at the beginning of your code:
import wikipedia
import Queue
# ...
wikipedia.page_put_queue = Queue.Queue(5)
This way, if you have a sensible put throttle (e.g. 20 seconds), you can have slight concurrency aligned to the writing speed.
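The effect of capping the queue can be illustrated without pywikipedia. The following is a minimal sketch in Python 3 using only the standard queue and threading modules; it is not how pywikipedia implements put_async, and run_bounded_writer and save are hypothetical names.

```python
import queue
import threading


def run_bounded_writer(pages, save, maxsize=5):
    """Feed `pages` through a bounded queue to one background writer thread."""
    pending = queue.Queue(maxsize)  # put() blocks while maxsize items are waiting
    done = object()                 # sentinel telling the writer to stop

    def worker():
        while True:
            item = pending.get()
            if item is done:
                break
            save(item)              # in a real bot this is the slow, throttled write

    writer = threading.Thread(target=worker)
    writer.start()
    for page in pages:
        pending.put(page)           # the producer stalls here when the queue is full
    pending.put(done)
    writer.join()


written = []
run_bounded_writer(range(20), written.append, maxsize=5)
```

With a bounded queue the code that prepares pages can run at most a handful of pages ahead of the writer, which keeps memory use small and means little work is lost if the bot is interrupted.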
Orphaned pages for one artist
#!/usr/bin/python
#
# Find orphaned song pages on lyrics.wikia.com by comparing the list of pages
# with prefix <Artist>: with what is actually linked on page <Artist>.
#
# This will not work for artist pages that have been split onto several pages,
# like Rolling_Stones. Also not all pages need to be linked, e.g. translations.
import wikipedia
from wikipedia import Page
from pagegenerators import PrefixingPageGenerator
def usage():
    print("""
Usage: ./orphans.py Artist
""")

def orphans(artist):
    prefix = artist + ":"
    allPages = PrefixingPageGenerator(prefix, includeredirects = False)
    allPages = set(map(lambda p: p.title(), allPages))
    site = wikipedia.getSite()
    artistPage = Page(site, artist)
    linkedPages = artistPage.linkedPages()
    linkedPages = map(lambda p: p.title(), linkedPages)
    linkedPages = filter(lambda s: s.startswith(prefix), linkedPages)
    linkedPages = set(linkedPages)
    orphanedPages = list(allPages.difference(linkedPages))
    orphanedPages.sort()
    print("\n".join(orphanedPages))

def main():
    argv = wikipedia.handleArgs()
    if len(argv) != 1:
        usage()
        return
    orphans(unicode(argv[0])) # hopefully this is correct

if __name__ == "__main__":
    try:
        main()
    finally:
        wikipedia.stopme()
--Hfs·☏·✎ 22:37, June 19, 2010 (UTC)
- I changed the code for output…
print("\n".join(orphanedPages))
- …to…
for songPage in orphanedPages:
    s = songPage.encode("Latin-1")
    print("# '''[[" + s + "|" + s[len(artist)+1:] + "]]'''")
- This creates a list I can paste directly into the OS section (after de-wrapping the occasional long line) and fixes problems with non-ascii chars. You might have to change "Latin-1" to whatever charset your terminal uses.
- CAVEAT: I don't know the first thing about Python and arrived at the above code by googling. It might be horrendously wrong and/or an insult to all things Python for all I know. Works for me though. — 6×9 (Talk) 18:06, December 9, 2011 (UTC)
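The core of the script is a set difference between "all pages with the artist prefix" and "pages linked from the artist page". The sketch below shows just that step in Python 3 on plain lists of titles, with no wiki access; find_orphans, as_wikitext and the sample titles are invented for illustration.

```python
def find_orphans(all_titles, linked_titles, artist):
    """Return sorted titles with the artist prefix that the artist page does not link."""
    prefix = artist + ":"
    candidates = {t for t in all_titles if t.startswith(prefix)}
    linked = {t for t in linked_titles if t.startswith(prefix)}
    return sorted(candidates - linked)


def as_wikitext(orphans, artist):
    """Format orphans as numbered-list lines, showing only the song name as link text."""
    return ["# '''[[%s|%s]]'''" % (t, t[len(artist) + 1:]) for t in orphans]


all_titles = ["Abba:Waterloo", "Abba:Fernando", "Abba:Chiquitita"]
linked_titles = ["Abba:Waterloo", "Queen:Bohemian Rhapsody"]
orphans = find_orphans(all_titles, linked_titles, "Abba")
print("\n".join(as_wikitext(orphans, "Abba")))
```

In Python 3 the titles are Unicode strings throughout, so no explicit encoding step is needed before printing on a terminal with a matching locale.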
Other Languages
Perl
- TODO: Make Sean's framework and tutorial available
PHP
Python
- See PyWikipediaBot above.