mwtools
MediaWikiTools class module.
MediaWiki Tools
A high level library containing set of tools for for filtering pages using the rich data available in MediaWikis such as categories and info boxes. Uses both web-scraping and API methods (where available and feasible) to gather information.
Basic Usage
Create MediaWikiTools
object:
from mwtools import MediaWikiTools
hp_wiki = MediaWikiTools('harrypotter.fandom.com')
hp_wiki.has_api # True
Getting pages
Get page names from a category.
hp_wiki.get_pages('1980_births')
# ['Dudley Dursley',
# 'Neville Longbottom',
# 'Ernest Macmillan',
# 'Draco Malfoy',
# 'Harry Potter',
# 'Ronald Weasley']
Get pages from subcategories too.
wiki = MediaWikiTools('en.wikipedia.org')
# get pages from category that contains only subcategories
wiki.get_pages("Art_collectors_by_nationality")
# []
# get pages from first level subcategories
wiki.get_pages("Art_collectors_by_nationality", get_subcats=True)
# ['William Hayes Ackland',
# 'William Acquavella',
# 'Frederick Baldwin Adams Jr.',
# 'Marella Agnelli',
# 'Robert Agostinelli',
# ...]
Recursively get pages from subcategories. This can be quite slow for deep subcategory trees and may break if loops are present.
wiki.get_pages("Art_collectors_by_nationality",
get_subcats=True,
recursive=True)
# ['Very',
# 'Long,
# 'List']
Get pages from subcategories as a dictionary containing subcategories as keys.
'self'
contains the pages in the root category. Using recursive
in
combination with the with_subcats
results in a nested dictionary.
# get first level subcats in dict
wiki.get_pages("Art_collectors_by_nationality",
get_subcats=True, with_subcats=True)
# {'self': [],
# 'American art collectors': ['William Hayes Ackland',
# 'William Acquavella',
# 'Frederick Baldwin Adams Jr.',
# ...],
# ...
# 'Venezuelan art collectors': ['Gustavo Cisneros',
# 'Patricia Phelps de Cisneros',
# 'Nina Fuentes'],
# 'Yugoslav art collectors': ['Antun Bauer (museologist)', 'Erich Šlomović']}
Getting sets
Get an intersection of 2 or more categories.
hp_wiki.get_set(['1980_births', 'Hogwarts_dropouts'], operations='&')
# ['Harry Potter', 'Ronald Weasley']
hp_wiki.get_set(['1980_births',
'Hogwarts_dropouts',
'Green-eyed_individuals'],
operations='intersection')
# ['Harry Potter']
Get a union of 2 or more categories.
wiki = MediaWikiTools('en.wikipedia.org')
# get the union of countries in Europe and Asia and save it
countries = wiki.get_set(['Countries in Asia', 'Countries_in_Europe'],
operations='or')
print(countries)
# ['Cyprus',
# 'Pakistan',
# 'Croatia',
# ...
# 'Belarus',
# 'Bangladesh',
# 'Lithuania']
Chaining operations.
# intersect the saved list with a different category
wiki.get_set(['Russian-speaking_countries_and_territories', countries],
operations='&')
# ['Kyrgyzstan',
# 'Moldova',
# 'Russia',
# 'Armenia',
# 'Tajikistan',
# 'Belarus',
# 'Azerbaijan',
# 'Uzbekistan',
# 'Mongolia',
# 'Kazakhstan']
Without any saved variables. The number of operations must equal the number of categories minus one.
wiki.get_set(['Countries in Asia',
'Countries_in_Europe',
'Russian-speaking_countries_and_territories'],
operations=['or', 'and'])
# same as above
View Source
"""MediaWikiTools class module. # MediaWiki Tools A high level library containing set of tools for for filtering pages using the rich data available in MediaWikis such as categories and info boxes. Uses both web-scraping and API methods (where available and feasible) to gather information. # Basic Usage Create `MediaWikiTools` object: ```python from mwtools import MediaWikiTools hp_wiki = MediaWikiTools('harrypotter.fandom.com') hp_wiki.has_api # True ``` ## Getting pages Get page names from a category. ```python hp_wiki.get_pages('1980_births') # ['Dudley Dursley', # 'Neville Longbottom', # 'Ernest Macmillan', # 'Draco Malfoy', # 'Harry Potter', # 'Ronald Weasley'] ``` Get pages from subcategories too. ```python wiki = MediaWikiTools('en.wikipedia.org') # get pages from category that contains only subcategories wiki.get_pages("Art_collectors_by_nationality") # [] # get pages from first level subcategories wiki.get_pages("Art_collectors_by_nationality", get_subcats=True) # ['William Hayes Ackland', # 'William Acquavella', # 'Frederick Baldwin Adams Jr.', # 'Marella Agnelli', # 'Robert Agostinelli', # ...] ``` Recursively get pages from subcategories. This can be quite slow for deep subcategory trees and may break if loops are present. ```python wiki.get_pages("Art_collectors_by_nationality", get_subcats=True, recursive=True) # ['Very', # 'Long, # 'List'] ``` Get pages from subcategories as a dictionary containing subcategories as keys. `'self'` contains the pages in the root category. Using `recursive` in combination with the `with_subcats` results in a nested dictionary. ```python # get first level subcats in dict wiki.get_pages("Art_collectors_by_nationality", get_subcats=True, with_subcats=True) # {'self': [], # 'American art collectors': ['William Hayes Ackland', # 'William Acquavella', # 'Frederick Baldwin Adams Jr.', # ...], # ... # 'Venezuelan art collectors': ['Gustavo Cisneros', # 'Patricia Phelps de Cisneros', # 'Nina Fuentes'], # 'Yugoslav art collectors': ['Antun Bauer (museologist)', 'Erich Šlomović']} ``` ## Getting sets Get an intersection of 2 or more categories. ```python hp_wiki.get_set(['1980_births', 'Hogwarts_dropouts'], operations='&') # ['Harry Potter', 'Ronald Weasley'] hp_wiki.get_set(['1980_births', 'Hogwarts_dropouts', 'Green-eyed_individuals'], operations='intersection') # ['Harry Potter'] ``` Get a union of 2 or more categories. ```python wiki = MediaWikiTools('en.wikipedia.org') # get the union of countries in Europe and Asia and save it countries = wiki.get_set(['Countries in Asia', 'Countries_in_Europe'], operations='or') print(countries) # ['Cyprus', # 'Pakistan', # 'Croatia', # ... # 'Belarus', # 'Bangladesh', # 'Lithuania'] ``` Chaining operations. ```python # intersect the saved list with a different category wiki.get_set(['Russian-speaking_countries_and_territories', countries], operations='&') # ['Kyrgyzstan', # 'Moldova', # 'Russia', # 'Armenia', # 'Tajikistan', # 'Belarus', # 'Azerbaijan', # 'Uzbekistan', # 'Mongolia', # 'Kazakhstan'] ``` Without any saved variables. The number of operations must equal the number of categories minus one. ```python wiki.get_set(['Countries in Asia', 'Countries_in_Europe', 'Russian-speaking_countries_and_territories'], operations=['or', 'and']) # same as above ``` """