facilities:slq_wiki:reporting (last revised 2021/01/20 16:17 by pgullberg)

===== WikiRT - SLQ Wiki Reporting Tool =====
WikiRT is a cross-platform, Python-based tool for turning the SLQ Wiki's Apache access logs into visitor reports.

==== Background ====

==== How it works ====

=== Step 0 – Converting the Apache log to a Dataframe ===
Converts the log to a pandas DataFrame by specifying separator characters, filtering columns, and creating table headers. It also converts the date and time into a workable format, so that each entry is automatically recognised as a timestamp and can be filtered and sorted as needed using pandas.

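A minimal sketch of this step, assuming Apache's standard "combined" log format (the sample lines, column names and the log_to_dataframe helper are illustrative, not the tool's actual code):

```python
import io
import re

import pandas as pd

# Illustrative sample lines in Apache "combined" log format (not real traffic).
LOG = """\
203.0.113.5 - - [14/Dec/2020:11:41:00 +1000] "GET /wiki/start HTTP/1.1" 200 5120 "https://example.org/" "Mozilla/5.0"
198.51.100.7 - - [14/Dec/2020:11:42:30 +1000] "GET /wiki/facilities HTTP/1.1" 200 2048 "-" "curl/7.68.0"
"""

# One named group per column kept: IP, time, request, status, size,
# referer and user agent.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def log_to_dataframe(text):
    """Parse log lines into a DataFrame with a real timestamp column."""
    rows = [m.groupdict() for m in map(PATTERN.match, io.StringIO(text)) if m]
    df = pd.DataFrame(rows)
    # Convert the Apache timestamp so pandas recognises it and can
    # filter and sort on it directly.
    df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
    return df

df = log_to_dataframe(LOG)
```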
=== Step 1 – Date masking ===
Drops dates outside the desired timeframe. The from and to dates are specified by the user in the UI.

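As a sketch, the masking amounts to a boolean filter on the timestamp column (the sample data, column names and inclusive bounds are assumptions, not necessarily the tool's exact behaviour):

```python
import pandas as pd

# Hypothetical frame produced by the log-parsing step; "time" is a
# parsed timestamp column.
df = pd.DataFrame({
    "ip": ["203.0.113.5", "198.51.100.7", "203.0.113.9"],
    "time": pd.to_datetime(
        ["2020-12-01 09:00", "2020-12-14 11:41", "2021-02-01 08:00"]),
})

# From/to dates as a user might enter them in the UI (treated as inclusive here).
start, end = pd.Timestamp("2020-12-01"), pd.Timestamp("2020-12-31")
in_range = df[df["time"].between(start, end)]
```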
=== Step 2 – Filtering ===
Filters out any known bots and web crawlers. The list can be updated as more are identified. The bots and crawlers have been identified using both manual and automatic methods, based on keywords and common SQL injection queries affecting the REFERER, USER_AGENT and REQUEST fields (SQL queries apply to REQUEST only).

An example of a crawler that the initial keywords didn't pick up was WordPress jetmon, which requested the exact same wiki page every 7 minutes (for the duration of the log) from the same IP address, using the same REFERER and USER AGENT, for a total of approximately 10,000 requests. This is quite clearly automated traffic and is therefore filtered out.

There are currently two lists: one for bots and crawlers, and one for SQL queries (automated attempts to access database information via SQL injection).

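A sketch of the keyword-based filtering, with hypothetical keyword lists and column names (the real lists are longer and maintained separately):

```python
import pandas as pd

# Hypothetical keyword lists; the real tool keeps one list for
# bots/crawlers and one for SQL-injection signatures.
BOT_KEYWORDS = ["bot", "crawler", "spider", "jetmon"]
SQL_KEYWORDS = ["union select", "information_schema"]

df = pd.DataFrame({
    "request": ["GET /wiki/start", "GET /wiki/?id=1 union select 1", "GET /wiki/help"],
    "referer": ["-", "-", "https://example.org/"],
    "user_agent": ["Mozilla/5.0", "sqlmap/1.4", "jetmon/1.0 (Jetpack monitoring)"],
})

# Bot keywords are matched against REFERER and USER_AGENT;
# SQL keywords against the REQUEST field only.
bot_pat = "|".join(BOT_KEYWORDS)
sql_pat = "|".join(SQL_KEYWORDS)
is_bot = (df["referer"].str.contains(bot_pat, case=False)
          | df["user_agent"].str.contains(bot_pat, case=False)
          | df["request"].str.contains(sql_pat, case=False))
clean = df[~is_bot]
```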
=== Step 3 – Sessions ===
It was advised to count visits from a user only once per session, currently set to 30 minutes. Timestamps are rounded down to this frequency before dropping duplicate rows where both the IP and the rounded time match, keeping only the first occurrence.

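In pandas, the session logic above can be sketched as (sample data and column names assumed):

```python
import pandas as pd

# Hypothetical requests: two from the same IP within one 30-minute window.
df = pd.DataFrame({
    "ip": ["203.0.113.5", "203.0.113.5", "198.51.100.7"],
    "time": pd.to_datetime(
        ["2020-12-14 11:05", "2020-12-14 11:20", "2020-12-14 11:10"]),
})

# Round timestamps down to 30-minute buckets, then keep only the first
# request per (ip, bucket) pair.
df["bucket"] = df["time"].dt.floor("30min")
sessions = df.drop_duplicates(subset=["ip", "bucket"], keep="first")
```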
=== Step 4 – Removing rows without user agent information ===
A legitimate request is highly unlikely to lack a USER AGENT: every modern browser sends user agent information. If the server logs a request without it, either someone is trying to game the system or there was an error somewhere in the request. As a result, all rows without a user agent are dropped.

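A minimal sketch, assuming Apache records a missing user agent as "-":

```python
import pandas as pd

# Hypothetical frame; Apache typically logs a missing user agent as "-".
df = pd.DataFrame({
    "ip": ["203.0.113.5", "198.51.100.7"],
    "user_agent": ["Mozilla/5.0", "-"],
})

# Drop rows whose user agent is missing, empty, or the "-" placeholder.
kept = df[df["user_agent"].notna() & ~df["user_agent"].isin(["-", ""])]
```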
=== Step 5 – Calculating file downloads ===

Counts requests for specified file types (images not included).

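A sketch of the download count, with a hypothetical list of counted file types (the tool's actual list may differ):

```python
import pandas as pd

# Hypothetical file types counted as downloads (images excluded).
DOWNLOAD_EXTENSIONS = (".pdf", ".zip", ".docx")

df = pd.DataFrame({
    "request": [
        "GET /wiki/_media/guide.pdf HTTP/1.1",
        "GET /wiki/_media/logo.png HTTP/1.1",
        "GET /wiki/start HTTP/1.1",
    ],
})

# Extract the requested path (second token of the request line) and
# count rows whose path ends in one of the download extensions.
path = df["request"].str.split().str[1]
downloads = int(path.apply(lambda p: p.lower().endswith(DOWNLOAD_EXTENSIONS)).sum())
```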
=== Step 6 – Calculating unique visits ===

Outputs the number of rows remaining in the file after the processing in Steps 1–5.

=== Step 7 – Exporting csv with results ===

The remaining data is exported to a new CSV file.
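Steps 6 and 7 together can be sketched as follows (the frame and CSV destination are illustrative; the real tool writes to a file chosen by the user):

```python
import io

import pandas as pd

# Hypothetical frame left after Steps 1-5.
df = pd.DataFrame({
    "ip": ["203.0.113.5", "198.51.100.7"],
    "time": pd.to_datetime(["2020-12-14 11:05", "2020-12-14 11:10"]),
})

unique_visits = len(df)  # Step 6: each remaining row counts as one unique visit

# Step 7: export for further analysis (a real run would pass a file path
# instead of an in-memory buffer).
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()
```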
==== WikiRT app features ====
  * Splits user activity into sessions (currently 30 minutes, to reflect SLQ website analytics)
  * Outputs a clean csv file with results for further analysis
==== Development tools ====
The app was created using Python and the following tools:

  * [[https://pandas.pydata.org/|pandas]] (data analysis)
  * PyQt5 (graphical user interface)
  * Helper libraries: numpy, datetime, pytz
==== TODO ====
  * Unordered List Item