facilities:slq_wiki:reporting (last revised 2021/01/20 16:17 by pgullberg)

===== WikiRT - SLQ Wiki Reporting Tool =====
WikiRT is a cross-platform, Python-based tool for turning the SLQ Wiki's Apache access logs into visitor reports.

==== Background ====

==== How it works ====

=== Step 0 – Converting the Apache log to a Dataframe ===
Converts the log to a pandas DataFrame by specifying separator characters, filtering columns, and creating table headers. It also converts the date and time into a workable format, so that each entry is automatically recognised as a timestamp and can be filtered and sorted as needed using pandas.

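A minimal sketch of this step, assuming Apache's standard "combined" log format (the sample lines, column names and the log_to_dataframe helper are illustrative, not the tool's actual code):

```python
import io
import re

import pandas as pd

# Illustrative sample lines in Apache "combined" log format (not real traffic).
LOG = """\
203.0.113.5 - - [14/Dec/2020:11:41:00 +1000] "GET /wiki/start HTTP/1.1" 200 5120 "https://example.org/" "Mozilla/5.0"
198.51.100.7 - - [14/Dec/2020:11:42:30 +1000] "GET /wiki/facilities HTTP/1.1" 200 2048 "-" "curl/7.68.0"
"""

# One named group per column kept: IP, time, request, status, size,
# referer and user agent.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def log_to_dataframe(text):
    """Parse log lines into a DataFrame with a real timestamp column."""
    rows = [m.groupdict() for m in map(PATTERN.match, io.StringIO(text)) if m]
    df = pd.DataFrame(rows)
    # Convert the Apache timestamp so pandas recognises it and can
    # filter and sort on it directly.
    df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
    return df

df = log_to_dataframe(LOG)
```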
=== Step 1 – Date masking ===
Drops dates outside the desired timeframe. The from and to dates are specified by the user in the UI.

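As a sketch, the masking amounts to a boolean filter on the timestamp column (the sample data, column names and inclusive bounds are assumptions, not necessarily the tool's exact behaviour):

```python
import pandas as pd

# Hypothetical frame produced by the log-parsing step; "time" is a
# parsed timestamp column.
df = pd.DataFrame({
    "ip": ["203.0.113.5", "198.51.100.7", "203.0.113.9"],
    "time": pd.to_datetime(
        ["2020-12-01 09:00", "2020-12-14 11:41", "2021-02-01 08:00"]),
})

# From/to dates as a user might enter them in the UI (treated as inclusive here).
start, end = pd.Timestamp("2020-12-01"), pd.Timestamp("2020-12-31")
in_range = df[df["time"].between(start, end)]
```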
=== Step 2 – Filtering ===
Filters out any known bots and web crawlers. The list can be updated as more are identified. The bots and crawlers have been identified using both manual and automatic methods, based on keywords and common SQL injection queries affecting the REFERER, USER_AGENT and REQUEST fields (SQL queries apply to REQUEST only).

An example of a crawler that the initial keywords didn't pick up was WordPress jetmon, which requested the exact same wiki page every 7 minutes (for the duration of the log) from the same IP address, using the same REFERER and USER AGENT, for a total of approximately 10,000 requests. This is quite clearly automated traffic and is therefore filtered out.

There are currently two lists: one for bots and crawlers, and one for SQL queries (automated attempts to access database information via SQL injection).

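A sketch of the keyword-based filtering, with hypothetical keyword lists and column names (the real lists are longer and maintained separately):

```python
import pandas as pd

# Hypothetical keyword lists; the real tool keeps one list for
# bots/crawlers and one for SQL-injection signatures.
BOT_KEYWORDS = ["bot", "crawler", "spider", "jetmon"]
SQL_KEYWORDS = ["union select", "information_schema"]

df = pd.DataFrame({
    "request": ["GET /wiki/start", "GET /wiki/?id=1 union select 1", "GET /wiki/help"],
    "referer": ["-", "-", "https://example.org/"],
    "user_agent": ["Mozilla/5.0", "sqlmap/1.4", "jetmon/1.0 (Jetpack monitoring)"],
})

# Bot keywords are matched against REFERER and USER_AGENT;
# SQL keywords against the REQUEST field only.
bot_pat = "|".join(BOT_KEYWORDS)
sql_pat = "|".join(SQL_KEYWORDS)
is_bot = (df["referer"].str.contains(bot_pat, case=False)
          | df["user_agent"].str.contains(bot_pat, case=False)
          | df["request"].str.contains(sql_pat, case=False))
clean = df[~is_bot]
```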
=== Step 3 – Sessions ===
It was advised to count visits from a user only once per session, currently set to 30 minutes. Timestamps are rounded down to this frequency before dropping duplicate rows where both the IP and the rounded time match, keeping only the first occurrence.

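In pandas, the session logic above can be sketched as (sample data and column names assumed):

```python
import pandas as pd

# Hypothetical requests: two from the same IP within one 30-minute window.
df = pd.DataFrame({
    "ip": ["203.0.113.5", "203.0.113.5", "198.51.100.7"],
    "time": pd.to_datetime(
        ["2020-12-14 11:05", "2020-12-14 11:20", "2020-12-14 11:10"]),
})

# Round timestamps down to 30-minute buckets, then keep only the first
# request per (ip, bucket) pair.
df["bucket"] = df["time"].dt.floor("30min")
sessions = df.drop_duplicates(subset=["ip", "bucket"], keep="first")
```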
=== Step 4 – Removing rows without user agent information ===
A legitimate request is highly unlikely to lack a USER AGENT: every modern browser sends user agent information. If the server logs a request without it, either someone is trying to game the system or there was an error somewhere in the request. As a result, all rows without a user agent are dropped.

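A minimal sketch, assuming Apache records a missing user agent as "-":

```python
import pandas as pd

# Hypothetical frame; Apache typically logs a missing user agent as "-".
df = pd.DataFrame({
    "ip": ["203.0.113.5", "198.51.100.7"],
    "user_agent": ["Mozilla/5.0", "-"],
})

# Drop rows whose user agent is missing, empty, or the "-" placeholder.
kept = df[df["user_agent"].notna() & ~df["user_agent"].isin(["-", ""])]
```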
=== Step 5 – Calculating file downloads ===

Counts requests for specified file types (images not included).

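A sketch of the download count, with a hypothetical list of counted file types (the tool's actual list may differ):

```python
import pandas as pd

# Hypothetical file types counted as downloads (images excluded).
DOWNLOAD_EXTENSIONS = (".pdf", ".zip", ".docx")

df = pd.DataFrame({
    "request": [
        "GET /wiki/_media/guide.pdf HTTP/1.1",
        "GET /wiki/_media/logo.png HTTP/1.1",
        "GET /wiki/start HTTP/1.1",
    ],
})

# Extract the requested path (second token of the request line) and
# count rows whose path ends in one of the download extensions.
path = df["request"].str.split().str[1]
downloads = int(path.apply(lambda p: p.lower().endswith(DOWNLOAD_EXTENSIONS)).sum())
```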
=== Step 6 – Calculating unique visits ===

Outputs the number of rows remaining in the file after the processing in Steps 1–5.

=== Step 7 – Exporting csv with results ===

The remaining data is exported to a new CSV file.
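Steps 6 and 7 together can be sketched as follows (the frame and CSV destination are illustrative; the real tool writes to a file chosen by the user):

```python
import io

import pandas as pd

# Hypothetical frame left after Steps 1-5.
df = pd.DataFrame({
    "ip": ["203.0.113.5", "198.51.100.7"],
    "time": pd.to_datetime(["2020-12-14 11:05", "2020-12-14 11:10"]),
})

unique_visits = len(df)  # Step 6: each remaining row counts as one unique visit

# Step 7: export for further analysis (a real run would pass a file path
# instead of an in-memory buffer).
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()
```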
==== WikiRT app features ====
  * Splits user activity into sessions (currently 30 minutes, to reflect SLQ website analytics)
  * Outputs a clean csv file with results for further analysis
==== Development tools ====
The app was created using Python and the following tools:

  * [[https://pandas.pydata.org/|pandas]] (data analysis)
  * PyQt5 (graphical user interface)
  * Helper libraries: numpy, datetime, pytz
==== TODO ====
  * Unordered List Item