third commit

Commit f488b82a authored 1 year ago by samuelvasecka
--- a/README.md
+++ b/README.md
-# Poject set up
+# Project set up

-1. 
\ No newline at end of file
+1. Clone this repo
+1. Follow these steps from this [page](https://cloud.google.com/bigquery/docs/authentication#client-libs)
+&nbsp;
+    Set up Application Default Credentials (ADC) in your local environment:
+
+    * Install the Google Cloud CLI, then initialize it by running the following command:
+    `gcloud init`
+
+    * Create local authentication credentials for your Google Account:
+    `gcloud auth application-default login`
+    
+    * A login screen is displayed. After you log in, your credentials are stored in the local credential file used by ADC.
+
+    For more information about working with ADC in a local environment, see [Local development environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev).
+&nbsp;
+1. Run `./gradlew build`
+    You should see: `BUILD SUCCESSFUL`
+
+1. The project should now be ready.
+
+# Usage
+
+Run the project with the command `./gradlew run`.
+This runs it with the default configuration.
+
+To change the configuration, pass command-line parameters:
+
+1. **files** - number of files to fetch from GitHub
+1. **batch** - number of files in one batch
+1. **mode**
+    * mode=1 - use example files (for testing purposes)
+    * mode=2 - fetch files from GitHub
+    * mode=3 - delete the local DB with queries (for testing purposes)
+1. **offset** - number of files already processed, so that they are not fetched again (files from GitHub are ordered by repository name)
+1. **sample**
+    * sample=1 - use the sample GitHub collections (for testing purposes)
+    * sample=2 - use the original GitHub collection
+Parameters can be used like this:
+* `./gradlew run -Dfiles=100 -Dbatch=50 -Dmode=2 -Doffset=200 -Dsample=2`
+* This command fetches **100 files** in two batches of **50 files** from **GitHub** (mode=2), starting at **offset 200** (so files 200 to 300 are processed), using the **original GitHub collection** (sample=2)
+
+In testing, I found that the most efficient **batch** size is **100**. With this configuration, it takes around **5 minutes** to fetch one batch (100 files) from GitHub. At that rate, **10000 files** should take around **8 hours** (100 batches at 5 minutes each is about 500 minutes).
+
+Here are some test runs that support the figures above:
+
+Batch size: 10
+GitHub total request time: 17,4 min
+```
+Github requests time in ms: 1049982(98,22%)
+ANTLR parsing time in ms: 7533(0,70%)
+Parsing tree string finding: 5841(0,55%)
+Whole time: 1068991
+Number of all found queries: 18(from 100 files)
+Number of TSql queries: 0
+Number of Postgre SQL queries: 15
+Number of PlSql queries: 0
+Number of MySql queries: 3
+New offset to use for query: 100
+```
+Batch size: 50
+GitHub total request time: 6,3 min
+```
+Github requests time in ms: 376292(98,18%)
+ANTLR parsing time in ms: 1688(0,44%)
+Parsing tree string finding: 448(0,12%)
+Whole time: 383269
+Number of all found queries: 2(from 100 files)
+Number of TSql queries: 0
+Number of Postgre SQL queries: 2
+Number of PlSql queries: 0
+Number of MySql queries: 0
+New offset to use for query: 200
+```
+Batch size: 100
+GitHub total request time: 5,2 min
+```
+Github requests time in ms: 312213(95,48%)
+ANTLR parsing time in ms: 5346(1,63%)
+Parsing tree string finding: 5180(1,58%)
+Whole time: 326986
+Number of all found queries: 36(from 100 files)
+Number of TSql queries: 0
+Number of Postgre SQL queries: 0
+Number of PlSql queries: 0
+Number of MySql queries: 36
+New offset to use for query: 300
+```
\ No newline at end of file
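
A note on the Usage section of this README: the sketch below shows one way the `files`, `batch` and `offset` parameters could be read and turned into a batch count, a follow-up offset and a rough duration. It is a minimal illustration, not the project's code. The class `RunPlan`, its default values and the linear estimate are all assumptions. The only figures taken from the README are "about 5 minutes per 100 files at batch size 100" and the observation that the reported new offset grows by the number of files fetched. It also assumes that the Gradle build forwards `-D` system properties to the application JVM, which a plain `run` task does not necessarily do.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; it is not part of the project. */
final class RunPlan {
    // Assumed defaults: the README does not state the real default configuration.
    static final int DEFAULT_FILES = 100;
    static final int DEFAULT_BATCH = 100;
    static final int DEFAULT_OFFSET = 0;

    // From the README's measurements: roughly 5 minutes per 100 files
    // when the batch size is 100. Other batch sizes were slower.
    static final double MINUTES_PER_100_FILES = 5.0;

    final int files;
    final int batch;
    final int offset;

    RunPlan(int files, int batch, int offset) {
        if (files < 0 || batch <= 0 || offset < 0) {
            throw new IllegalArgumentException("files and offset must be >= 0, batch must be > 0");
        }
        this.files = files;
        this.batch = batch;
        this.offset = offset;
    }

    /** Reads -Dfiles, -Dbatch and -Doffset from the JVM system properties. */
    static RunPlan fromSystemProperties() {
        return new RunPlan(
                Integer.getInteger("files", DEFAULT_FILES),
                Integer.getInteger("batch", DEFAULT_BATCH),
                Integer.getInteger("offset", DEFAULT_OFFSET));
    }

    /** Number of batches, rounding up when the last batch is only partly filled. */
    int batchCount() {
        return (files + batch - 1) / batch;
    }

    /** Offset to pass to the next run, assuming every requested file was processed. */
    int nextOffset() {
        return offset + files;
    }

    /** Linear extrapolation of the measured rate; only meaningful for batch size 100. */
    double estimatedMinutes() {
        return files / 100.0 * MINUTES_PER_100_FILES;
    }

    public static void main(String[] args) {
        RunPlan plan = fromSystemProperties();
        System.out.printf(Locale.ROOT,
                "%d batches, about %.1f hours, next offset %d%n",
                plan.batchCount(), plan.estimatedMinutes() / 60.0, plan.nextOffset());
    }
}
```

For the README's example (`-Dfiles=100 -Dbatch=50 -Doffset=200`) this gives two batches and a next offset of 300. For 10000 files the estimate is 500 minutes, which matches the "around 8 hours" figure.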