diff --git a/README.md b/README.md
index 2e1132622c08fd68d804056576166bbab01af036..5ac849c932c9da660d9a90c7fc1941012ef913b0 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,104 @@
-# Poject set up
+# Project set up
 
-1. 
\ No newline at end of file
+1. Clone this repo
+1. Follow these steps from this [page](https://cloud.google.com/bigquery/docs/authentication#client-libs)
+
+    Set up Application Default Credentials (ADC) in your local environment:
+
+    * Install the Google Cloud CLI, then initialize it by running the following command:
+    `gcloud init`
+
+    * Create local authentication credentials for your Google Account:
+    `gcloud auth application-default login`
+
+    * A login screen is displayed. After you log in, your credentials are stored in the local credential file used by ADC.
+
+    For more information about working with ADC in a local environment, see [Local development environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev).
+
+1. Run `./gradlew build`
+    You should see: `BUILD SUCCESSFUL`
+
+1. The project should now be ready
+
+# Usage
+
+Run the project with the command: `./gradlew run`
+Without parameters, it runs in the default configuration.
+
+To change the configuration, pass command parameters:
+
+1. **files** - number of files to fetch from GitHub
+1. **batch** - number of files in one batch
+1. **mode**
+    mode=1 - use example files (for testing purposes)
+    mode=2 - fetch files from GitHub
+    mode=3 - delete the local DB with queries (for testing purposes)
+1. **offset** - number of already processed files; files from GitHub are ordered by repository name, so already processed files won't be fetched again
+1. **sample**
+    sample=1 - use the sample GitHub collections (for testing purposes)
+    sample=2 - use the original GitHub collection
+
+Parameters can be used like this:
+* `./gradlew run -Dfiles=100 -Dbatch=50 -Dmode=2 -Doffset=200 -Dsample=2`
+* This command fetches **100 files** in two **batches of 50 files** from **GitHub** (mode=2), starting at **offset 200** (so files 200 to 300 are processed), from the **original GitHub collection** (sample=2)
+
+During testing, I found that the most efficient **batch** size is **100**. In this configuration, it takes around **5 minutes** to fetch one batch (100 files) from GitHub. Extrapolating linearly (100 batches of about 5 minutes each, roughly 500 minutes), **10000 files** should take around **8 hours**.
+
+Here are some test runs that support the numbers above (each run processed 100 files):
+
+Batch size: 10
+Github total request time: 17,4 min
+```
+Github requests time in ms: 1049982(98,22%)
+ANTLR parsing time in ms: 7533(0,70%)
+Parsing tree string finding: 5841(0,55%)
+Whole time: 1068991
+Number of all found queries: 18(from 100 files)
+Number of TSql queries: 0
+Number of Postgre SQL queries: 15
+Number of PlSql queries: 0
+Number of MySql queries: 3
+New offset to use for query: 100
+```
+Batch size: 50
+Github total request time: 6,3 min
+```
+Github requests time in ms: 376292(98,18%)
+ANTLR parsing time in ms: 1688(0,44%)
+Parsing tree string finding: 448(0,12%)
+Whole time: 383269
+Number of all found queries: 2(from 100 files)
+Number of TSql queries: 0
+Number of Postgre SQL queries: 2
+Number of PlSql queries: 0
+Number of MySql queries: 0
+New offset to use for query: 200
+```
+Batch size: 100
+Github total request time: 5,2 min
+```
+Github requests time in ms: 312213(95,48%)
+ANTLR parsing time in ms: 5346(1,63%)
+Parsing tree string finding: 5180(1,58%)
+Whole time: 326986
+Number of all found queries: 36(from 100 files)
+Number of TSql queries: 0
+Number of Postgre SQL queries: 0
+Number of PlSql queries: 0
+Number of MySql queries: 36
+New offset to use for query: 300
+```
\ No newline at end of file
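
Note on the Usage section of this README: the interaction of `files`, `batch`, and `offset`, and the runtime extrapolation, can be illustrated with a minimal Java sketch. `BatchPlanSketch`, `batchRanges`, and `estimatedHours` are hypothetical names that do not exist in the project. The half-open batch ranges and the linear scaling of run time are assumptions drawn only from the README text and the logged timings, not the project's confirmed behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the project: models how the README
// describes the files/batch/offset parameters (an assumption).
final class BatchPlanSketch {

    // Splits the window [offset, offset + files) into batches of at most
    // `batch` files. Each entry is a {start, endExclusive} pair.
    static List<int[]> batchRanges(int offset, int files, int batch) {
        if (offset < 0 || files < 0 || batch <= 0) {
            throw new IllegalArgumentException(
                "offset and files must be >= 0, batch must be > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        int end = offset + files;
        for (int start = offset; start < end; start += batch) {
            ranges.add(new int[] {start, Math.min(start + batch, end)});
        }
        return ranges;
    }

    // Linear extrapolation: assumes run time grows proportionally with the
    // number of files, based on one measured run.
    static double estimatedHours(int totalFiles, int measuredFiles, long measuredMillis) {
        return (double) totalFiles / measuredFiles * measuredMillis / 3_600_000.0;
    }

    public static void main(String[] args) {
        // Mirrors: ./gradlew run -Dfiles=100 -Dbatch=50 -Dmode=2 -Doffset=200 -Dsample=2
        for (int[] range : batchRanges(200, 100, 50)) {
            System.out.println("batch: files " + range[0] + " to " + (range[1] - 1));
        }
        System.out.println("next offset: " + (200 + 100));

        // Measured "Whole time" for 100 files with batch size 100.
        double hours = estimatedHours(10_000, 100, 326_986L);
        System.out.println("estimated hours for 10000 files: " + hours);
    }
}
```

With the measured whole time of 326986 ms per 100 files, this extrapolation gives roughly 9 hours for 10000 files, slightly above the 8 hours obtained from the rounded figure of 5 minutes per batch.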