Project setup
- Clone this repo.
- Follow these steps from the page "Set up Application Default Credentials (ADC) in your local environment":
  - Install the Google Cloud CLI, then initialize it by running the following command:
    gcloud init
  - Create local authentication credentials for your Google Account:
    gcloud auth application-default login
  - A login screen is displayed. After you log in, your credentials are stored in the local credential file used by ADC.
    For more information about working with ADC in a local environment, see "Local development environment".
- Run:
    ./gradlew build
  You should see: BUILD SUCCESSFUL
- The project should now be ready.
Usage
The project can be run with the command: ./gradlew run
This runs it with the default configuration.
To change the configuration, pass command parameters:
- files - number of files to fetch from GitHub
- batch - number of files in one batch
- mode
  - mode=1 - use example files (for testing purposes)
  - mode=2 - fetch files from GitHub
  - mode=3 - delete the local DB with queries (for testing purposes)
- offset - number of already processed files; files from GitHub are ordered by repository name, so already processed files won't be fetched again
- sample
  - sample=1 - use sample GitHub collections (for testing purposes)
  - sample=2 - use the original GitHub collection
Parameters can be used like this:
./gradlew run -Dfiles=100 -Dbatch=50 -Dmode=2 -Doffset=200 -Dsample=2
- This command fetches 100 files from GitHub (mode=2) in two batches of 50 files, starting at offset 200 (so it covers files 200 to 300), using the original GitHub collection (sample=2).
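To make the interaction of files, batch and offset concrete, here is a minimal Python sketch of how such a run could be split into batches. It is an illustration only, not the project's actual code: the function name `batch_ranges` is hypothetical, and it assumes half-open ranges and that a final partial batch is simply shortened.

```python
def batch_ranges(files: int, batch: int, offset: int = 0) -> list[tuple[int, int]]:
    """Split `files` items starting at `offset` into half-open (start, end) ranges.

    Hypothetical helper: the real tool may split work differently.
    """
    if files < 0 or offset < 0 or batch <= 0:
        raise ValueError("files and offset must be >= 0, batch must be > 0")
    end = offset + files
    # Each batch starts `batch` items after the previous one; the last batch
    # is cut short if `files` is not a multiple of `batch` (an assumption).
    return [(start, min(start + batch, end)) for start in range(offset, end, batch)]


if __name__ == "__main__":
    # Mirrors -Dfiles=100 -Dbatch=50 -Doffset=200
    print(batch_ranges(files=100, batch=50, offset=200))  # [(200, 250), (250, 300)]
```

Under these assumptions, the end of the last range (300) is the value to pass as offset for the next run.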
During testing, I found that the most efficient batch size is 100. With this configuration, it takes around 5 minutes to fetch one batch (100 files) from GitHub. From this we can estimate that 10000 files will take around 8 hours.
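The 8-hour figure follows from simple linear extrapolation, sketched below. The function name `estimate_hours` is hypothetical, and the sketch assumes that every batch takes the same time as the measured one, which may not hold if GitHub rate limiting or network conditions change.

```python
import math


def estimate_hours(total_files: int, batch_size: int = 100,
                   minutes_per_batch: float = 5.0) -> float:
    """Estimate total run time in hours, assuming a constant time per batch."""
    batches = math.ceil(total_files / batch_size)
    return batches * minutes_per_batch / 60


if __name__ == "__main__":
    # 10000 files -> 100 batches of 5 minutes -> 500 minutes
    print(round(estimate_hours(10_000), 1))  # 8.3
```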
Here are some test runs that support the batch-size figures above:
Batch size: 10 Github total request time: 17,4 min
Github requests time in ms: 1049982(98,22%)
ANTLR parsing time in ms: 7533(0,70%)
Parsing tree string finding: 5841(0,55%)
Whole time: 1068991
Number of all found queries: 18(from 100 files)
Number of TSql queries: 0
Number of Postgre SQL queries: 15
Number of PlSql queries: 0
Number of MySql queries: 3
New offset to use for query: 100
Batch size: 50 Github total request time: 6,3 min
Github requests time in ms: 376292(98,18%)
ANTLR parsing time in ms: 1688(0,44%)
Parsing tree string finding: 448(0,12%)
Whole time: 383269
Number of all found queries: 2(from 100 files)
Number of TSql queries: 0
Number of Postgre SQL queries: 2
Number of PlSql queries: 0
Number of MySql queries: 0
New offset to use for query: 200
Batch size: 100 Github total request time: 5,2 min
Github requests time in ms: 312213(95,48%)
ANTLR parsing time in ms: 5346(1,63%)
Parsing tree string finding: 5180(1,58%)
Whole time: 326986
Number of all found queries: 36(from 100 files)
Number of TSql queries: 0
Number of Postgre SQL queries: 0
Number of PlSql queries: 0
Number of MySql queries: 36
New offset to use for query: 300
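The percentages in these runs can be rechecked from the raw millisecond values. The short Python sketch below does that for the GitHub request share and also derives an average time per file. The dictionary layout and function names are made up for illustration; only the numbers are taken from the output above.

```python
# Figures copied from the three test runs above (100 files each).
runs = [
    {"batch": 10, "github_ms": 1049982, "total_ms": 1068991, "files": 100},
    {"batch": 50, "github_ms": 376292, "total_ms": 383269, "files": 100},
    {"batch": 100, "github_ms": 312213, "total_ms": 326986, "files": 100},
]


def github_share(run: dict) -> float:
    """Percentage of the whole run time spent on GitHub requests."""
    return 100 * run["github_ms"] / run["total_ms"]


def seconds_per_file(run: dict) -> float:
    """Average wall-clock seconds spent per fetched file."""
    return run["total_ms"] / 1000 / run["files"]


if __name__ == "__main__":
    for run in runs:
        print(f"batch={run['batch']:>3}  "
              f"GitHub share={github_share(run):.2f}%  "
              f"{seconds_per_file(run):.1f} s/file")
```

In all three runs GitHub requests account for more than 95% of the total time, so the batch size mainly matters through its effect on request time rather than on parsing.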