samuelvasecka authored
dc76d55e

Project setup

  1. Clone this repo

  2. Set up Application Default Credentials (ADC) in your local environment by following these steps from the Google Cloud documentation:

    • Install the Google Cloud CLI, then initialize it by running the following command: gcloud init

    • Create local authentication credentials for your Google Account: gcloud auth application-default login

    • A login screen is displayed. After you log in, your credentials are stored in the local credential file used by ADC.

    For more information about working with ADC in a local environment, see Local development environment.  

  3. Run ./gradlew build. You should see: BUILD SUCCESSFUL

  4. The project is now ready to use.

Usage

Run the project with the command ./gradlew run, which uses the default configuration.

To change the configuration, pass the following parameters on the command line:

  1. files - number of files to fetch from GitHub
  2. batch - number of files in one batch
  3. mode
    • mode=1 - use example files (for testing purposes)
    • mode=2 - fetch files from GitHub
    • mode=3 - delete the local DB with queries (for testing purposes)
  4. offset - number of already processed files; files from GitHub are ordered by repository name, so processed files won't be fetched again
  5. sample
    • sample=1 - use the sample GitHub collections (for testing purposes)
    • sample=2 - use the original GitHub collection
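As a rough illustration of how such parameters could be read inside the application, here is a minimal sketch. It assumes the build forwards the -D values to the application JVM as system properties. The class RunConfig, its method names, and the fallback values used in main are hypothetical placeholders, not the project's real code or defaults.

```java
// Hypothetical sketch; class and method names are not taken from the project.
final class RunConfig {
    final int files;
    final int batch;
    final int mode;
    final int offset;
    final int sample;

    RunConfig(int files, int batch, int mode, int offset, int sample) {
        // Validate against the value ranges described in the parameter list.
        if (files < 0 || offset < 0) {
            throw new IllegalArgumentException("files and offset must be >= 0");
        }
        if (batch < 1) {
            throw new IllegalArgumentException("batch must be >= 1");
        }
        if (mode < 1 || mode > 3) {
            throw new IllegalArgumentException("mode must be 1, 2 or 3");
        }
        if (sample < 1 || sample > 2) {
            throw new IllegalArgumentException("sample must be 1 or 2");
        }
        this.files = files;
        this.batch = batch;
        this.mode = mode;
        this.offset = offset;
        this.sample = sample;
    }

    // Reads -Dname=value system properties; any property that is missing or
    // not a valid integer falls back to the value supplied by the caller.
    static RunConfig fromSystemProperties(RunConfig fallback) {
        return new RunConfig(
                Integer.getInteger("files", fallback.files),
                Integer.getInteger("batch", fallback.batch),
                Integer.getInteger("mode", fallback.mode),
                Integer.getInteger("offset", fallback.offset),
                Integer.getInteger("sample", fallback.sample));
    }

    public static void main(String[] args) {
        // Placeholder fallbacks (assumption): the real defaults are not documented here.
        RunConfig placeholder = new RunConfig(10, 10, 1, 0, 1);
        RunConfig cfg = fromSystemProperties(placeholder);
        System.out.println("files=" + cfg.files + " batch=" + cfg.batch
                + " mode=" + cfg.mode + " offset=" + cfg.offset
                + " sample=" + cfg.sample);
    }
}
```

Validating mode and sample up front means a typo such as -Dmode=4 fails immediately instead of after a long fetch.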

Parameters can be used like this:

  • ./gradlew run -Dfiles=100 -Dbatch=50 -Dmode=2 -Doffset=200 -Dsample=2
  • This command fetches 100 files from GitHub (mode=2) in two batches of 50 files each, starting at offset 200 (so files 200 to 300 are processed), using the original GitHub collection (sample=2)
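The relationship between files, batch and offset in the example above can be sketched as follows. This is only an illustration: the class BatchPlan is hypothetical, and the handling of a final partial batch (when files is not a multiple of batch) is an assumption, not confirmed behavior of the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the project's actual batching code.
final class BatchPlan {
    // Returns {start, endExclusive} pairs covering `files` items from `offset`.
    static List<int[]> ranges(int files, int batch, int offset) {
        if (files < 0 || batch < 1 || offset < 0) {
            throw new IllegalArgumentException("invalid files/batch/offset");
        }
        List<int[]> out = new ArrayList<>();
        int end = offset + files;
        for (int start = offset; start < end; start += batch) {
            // Assumption: the last batch is shortened if files % batch != 0.
            out.add(new int[] {start, Math.min(start + batch, end)});
        }
        return out;
    }

    public static void main(String[] args) {
        // Same values as: -Dfiles=100 -Dbatch=50 -Doffset=200
        for (int[] r : ranges(100, 50, 200)) {
            System.out.println("batch: " + r[0] + " to " + r[1]);
        }
        // prints:
        // batch: 200 to 250
        // batch: 250 to 300
    }
}
```

The end of the last range (300 here) is the value to pass as offset on the next run.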

During testing, I found that the most efficient batch size is 100. In this configuration, it takes around 5 minutes to fetch one batch (100 files) from GitHub. Extrapolating from this, 10000 files should take around 8 hours.
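The arithmetic behind this estimate can be written out as a small sketch. It assumes the time per batch stays constant as the number of files grows (for example, no additional rate limiting), which the test runs do not verify. The class FetchEstimate is hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch of the extrapolation; assumes linear scaling.
final class FetchEstimate {
    static double estimatedHours(int totalFiles, int batchSize, double minutesPerBatch) {
        if (totalFiles < 0 || batchSize < 1 || minutesPerBatch < 0) {
            throw new IllegalArgumentException("invalid input");
        }
        // Round up so a partial last batch counts as a full batch.
        int batches = (totalFiles + batchSize - 1) / batchSize;
        return batches * minutesPerBatch / 60.0;
    }

    public static void main(String[] args) {
        // 10000 files / 100 per batch = 100 batches; 100 * 5 min = 500 min.
        double hours = estimatedHours(10_000, 100, 5.0);
        System.out.println(String.format(Locale.ROOT, "%.1f hours", hours));
        // prints: 8.3 hours
    }
}
```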

Here are some test runs that support the figures above:

Batch size: 10, GitHub total request time: 17,4 min

Github requests time in ms: 1049982(98,22%)
ANTLR parsing time in ms: 7533(0,70%)
Parsing tree string finding: 5841(0,55%)
Whole time: 1068991
Number of all found queries: 18(from 100 files)
Number of TSql queries: 0
Number of Postgre SQL queries: 15
Number of PlSql queries: 0
Number of MySql queries: 3
New offset to use for query: 100

Batch size: 50, GitHub total request time: 6,3 min

Github requests time in ms: 376292(98,18%)
ANTLR parsing time in ms: 1688(0,44%)
Parsing tree string finding: 448(0,12%)
Whole time: 383269
Number of all found queries: 2(from 100 files)
Number of TSql queries: 0
Number of Postgre SQL queries: 2
Number of PlSql queries: 0
Number of MySql queries: 0
New offset to use for query: 200

Batch size: 100, GitHub total request time: 5,2 min

Github requests time in ms: 312213(95,48%)
ANTLR parsing time in ms: 5346(1,63%)
Parsing tree string finding: 5180(1,58%)
Whole time: 326986
Number of all found queries: 36(from 100 files)
Number of TSql queries: 0
Number of Postgre SQL queries: 0
Number of PlSql queries: 0
Number of MySql queries: 36
New offset to use for query: 300
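
To compare the three runs directly, the "Whole time" values can be converted to seconds per file. The sketch below uses only the numbers reported above (100 files per run); the class RunStats is a hypothetical helper, not part of the project.

```java
import java.util.Locale;

// Hypothetical helper that normalizes the reported run times.
final class RunStats {
    static double secondsPerFile(long wholeTimeMs, int files) {
        if (files <= 0) {
            throw new IllegalArgumentException("files must be > 0");
        }
        return wholeTimeMs / 1000.0 / files;
    }

    public static void main(String[] args) {
        int[] batchSizes = {10, 50, 100};
        // "Whole time" values in ms, copied from the three runs above.
        long[] wholeTimeMs = {1_068_991L, 383_269L, 326_986L};
        for (int i = 0; i < batchSizes.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "batch=%d: %.2f s per file",
                    batchSizes[i], secondsPerFile(wholeTimeMs[i], 100)));
        }
        // prints:
        // batch=10: 10.69 s per file
        // batch=50: 3.83 s per file
        // batch=100: 3.27 s per file
    }
}
```

Each run covers a different offset range, so the figures compare different sets of files and should be read as indicative rather than exact.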