# Project set up

1. Clone this repo
1. Follow these steps from this [page](https://cloud.google.com/bigquery/docs/authentication#client-libs)
 
 
 
Set up Application Default Credentials (ADC) in your local environment:
 
 
* Install the Google Cloud CLI, then initialize it by running the following command:
 
`gcloud init`
 
 
* Create local authentication credentials for your Google Account:
 
`gcloud auth application-default login`
 
 
* A login screen is displayed. After you log in, your credentials are stored in the local credential file used by ADC.
 
 
For more information about working with ADC in a local environment, see [Local development environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev).
 
 
 
1. Run `./gradlew build`
 
You should see: `BUILD SUCCESSFUL`
 
 
1. The project should now be ready
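
Before the first run, you can sanity-check that the ADC credentials file exists. The path below is the standard gcloud default on Linux/macOS (an assumption — adjust it for your OS):

```shell
# Check for the Application Default Credentials file created by
# `gcloud auth application-default login` (default location on Linux/macOS).
ADC_FILE="$HOME/.config/gcloud/application_default_credentials.json"
if [ -f "$ADC_FILE" ]; then
  echo "ADC credentials found"
else
  echo "Run: gcloud auth application-default login"
fi
```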
 
 
# Usage
 
 
The project can be run with: `./gradlew run`

This runs with the default configuration.


To change the configuration, use these command-line parameters:
 
 
1. **files** - number of files to fetch from GitHub

1. **batch** - number of files in one batch

1. **mode**

mode=1 - uses example files (for testing purposes)

mode=2 - fetches files from GitHub

mode=3 - deletes the local DB with queries (for testing purposes)

1. **offset** - number of already processed files (files from GitHub are ordered by repository name), so processed files won't be fetched again

1. **sample**

sample=1 - uses the sample GitHub collection (for testing purposes)

sample=2 - uses the original GitHub collection
 
 
Parameters can be used like this:
 
* `./gradlew run -Dfiles=100 -Dbatch=50 -Dmode=2 -Doffset=200 -Dsample=2`
 
* This command will fetch **100 files** from **GitHub** (mode=2) in two **50-file batches**, starting from **offset 200** (so files 200 to 300 are processed), using the **original GitHub collection** (sample=2)
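
A larger crawl can be split into several such runs by advancing **offset** after each one. The loop below is a hypothetical wrapper (not part of the project) that prints the command sequence for 1000 files in 100-file runs; replace `echo` with the actual call to execute it:

```shell
# Sketch: generate the run commands for 1000 files in 100-file windows,
# advancing the offset each time so processed files are not fetched again.
offset=0
total=1000
step=100
while [ "$offset" -lt "$total" ]; do
  # In a real crawl, drop the `echo` to actually execute each run.
  echo ./gradlew run -Dfiles=$step -Dbatch=$step -Dmode=2 -Doffset=$offset -Dsample=2
  offset=$(( offset + step ))
done
```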
 
 
During testing, I found that the most efficient **batch** size is **100**. In this configuration, fetching one batch (100 files) from GitHub takes around **5 minutes**. From this we can estimate that **10000 files** will take around **8 hours**.
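
The 8-hour figure follows from simple arithmetic (pure shell, no project code involved):

```shell
# 10000 files / 100 files per batch = 100 batches;
# at ~5 minutes per batch that is ~500 minutes, i.e. a bit over 8 hours.
files=10000
batch=100
min_per_batch=5
minutes=$(( files / batch * min_per_batch ))
echo "Estimated: $minutes minutes (~$(( minutes / 60 )) hours)"
# prints: Estimated: 500 minutes (~8 hours)
```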
 
 
Here are some test runs that support the estimates above:
 
 
Batch size: 10
 
GitHub total request time: 17.4 min
 
```
 
Github requests time in ms: 1049982(98,22%)
 
ANTLR parsing time in ms: 7533(0,70%)
 
Parsing tree string finding: 5841(0,55%)
 
Whole time: 1068991
 
Number of all found queries: 18(from 100 files)
 
Number of TSql queries: 0
 
Number of Postgre SQL queries: 15
 
Number of PlSql queries: 0
 
Number of MySql queries: 3
 
New offset to use for query: 100
 
```
 
Batch size: 50
 
GitHub total request time: 6.3 min
 
```
 
Github requests time in ms: 376292(98,18%)
 
ANTLR parsing time in ms: 1688(0,44%)
 
Parsing tree string finding: 448(0,12%)
 
Whole time: 383269
 
Number of all found queries: 2(from 100 files)
 
Number of TSql queries: 0
 
Number of Postgre SQL queries: 2
 
Number of PlSql queries: 0
 
Number of MySql queries: 0
 
New offset to use for query: 200
 
```
 
Batch size: 100
 
GitHub total request time: 5.2 min
 
```
 
Github requests time in ms: 312213(95,48%)
 
ANTLR parsing time in ms: 5346(1,63%)
 
Parsing tree string finding: 5180(1,58%)
 
Whole time: 326986
 
Number of all found queries: 36(from 100 files)
 
Number of TSql queries: 0
 
Number of Postgre SQL queries: 0
 
Number of PlSql queries: 0
 
Number of MySql queries: 36
 
New offset to use for query: 300
 
```
 