Commit 1885bbf9 authored by Ammar Harrat

Initial commit

MIT License
Copyright (c) 2018 Technion
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
This is my version of the [code2vec](https://github.com/tech-srl/code2vec) work for Python3. For now, it works only with the Keras implementation. Some basic changes made:
0) Added a Jupyter notebook with [preprocessing](pre-preprocessing.ipynb) of code snippets
1) Support for Python3 code, thanks to the [JB miner](https://github.com/JetBrains-Research/astminer)
2) Support for code embeddings (i.e., the vector before the last dense layer, which originally works only in the TF implementation)
3) Getting target and token embeddings by running a .sh script
4) Getting the top 10 synonyms for a given label
The rest of the README is almost the same as the original [code2vec](https://github.com/tech-srl/code2vec), but with some changes reflecting my implementation. Keep in mind that the original work offers a lot more (including models already trained on Java), so I really recommend working with it. Some file and folder names are hard-coded here, but anyone can work through them.
# Code2vec
A neural network for learning distributed representations of code.
This is made on top of the implementation of the model described in:
[Uri Alon](http://urialon.cswp.cs.technion.ac.il), [Meital Zilberstein](http://www.cs.technion.ac.il/~mbs/), [Omer Levy](https://levyomer.wordpress.com) and [Eran Yahav](http://www.cs.technion.ac.il/~yahave/),
"code2vec: Learning Distributed Representations of Code", POPL'2019 [[PDF]](https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf)
_**October 2018** - The paper was accepted to [POPL'2019](https://popl19.sigplan.org)_!
_**April 2019** - The talk video is available [here](https://www.youtube.com/watch?v=EJ8okcxL2Iw)_.
_**July 2019** - Added a `tf.keras` model implementation_.
An **online demo** is available at [https://code2vec.org/](https://code2vec.org/).
#### Only the Keras version is supported for now.
<center style="padding: 40px"><img width="95%" src="https://github.com/Kirili4ik/NL2ML/blob/master/c2v-arc.jpg" /></center>
Table of Contents
=================
* [Requirements](#requirements)
* [Quickstart](#quickstart)
* [Configuration](#configuration)
* [Features](#features)
* [Citation](#citation)
## Requirements
On Ubuntu:
* [Python3](https://www.linuxbabe.com/ubuntu/install-python-3-6-ubuntu-16-04-16-10-17-04) (>=3.6). To check the version:
> python3 --version
* TensorFlow - version 2.0.0 ([install](https://www.tensorflow.org/install/install_linux)).
To check TensorFlow version:
> python3 -c 'import tensorflow as tf; print(tf.\_\_version\_\_)'
* If you are using a GPU, you will need CUDA 10.0
([download](https://developer.nvidia.com/cuda-10.0-download-archive-base))
as this is the version that is currently supported by TensorFlow. To check CUDA version:
> nvcc --version
* For GPU: cuDNN (>=7.5) ([download](http://developer.nvidia.com/cudnn)). To check cuDNN version:
> cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
* For creating a new dataset (any operation that requires parsing of a new code example) - [JetBrains astminer](https://github.com/JetBrains-Research/astminer/tree/master-dev/astminer-cli) (their CLI is already included [here](cd2vec/cli.jar)).
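To sanity-check the TensorFlow/CUDA requirements above from Python, a minimal sketch (note that `list_physical_devices` lives under `tf.config.experimental` in TF 2.0):
```python
import tensorflow as tf

print(tf.__version__)                    # expect 2.0.0
print(tf.test.is_built_with_cuda())      # True if this TF build supports CUDA
print(tf.config.experimental.list_physical_devices('GPU'))  # available GPUs, if any
```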
## Quickstart
### Step 0: Cloning this repository
```
git clone https://github.com/Kirili4ik/code2vec
cd code2vec
```
### Step 1: Creating a new dataset from Python sources
In order to have a preprocessed dataset to train a network on, you should create a new dataset of your own. It consists of three folders: train, test, and validation.
#### Creating and preprocessing a new Python dataset
In order to create and preprocess a new dataset (for example, to compare code2vec to another model on another dataset):
* Edit the file [preprocess.sh](preprocess.sh) using the instructions there, pointing it to the correct training, validation and test directories.
* Run the preprocess.sh file:
> source preprocess.sh
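For orientation, the upstream code2vec preprocess.sh starts with variables along these lines (names and defaults taken from the original script; this fork's astminer-based version may differ):
```
TRAIN_DIR=my_train_dir
VAL_DIR=my_val_dir
TEST_DIR=my_test_dir
DATASET_NAME=my_dataset
MAX_CONTEXTS=200
```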
### Step 2: Training a model
You should train a new model using a preprocessed dataset.
#### Training a model from scratch
To train a model from scratch:
* Edit the file [train.sh](train.sh) to point it to the right preprocessed data. By default,
it points to my "my_dataset" dataset that was preprocessed in the previous step.
* Before training, you can edit the configuration hyper-parameters in the file [config.py](config.py),
as explained in [Configuration](#configuration).
* Run the [train.sh](train.sh) script:
> source train.sh
##### Notes:
1. By default, the network is evaluated on the validation set after every training epoch.
2. The 10 newest model versions are kept (older ones are deleted automatically). This can be changed, but keeping more versions consumes more disk space.
3. By default, the network is training for 20 epochs.
These settings can be changed by simply editing the file [config.py](config.py). Because of the simplicity of the model, you may need lots and lots of data.
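Under the hood, train.sh is expected to boil down to a single command along these lines (the flags mirror the evaluation command in Step 3; the exact paths are assumptions):
```
python3 code2vec.py --framework keras --data data/my_dataset/my_dataset --save models/my_first_model/saved_model
```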
### Step 3: Evaluating a trained model
Once the score on the validation set stops improving over time, you can stop the training process (by killing it)
and pick the iteration that performed the best on the validation set.
Suppose that iteration #8 is our chosen model; run:
```
python3 code2vec.py --framework keras --load models/my_first_model/saved_model --test data/my_dataset/my_dataset.test.c2v
```
### Step 4: Manual examination of a trained model
To manually examine a trained model, run:
```
source my_predict.sh
```
After the model loads, follow the instructions: edit the file [Input.py](pred_files/Input.py), enter a Python
method or code snippet, and examine the model's predictions and attention scores.
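For example, you might paste a small method like the following into [Input.py](pred_files/Input.py) (a hypothetical snippet; any short Python function works):
```python
def count_lines(path):
    # Count the number of lines in a text file
    with open(path) as f:
        return sum(1 for _ in f)
```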
### Step 5: Getting embeddings
Follow Step 4, and the embedding for your snippet will be written to the [EMBEDDINGS.txt](cd2vec/EMBEDDINGS.txt) file.
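To load that vector back into Python, a minimal sketch (assuming the file holds a single bracketed, whitespace-separated vector, as this repo writes it):
```python
import numpy as np

# Strip the surrounding brackets from the dump, then parse the floats
with open('cd2vec/EMBEDDINGS.txt') as f:
    text = f.read().strip().lstrip('[').rstrip(']')
vec = np.array([float(x) for x in text.split()])
print(vec.shape)  # (384,) with the default sizes: 128 + 2 * 128
```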
### Step 6: Looking at synonyms
Run:
> python3 my_find_synonim.py --label 'linear|algebra'

Use any other label to see the labels closest to it.
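Conceptually this is a nearest-neighbor lookup over the exported target embeddings; a minimal sketch with gensim (assuming targets have already been exported as described under [Features](#features)):
```python
from gensim.models import KeyedVectors

# Load the exported target embeddings (word2vec text format)
vectors = KeyedVectors.load_word2vec_format('models/my_first_model/targets.txt', binary=False)
# Top-10 labels closest to the given label
print(vectors.most_similar(positive=['linear|algebra'], topn=10))
```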
## Configuration
Changing hyper-parameters is possible by editing the file
[config.py](config.py).
Here are some of the parameters and their description:
#### config.NUM_TRAIN_EPOCHS = 20
The max number of epochs to train the model. To stop earlier, kill the training process manually.
#### config.SAVE_EVERY_EPOCHS = 1
After how many training epochs the model is saved.
#### config.TRAIN_BATCH_SIZE = 1024
Batch size in training.
#### config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE
Batch size during evaluation. Affects only the evaluation speed and memory consumption; it does not affect the results.
#### config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10
Number of words with the highest scores in $\hat{y}$ to consider during prediction and evaluation.
#### config.NUM_BATCHES_TO_LOG_PROGRESS = 100
Number of batches (during training / evaluating) to complete between two progress-logging records.
#### config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100
Number of training batches to complete between model evaluations on the test set.
#### config.READER_NUM_PARALLEL_BATCHES = 4
The number of threads enqueuing examples to the reader queue.
#### config.SHUFFLE_BUFFER_SIZE = 10000
Size of the buffer that the reader uses to shuffle examples within during training.
A bigger buffer allows better randomness, but requires more memory and may harm training throughput.
#### config.CSV_BUFFER_SIZE = 100 * 1024 * 1024 # 100 MB
The buffer size (in bytes) of the CSV dataset reader.
#### config.MAX_CONTEXTS = 200
The number of contexts to use in each example.
#### config.MAX_TOKEN_VOCAB_SIZE = 1301136
The max size of the token vocabulary.
#### config.MAX_TARGET_VOCAB_SIZE = 261245
The max size of the target words vocabulary.
#### config.MAX_PATH_VOCAB_SIZE = 911417
The max size of the path vocabulary.
#### config.DEFAULT_EMBEDDINGS_SIZE = 128
Default embedding size to be used for token and path if not specified otherwise.
#### config.TOKEN_EMBEDDINGS_SIZE = config.EMBEDDINGS_SIZE
Embedding size for tokens.
#### config.PATH_EMBEDDINGS_SIZE = config.EMBEDDINGS_SIZE
Embedding size for paths.
#### config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE
Size of code vectors. With the defaults above, this is 128 + 2 * 128 = 384.
#### config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE
Embedding size for target words.
#### config.MAX_TO_KEEP = 10
Keep this number of newest trained versions during training.
#### config.DROPOUT_KEEP_RATE = 0.75
The keep rate for dropout during training: 0.75 means each unit is kept with probability 0.75 (i.e., dropped with probability 0.25).
#### config.SEPARATE_OOV_AND_PAD = False
Whether to treat `<OOV>` and `<PAD>` as two different special tokens whenever possible.
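For example, to train longer with smaller batches while keeping fewer checkpoints, the relevant lines in [config.py](config.py) could be edited like this (the values here are illustrative only):
```python
config.NUM_TRAIN_EPOCHS = 40                      # train for more epochs
config.TRAIN_BATCH_SIZE = 512                     # smaller batches use less GPU memory
config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE
config.MAX_TO_KEEP = 5                            # keep fewer checkpoint versions on disk
```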
## Features
Code2vec supports the following features:
### Releasing the model (possibly not working in this fork)
If you wish to keep a trained model for inference only (without the ability to continue training it), you can
release the model using:
```
python3 code2vec.py --load models/my_first_model/saved_model --release
```
This will save a copy of the trained model with the '.release' suffix.
A "released" model usually takes 3x less disk space.
### Exporting the trained token vectors and target vectors
The embeddings are saved without subtoken delimiters ("*toLower*" is saved as "*tolower*").
In order to export embeddings from a trained model, use:
> source my_get_embeddings.sh
This creates two files: [tokens.txt](models/my_first_model/tokens.txt) and [targets.txt](models/my_first_model/targets.txt).
This saves the tokens/targets embedding matrices in word2vec format to the specified text file, in which:
the first line is: \<vocab_size\> \<dimension\>
and each of the following lines contains: \<word\> \<float_1\> \<float_2\> ... \<float_dimension\>
These word2vec files can be manually parsed or easily loaded and inspected using the [gensim](https://radimrehurek.com/gensim/models/word2vec.html) python package:
```python
python3
>>> from gensim.models import KeyedVectors as word2vec
>>> vectors_text_path = 'models/java14_model/targets.txt' # or: 'models/java14_model/tokens.txt'
>>> model = word2vec.load_word2vec_format(vectors_text_path, binary=False)
>>> model.most_similar(positive=['equals', 'to|lower']) # or: 'tolower', if using the downloaded embeddings
>>> model.most_similar(positive=['download', 'send'], negative=['receive'])
```
## Citation
[code2vec: Learning Distributed Representations of Code](https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf)
```
@article{alon2019code2vec,
author = {Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran},
title = {Code2Vec: Learning Distributed Representations of Code},
journal = {Proc. ACM Program. Lang.},
issue_date = {January 2019},
volume = {3},
number = {POPL},
month = jan,
year = {2019},
issn = {2475-1421},
pages = {40:1--40:29},
articleno = {40},
numpages = {29},
url = {http://doi.acm.org/10.1145/3290353},
doi = {10.1145/3290353},
acmid = {3290353},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Big Code, Distributed Representations, Machine Learning},
}
```
[An example exported code vector: 384 whitespace-separated floats; full dump omitted]
def fit_rfmodelonfulldata_on_all_data_from_the_training_data():
    rf_model_on_full_data = RandomForestRegressor(n_estimators=1000, random_state=1)
    rf_model_on_full_data.fit(X, y)
def use_the_next_code_cell_to_print_the_first_five_rows_of_the_data():
    X_train.head()
def correlation_matrix():
    def plotCorrelationMatrix(df, graphWidth):
        filename = df.dataframeName
        df = df.dropna('columns')
        df = df[[col for col in df if df[col].nunique() > 1]]
        if df.shape[1] < 2:
            print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
            return
        corr = df.corr()
        plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
        corrMat = plt.matshow(corr, fignum=1)
        plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
        plt.yticks(range(len(corr.columns)), corr.columns)
        plt.gca().xaxis.tick_bottom()
        plt.colorbar(corrMat)
        plt.title(f'Correlation Matrix for {filename}', fontsize=15)
        plt.show()
def plotting():
    df[['Aerial Battles Won', 'Duels Won', 'Recoveries', 'Tackle Success %']] = df[['Aerial Battles Won', 'Duels Won', 'Recoveries', 'Tackle Success %']].fillna(0).astype(int)
    cm = sns.light_palette('orange', as_cmap=True)
    df.groupby('Club')['Aerial Battles Won', 'Duels Won', 'Tackle Success %', 'Recoveries'].sum().sort_values(by='Recoveries', ascending=False).head(20).style.background_gradient(cmap=cm)
def plotting():
    ts_fare_diff = log_ts_fare - log_ts_fare.shift()
    ts_fare_diff.dropna(inplace=True)
    t1 = plot_line(ts_fare_diff.index, ts_fare_diff['fare_amount'],
                   'blue', 'Differenced log series')
    lay = plot_layout('Differenced log series')
    fig = go.Figure(data=[t1], layout=lay)
    py.iplot(fig)
    stationary_test(ts_fare_diff)
def plotting():
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    import pandas as pd
def linear_algebra():
    import numpy as np
    import pandas as pd
    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))
    # Any results you write to the current directory are saved as output.
def test_main_results():
    """Test the results of the real arguments"""
    # Due to complexities testing with arguments to get full coverage
    # run the script externally with full arguments
    os.popen('python3 -m pip install -e .')
    os.popen(
        'python3 Examples/WSO.py -url cn1234.awtest.com -username citests -password hunter2 -tenantcode shibboleet'
    ).read()
    filename = "uem.json"
    assert AUTH.check_file_exists(filename) is True
    assert AUTH.verify_config(filename, 'authorization',
                              AUTH.encode("citests", "hunter2")) is True
    assert AUTH.verify_config(filename, 'url', "cn1234.awtest.com") is True
    assert AUTH.verify_config(filename, 'aw-tenant-code', "shibboleet") is True
def test_increment_minor(self):
    os.environ["RELEASE_TYPE"] = "minor"
    v1 = VersionUtils.increment(self.v1)
    v2 = VersionUtils.increment(self.v2)
    v3 = VersionUtils.increment(self.v3)
    v4 = VersionUtils.increment(self.v4)
    v5 = VersionUtils.increment(self.v5)
    v6 = VersionUtils.increment(self.v6)
    self.assertEqual(v1, "1!1.3.0")
    self.assertEqual(v2, "1.3.0")
    self.assertEqual(v3, "1.3.0")
    self.assertEqual(v4, "1.3")
    self.assertEqual(v5, "2014.1")
    self.assertEqual(v6, "2.2.0")
def test_cant_write_to_nonexisting_dir():
    with raises(IOError):
        test_dataset.save('/nonexistentrandomdir/jdknvoindvi93/arbitrary.noname.pkl')
def test_get_conll2000(self):
    raw = get_conll2000()
    self.assertIn('train', raw)
    self.assertEqual(len(raw['train']), 8_937)
    self.assertIn('test', raw)
    self.assertEqual(len(raw['test']), 2_013)
def test_lmod_purge(d):
    kern = install(d, "lmod --purge MOD3")
    #assert kern['argv'][0] == 'envkernel'  # defined above
    assert '--purge' in kern['ek'][3:]
    assert kern['ek'][-1] == 'MOD3'
def test_conda(d):
    kern = install(d, "conda /PATH/BBB")
    #assert kern['argv'][0] == 'envkernel'  # defined above
    assert kern['ek'][1:3] == ['conda', 'run']
    assert kern['ek'][-1] == '/PATH/BBB'
def test_use_previous_end_time_as_start_time(cli, entries_file):
    entries = """20/01/2014
alias_1 09:00-10:00 foobar
"""
    expected = """20/01/2014
alias_1 09:00-10:00 foobar
alias_1 10:00-? ?
"""
    entries_file.write(entries)
    with freeze_time('2014-01-20'):
        cli('start', ['alias_1'])
    assert entries_file.read() == expected
def test_max_unique_lines():
    sle = SimilarLogErrors()
    # Create random log lines
    max_lines = sle.MAX_COMMON_LINES
    log_text = ''
    for i in range(max_lines + 5):
        # random_words = [''.join(random.choices(string.ascii_uppercase + string.digits, k=5)) for i in range(40)]  # Generate 40 5-character words
        random_text = ''.join(random.choices(string.ascii_uppercase + string.digits, k=5)) * 50
        log_text += f'2017/10/10 00:00:34.251 ERROR {random_text}\n'
    # Confirm max_lines is honored properly
    with tempfile.TemporaryDirectory() as tmpdir:
        filepath = create_tmp_log_file(text=log_text, dir_path=tmpdir)
        variables = {}
        variables['LogFiles'] = [filepath]
        result = sle.run(variables)
        print(f"Result:\n{result}")
        assert 'Other Error Lines' in result
        assert 'Count: 5' in result
def test_embeddings_with_spacy(self):
    with self.assertRaises(ValueError):
        load_wv_with_spacy("wiki.da.small.swv")
    embeddings = load_wv_with_spacy("wiki.da.wv")
    sentence = embeddings('jeg gik ned af en gade')
    for token in sentence:
        self.assertTrue(token.has_vector)