Char-rnn: Generating Java code snippets

Experimenting with a char-rnn to generate realistic Java code


Generating realistic code

Karpathy's blog post on recurrent neural networks has lots of awesome examples of how to apply this architecture to various use cases. Recurrent neural networks let us process sequential data. Unlike regular neural networks, which need fixed-size input and output vectors, RNNs can take sequences of vectors as inputs or outputs. That means you can have different kinds of mappings between inputs and outputs (i.e., one-to-many, many-to-one, many-to-many), whereas a regular neural network only gives you a one-to-one mapping because its input and output vectors are a fixed size. This change means we can model temporal behavior and use internal memory to process new inputs. Karpathy's post does an excellent job of explaining all the glory of the RNN.
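
To make the recurrence concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. The weight names and toy dimensions are my own placeholders, not code from Karpathy's post or the repos mentioned below:

```python
import numpy as np

# One step of a vanilla RNN: the hidden state h carries memory of the
# sequence seen so far, so the same cell can consume inputs one at a time.
def rnn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)  # update internal memory
    y = W_hy @ h + b_y                           # scores for the next character
    return y, h

# Toy dimensions: a 65-character vocabulary and 128 hidden units.
vocab_size, hidden_size = 65, 128
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.01, size=(vocab_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(vocab_size)

h = np.zeros(hidden_size)
x = np.zeros(vocab_size); x[0] = 1.0             # one-hot encoded character
y, h = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```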

A particular use case that Karpathy discussed in that post was generating code using a character-level RNN. This type of RNN takes individual characters and models the probability of the next character given the sequence of characters seen so far. He gives the example of modeling how to generate the word "hello". The model takes in four training samples: "h", "he", "hel", "hell" (spelled out in the snippet after this list). From these training examples, the model should learn that:

  • "e" is the most probable character after the sequence "h"
  • "l" is the most probable character after the sequence "he"
  • "l" is the most probable character after the sequence "hel"
  • "o" is the most probable character after the sequence "hell"
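
Written as code, the training pairs for the "hello" example look something like this (a toy sketch, not the actual training script):

```python
# Each prefix of "hello" is an input sequence and the character that
# follows it is the prediction target.
text = "hello"
vocab = sorted(set(text))                     # ['e', 'h', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

pairs = [(text[:i], text[i]) for i in range(1, len(text))]
# [('h', 'e'), ('he', 'l'), ('hel', 'l'), ('hell', 'o')]

for context, target in pairs:
    print(f"after {context!r} the model should put the most probability "
          f"on {target!r} (index {char_to_ix[target]})")
```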

Training

Karpathy has a GitHub repo with Torch code implementing this char-rnn, but I decided to use another GitHub repo that adapts his code to TensorFlow. Instead of using the sample Shakespeare corpus they provided, I downloaded this giant dump of Java projects found on GitHub. I flattened the entire directory structure and concatenated all the files together to create a giant corpus of Java code. This text file ended up being 6 GB! To start, I trained on only 40,000 lines from the data file so that training wouldn't take too long. Using LSTM cells, 4 hidden layers of 128 neurons each, a batch size of 20, and a small learning rate of 0.002, I was able to get the perplexity on the validation data down to 3.62. A low perplexity means that the probability distribution found by the model is good at predicting the next character.
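
The flatten-and-concatenate step boils down to something like the sketch below. The java-projects/ source directory and the data/java/input.txt output path are placeholders (the char-rnn repos generally train on a single input.txt per data directory), not the exact script I ran:

```python
import os

# Walk the downloaded dump of Java projects and append every .java file
# to one big training corpus.
src_root = "java-projects/"            # assumed location of the GitHub dump
corpus_path = "data/java/input.txt"    # single flat file for the trainer

os.makedirs(os.path.dirname(corpus_path), exist_ok=True)
with open(corpus_path, "w", encoding="utf-8") as corpus:
    for dirpath, _, filenames in os.walk(src_root):
        for name in filenames:
            if name.endswith(".java"):
                with open(os.path.join(dirpath, name),
                          encoding="utf-8", errors="ignore") as f:
                    corpus.write(f.read())
                    corpus.write("\n")
```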

Using the text "* This is free software;" as the seed, sampling from this model gives the following results:

Sampled text is:
* This is free software;

@TestAnnotationInClass.caldRandom16();
        } else
elements.put();
System.outline text = cannot v = new Clpr class(Double edge,
Dependency();
Coothain isInfo.reconsumAffinitorLimyTot(project);
}

@Overridable multer
createBuilder.bundle(23, t2, ret + ")
*
number = nrai = wigtels;
String fifelinations = parent.getAbstract();
p; throws ParametersForField(Token e : functionAi.main.classLaben());
if (p << ACTIONS);
sickList.add(mNamesTriple);
return new Lide(B

This returns a sample of 600 characters. The model understands that statements need to end in semicolons. It also seems to have some sense of how to declare different kinds of variables, use annotations, and construct an if statement near the end.
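
For context on how that sample is produced: the network is primed with the seed text, then the next character is repeatedly drawn from the predicted distribution. A rough sketch of that loop, where next_char_distribution(model, text) is a hypothetical helper standing in for whatever forward-pass call your repo exposes:

```python
import numpy as np

# next_char_distribution(model, text) is assumed to return (probs, vocab):
# the model's probabilities over its character vocabulary given the text so
# far. The real call depends on which char-rnn implementation you use.
def sample_text(model, seed, length=600, rng_seed=0):
    rng = np.random.default_rng(rng_seed)
    text = seed
    for _ in range(length):
        probs, vocab = next_char_distribution(model, text)
        text += rng.choice(list(vocab), p=probs)   # sample, don't just argmax
    return text

# e.g. sample_text(model, "* This is free software;")
```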

Looking at the training and validation perplexity curves for this first model, you can see that it has mostly plateaued by the end of training. The training perplexity curve shows a slight uptick near the end, suggesting the model still has more to learn and could be underfit. We can make the next model more complex so it learns more patterns, without worrying too much about overfitting yet. Keeping in mind that only a sliver of the data was used, we could even train an overfit model on a small chunk of data and then let it loose on the whole dataset; the significant increase in data could help regularize that overfit model.
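
As a footnote on those curves: perplexity is just the exponential of the average per-character cross-entropy, so it can be read as the effective number of characters the model is still choosing between at each step. A quick sketch of the calculation:

```python
import numpy as np

def perplexity(true_next_char_probs):
    """exp(mean negative log-likelihood) of the characters that actually came next."""
    return float(np.exp(-np.log(true_next_char_probs).mean()))

# If the model assigns these probabilities to the true next characters,
# the perplexity works out to roughly 3.6, comparable to the 3.62 above.
print(perplexity(np.array([0.30, 0.25, 0.35, 0.22])))
```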