Preprocessing
- Split on ; { } and newlines
- Merge tiny fragments (< 3 tokens)
- CamelCase / snake_case identifier splitting
- Java keyword + English stopword filtering
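The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the whitespace-based token count, the choice to merge a tiny fragment into the previous one, and the short keyword and stopword sets are all assumptions made for the example.

```python
import re

# Illustrative subsets only; the real keyword and stopword lists are not given.
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "new", "class", "if", "else", "for", "while"}
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "in"}


def split_statements(source, min_tokens=3):
    """Split on ; { } and newlines, then merge fragments under min_tokens tokens."""
    fragments = [f.strip() for f in re.split(r"[;{}\n]", source) if f.strip()]
    merged = []
    for frag in fragments:
        # Assumption: a tiny fragment is appended to the preceding fragment.
        if merged and len(frag.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + frag
        else:
            merged.append(frag)
    return merged


def split_identifier(name):
    """Split camelCase and snake_case identifiers into lowercase words."""
    words = []
    for part in name.split("_"):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def tokenize(statement):
    """Extract identifiers, split them, and drop keywords and stopwords."""
    terms = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", statement):
        terms += split_identifier(ident)
    return [t for t in terms if t not in JAVA_KEYWORDS and t not in STOPWORDS]


source = "public int getUserCount() {\n  int userCount = users.size();\n  return userCount;\n}"
statements = split_statements(source)
tokens = [tokenize(s) for s in statements]
```

Here the two-token fragment `return userCount` falls under the three-token threshold, so it is folded into the statement before it.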
Upload a .java file and watch corpus-fitted extractive baselines,
a semantic embedding model, and a fine-tuned transformer each generate a
Java code summary, compared live side by side.
Extractive models summarize the whole file from split statements. CodeT5 runs once per Java method, the same setup used in the CodeXGLUE evaluation.
Pipeline: a single .java file -> split statements (CodeT5 splits by method) -> term scoring, graph centrality, embeddings, generation -> four summaries
IDF weights file: cache/idf_weights_train_val.pkl
Each model represents a different tier of prior knowledge. Click a card to expand its step-by-step algorithm, strengths, and limitations.
Picks statements packed with rare, high-signal terms.
Scores each code statement by TF-IDF using IDF weights fitted on the CodeXGLUE Java train + validation corpus.
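As a rough illustration of TF-IDF statement scoring, the sketch below ranks tokenized statements against a precomputed IDF table. The `idf` dictionary here is a hypothetical toy table standing in for the corpus-fitted weights; the length normalization, the default weight for unseen terms, and the function names are assumptions rather than details taken from the page.

```python
from collections import Counter


def tfidf_score(tokens, idf, default_idf=0.0):
    """Sum of (term frequency * IDF) over the distinct terms of one statement."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # Assumption: term frequency is normalized by statement length.
    return sum((c / total) * idf.get(t, default_idf) for t, c in counts.items())


def top_statements(tokenized, idf, k=2):
    """Return indices of the k highest-scoring statements, in source order."""
    scores = [tfidf_score(t, idf) for t in tokenized]
    ranked = sorted(range(len(tokenized)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical IDF weights: rarer terms get larger values.
idf = {"get": 0.5, "user": 1.0, "cache": 3.0, "evict": 4.0}
tokenized = [["get", "user"], ["evict", "cache", "user"], ["get", "get"]]
best = top_statements(tokenized, idf, k=1)
```

The second statement wins because it contains the rarest terms, which matches the "rare, high-signal terms" intuition.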
Selects statements most representative of the whole file.
Builds a similarity graph over statements and runs PageRank to pick the most central fragments.
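One way to picture the graph step is the sketch below, which builds a weighted similarity graph over tokenized statements and runs a plain power-iteration PageRank. The page does not say which similarity measure or damping factor is used, so bag-of-words cosine similarity, a damping of 0.85, and all function names are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted adjacency matrix."""
    n = len(weights)
    if n == 0:
        return []
    rank = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            # Nodes with no edges pass on no rank (a simplification).
            incoming = sum(rank[j] * weights[j][i] / out_sum[j]
                           for j in range(n) if out_sum[j] > 0)
            new_rank.append((1 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


def central_statements(tokenized, k=2):
    """Return indices of the k most central statements, in source order."""
    n = len(tokenized)
    weights = [[cosine(tokenized[i], tokenized[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    ranks = pagerank(weights)
    ranked = sorted(range(n), key=lambda i: ranks[i], reverse=True)
    return sorted(ranked[:k])


tokenized = [["user", "cache"], ["user", "cache", "evict"], ["log", "debug"]]
central = central_statements(tokenized, k=2)
```

The third statement shares no terms with the others, so it receives only the baseline rank and is not selected.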
Uses semantic meaning to find the most central statements.
Encodes statements with all-MiniLM-L6-v2 and selects those closest to the centroid embedding.
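The centroid selection can be sketched as below, given one embedding vector per statement. With the sentence-transformers library the vectors would typically come from `SentenceTransformer("all-MiniLM-L6-v2").encode(statements)`; here tiny hand-written 2-D vectors stand in so the example runs without downloading a model. Using cosine similarity as the notion of "closest" and the function name `centroid_select` are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid_select(embeddings, k=2):
    """Return indices of the k embeddings closest to their mean, in source order."""
    n = len(embeddings)
    if n == 0:
        return []
    dim = len(embeddings[0])
    centroid = [sum(v[d] for v in embeddings) / n for d in range(dim)]
    scores = [cosine(v, centroid) for v in embeddings]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy stand-ins for real 384-dimensional all-MiniLM-L6-v2 embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
closest = centroid_select(embeddings, k=1)
```

The middle vector is selected because it points most nearly in the direction of the average of all three.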
Writes a fresh English sentence describing the code.
Generates natural-language summaries from raw Java source using a CodeT5 checkpoint fine-tuned on CodeXGLUE.
Upload a .java file. Extractive models use the whole file; CodeT5 summarizes each method separately.