Auto-README — Java Summarizer

NLP · CODE SUMMARIZATION

Four ways to summarize Java source code

Upload a .java file and watch corpus-fitted extractive baselines, a semantic embedding model, and a fine-tuned transformer each generate a Java code summary, compared live side by side.

4 models

live inference

01 — Architecture

How a file becomes four summaries

Extractive models summarize the whole file from split statements. CodeT5 runs once per Java method, the same setup used in the CodeXGLUE evaluation.

Java upload

A single .java file

Preprocess

Split statements; CodeT5 splits by method

TF-IDF

Term scoring

LexRank

Graph centrality

Sentence-T

Embeddings

CodeT5

Generation

Summary

Four summaries

①

Preprocessing

Split on ; { } and newlines
Merge tiny fragments (< 3 tokens)
CamelCase / snake_case identifier splitting
Java keyword + English stopword filtering
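
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the app's code: the function names, the abbreviated keyword and stopword sets, and the choice to merge a tiny fragment into the statement before it are all assumptions.

```python
import re

# Assumed, abbreviated sets; the app's real keyword and stopword lists are not shown.
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "new", "class", "if", "else", "for", "while"}
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def split_statements(source, min_tokens=3):
    """Split on ; { } and newlines, then merge fragments under min_tokens."""
    fragments = [f.strip() for f in re.split(r"[;{}\n]", source) if f.strip()]
    merged = []
    for frag in fragments:
        if merged and len(frag.split()) < min_tokens:
            # Assumption: a tiny fragment is attached to its predecessor.
            merged[-1] = merged[-1] + " " + frag
        else:
            merged.append(frag)
    return merged


def split_identifier(name):
    """Split camelCase and snake_case identifiers into lowercase words."""
    words = []
    for part in name.split("_"):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def tokenize(statement):
    """Extract identifiers, split them, and drop keywords and stopwords."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", statement):
        tokens += split_identifier(ident)
    return [t for t in tokens if t not in JAVA_KEYWORDS and t not in STOPWORDS]
```

For example, tokenize("int userCount = getUserCount();") drops the keyword int and yields the words user, count, get, user, count.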

②

Corpus fitting

TF-IDF & LexRank IDF from CodeXGLUE Java train + validation
Weights cached to cache/idf_weights_train_val.pkl
Neural models use frozen pretrained checkpoints
One-time load, then served from memory
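
A sketch of the fit-once, cache, then reuse pattern described above. The smoothed IDF formula, the function names, and the dictionary layout of the pickle are assumptions; the page only states that the weights are fitted on the corpus and cached to a .pkl file.

```python
import math
import pickle
from pathlib import Path


def fit_idf(documents):
    """documents: list of token lists. Returns {term: smoothed IDF weight}."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Assumed smoothing: ln((1 + N) / (1 + df)) + 1, so no weight is zero.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def load_or_fit_idf(cache_path, documents):
    """Load cached weights if present; otherwise fit and write the cache."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    idf = fit_idf(documents)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(idf, fh)
    return idf
```

Calling load_or_fit_idf a second time with the same path skips fitting entirely, which is the behavior the "one-time load" bullet describes.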

③

Output

Extractive models return top-N statements from the whole file
CodeT5 generates one English sentence per method (evaluation setup)
Per-model latency tracked for each run
Results compared in a single view
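
Per-model latency tracking could look something like the sketch below. The run_with_latency helper and the shape of the result dictionary are hypothetical; each model is assumed to be a plain callable that takes source text and returns a summary.

```python
import time


def run_with_latency(models, source):
    """Run every summarizer on the same source and record wall-clock time.

    models: {name: callable(source) -> summary} (assumed interface).
    """
    results = {}
    for name, summarize in models.items():
        start = time.perf_counter()
        summary = summarize(source)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results[name] = {"summary": summary, "latency_ms": elapsed_ms}
    return results
```

Keeping all results in one dictionary keyed by model name makes it straightforward to render them in a single comparison view.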

02 — Models

The four summarizers

Each model represents a different tier of prior knowledge. Click a card to expand its step-by-step algorithm, strengths, and limitations.

TF-IDF · Extractive · Corpus-fitted extractive · Instant

Picks statements packed with rare, high-signal terms.

Scores each code statement by TF-IDF using IDF weights fitted on the CodeXGLUE Java train + validation corpus.

How it works

1. Fit inverse-document-frequency (IDF) weights over the Java train + validation corpus.
2. Tokenize each statement: split identifiers, lowercase, drop stopwords.
3. Score every statement as the sum of term frequency × IDF.
4. Return the top-N highest-scoring statements as the summary.
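
Steps 3 and 4 can be sketched as follows, assuming statements are already tokenized and idf is a plain {term: weight} dictionary. The function names and the default weight for terms missing from the corpus are assumptions.

```python
from collections import Counter


def tfidf_score(tokens, idf, default_idf=1.0):
    """Sum of term frequency x IDF over the distinct terms of one statement."""
    counts = Counter(tokens)
    # Assumption: terms unseen in the corpus fall back to default_idf.
    return sum(tf * idf.get(term, default_idf) for term, tf in counts.items())


def top_n_statements(statements, tokenized, idf, n=3):
    """Return the n highest-scoring statements, best first."""
    scores = [tfidf_score(tokens, idf) for tokens in tokenized]
    order = sorted(range(len(statements)), key=lambda i: scores[i], reverse=True)
    return [statements[i] for i in order[:n]]
```

Because the score is a sum rather than an average, longer statements tend to score higher; a length-normalized variant would be a reasonable alternative.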

Strengths

Fast and fully offline
Interpretable scores
No GPU required

Limitations

Output is code-like, not prose
Ignores word order and context

Input: Statement fragments

LexRank · Extractive · Corpus-fitted extractive · Fast

Selects statements most representative of the whole file.

Builds a similarity graph over statements and runs PageRank to pick the most central fragments.

How it works

1. Build a TF-IDF vector for each statement using shared corpus IDF.
2. Compute pairwise cosine similarity to form a statement graph.
3. Threshold weak edges, then run PageRank over the graph.
4. Return the most central (highest-ranked) statements.
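
A compact sketch of the graph-ranking steps above, using sparse dictionaries as TF-IDF vectors and plain power iteration for PageRank. The threshold, damping factor, iteration count, and unweighted edges are assumed values and choices, not settings taken from the app.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; unseen terms get weight 1.0."""
    return {t: tf * idf.get(t, 1.0) for t, tf in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def lexrank(tokenized, idf, threshold=0.1, damping=0.85, iterations=50):
    """Return one centrality score per statement (scores sum to 1)."""
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    n = len(vectors)
    if n == 0:
        return []
    # Keep an unweighted edge only where similarity clears the threshold.
    adj = [[1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Statements with no edges spread their rank evenly over all nodes.
        dangling = sum(rank[i] for i in range(n) if degree[i] == 0)
        new_rank = []
        for j in range(n):
            incoming = sum(rank[i] * adj[i][j] / degree[i]
                           for i in range(n) if degree[i])
            new_rank.append((1 - damping) / n
                            + damping * (incoming + dangling / n))
        rank = new_rank
    return rank
```

The summary would then be the statements with the largest scores. With only one or two statements the graph carries little information, which matches the limitation listed for this model.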

Strengths

Captures redundancy / centrality
Offline and interpretable
Robust on longer files

Limitations

Needs several statements to rank
Still extractive, not generative

Input: Statement fragments

SentenceTransformers · Extractive · General-language pretrained · Moderate

Uses semantic meaning to find the most central statements.

Encodes statements with all-MiniLM-L6-v2 and selects those closest to the centroid embedding.

How it works

1. Embed each statement with the all-MiniLM-L6-v2 transformer.
2. Average the embeddings into a single centroid vector.
3. Rank statements by cosine similarity to the centroid.
4. Return the statements closest to the semantic center.
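
Steps 2 to 4 can be illustrated without loading the model, by treating embeddings as plain lists of floats. In the app they would presumably come from something like SentenceTransformer("all-MiniLM-L6-v2").encode(statements); the helper names below are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_centroid(statements, embeddings, n=3):
    """Return the n statements whose embeddings lie closest to the centroid."""
    center = centroid(embeddings)
    order = sorted(range(len(statements)),
                   key=lambda i: cosine(embeddings[i], center),
                   reverse=True)
    return [statements[i] for i in order[:n]]
```

With toy 2-D vectors [1, 0], [0.9, 0.1], and [0, 1], the second statement ranks first because it points most nearly in the direction of the mean.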

Strengths

Understands English semantics
Order-aware encoder
No corpus fitting needed

Limitations

Pretrained on prose, not code
Heavier than TF-IDF/LexRank

Input: Statement fragments

CodeT5 · Abstractive · Code-specific fine-tuned · Slowest

Writes a fresh English sentence describing the code.

Generates natural-language summaries from raw Java source using a CodeT5 checkpoint fine-tuned on CodeXGLUE.

How it works

1. Split the file into individual Java methods.
2. Byte-level BPE tokenize each method (first 256 tokens).
3. Decode with beam search: one English sentence per method, same as evaluation.
4. Show each method summary separately in the results view.
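
Step 1 could be approximated with a simple brace-matching pass like the one below. This is a naive sketch under stated assumptions, not the app's splitter: split_methods is a hypothetical name, braces inside strings or comments will confuse it, and methods of nested classes are skipped. A real parser would be more reliable.

```python
def split_methods(source):
    """Return blocks opened directly inside a class body whose header has '('.

    Assumption: a method is any depth-1 block whose header text contains
    parentheses, which also matches constructors.
    """
    methods = []
    depth = 0
    header_start = 0      # index just after the last ; { or } at class level
    method_start = None
    for i, ch in enumerate(source):
        if ch == "{":
            if depth == 1 and "(" in source[header_start:i]:
                method_start = header_start
            depth += 1
            if depth <= 1:
                header_start = i + 1
        elif ch == "}":
            depth -= 1
            if depth == 1:
                if method_start is not None:
                    methods.append(source[method_start:i + 1].strip())
                    method_start = None
                header_start = i + 1
        elif ch == ";" and depth == 1:
            header_start = i + 1
    return methods
```

Each returned method would then be tokenized, truncated to the 256-token window, and passed to the model for beam-search decoding; those steps depend on the specific checkpoint and are not shown here.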

Strengths

True natural-language output
Fine-tuned on Java code-comment pairs
Best quality summaries

Limitations

Slow on CPU
256-token input limit
Can hallucinate details

Input: Per-method Java source (256-token window each)

03 — Try it

Summarize your Java file

Upload a .java file. Extractive models use the whole file; CodeT5 summarizes each method separately.
