NLP · CODE SUMMARIZATION

Four ways to summarize Java source code

Upload a .java file and compare four summaries side by side: two corpus-fitted extractive baselines, a semantic embedding model, and a fine-tuned transformer, all running live.

4 models
live inference
01 — Architecture

How a file becomes four summaries

The three extractive models split the whole file into statements and select the most informative ones. CodeT5 runs once per Java method, the same setup used in the CodeXGLUE evaluation.

  • Java upload: a single .java file
  • Preprocess: split into statements; for CodeT5, split by method
  • TF-IDF: term scoring
  • LexRank: graph centrality
  • Sentence-Transformer: embeddings
  • CodeT5: generation
  • Summary: four summaries

Preprocessing

  • Split on ; { } and newlines
  • Merge tiny fragments (< 3 tokens)
  • CamelCase / snake_case identifier splitting
  • Java keyword + English stopword filtering
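
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the function names, the tiny stopword sets, and the choice to merge small fragments into the following statement are all assumptions made for the example.

```python
import re

# Hypothetical, heavily abbreviated stopword sets; the real lists are not shown on this page.
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return", "new", "class", "final"}
ENGLISH_STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

MIN_TOKENS = 3  # fragments with fewer whitespace-separated tokens get merged


def split_statements(source: str) -> list[str]:
    """Split on ; { } and newlines, then merge fragments shorter than MIN_TOKENS."""
    raw = [part.strip() for part in re.split(r"[;{}\n]", source)]
    merged: list[str] = []
    for fragment in filter(None, raw):
        if merged and len(merged[-1].split()) < MIN_TOKENS:
            # Previous fragment is tiny: glue the current one onto it.
            merged[-1] = merged[-1] + " " + fragment
        else:
            merged.append(fragment)
    # A tiny trailing fragment has no successor, so fold it into its predecessor.
    if len(merged) > 1 and len(merged[-1].split()) < MIN_TOKENS:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged


def tokenize(statement: str) -> list[str]:
    """Split identifiers on snake_case and CamelCase boundaries, lowercase, filter."""
    tokens: list[str] = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", statement):
        for chunk in identifier.split("_"):
            parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
            tokens.extend(part.lower() for part in parts)
    return [t for t in tokens if t not in JAVA_KEYWORDS and t not in ENGLISH_STOPWORDS]
```

For instance, `tokenize("int maxRetryCount = user_name.length()")` drops the keyword `int` and splits the two identifiers into `max`, `retry`, `count`, `user`, `name`, followed by `length`.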

Corpus fitting

  • IDF weights for TF-IDF and LexRank are fitted on the CodeXGLUE Java train + validation splits
  • Weights cached to cache/idf_weights_train_val.pkl
  • Neural models use frozen pretrained checkpoints
  • One-time load, then served from memory
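
A fit-once, cache-to-disk pattern like the one described above might look like this. The smoothed IDF formula, the function names, and the use of `pickle` are assumptions for illustration; only the idea of caching fitted weights to a `.pkl` file comes from the page.

```python
import math
import pickle
from collections import Counter
from pathlib import Path


def fit_idf(documents: list[list[str]]) -> dict[str, float]:
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1. One common variant, assumed here."""
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def load_or_fit_idf(cache_path: Path, documents: list[list[str]]) -> dict[str, float]:
    """Return cached weights if the file exists; otherwise fit, save, and return them."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    weights = fit_idf(documents)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(weights, fh)
    return weights
```

After the first call, later calls read the weights from disk and never touch the corpus, which is what makes the "one-time load" cheap.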

Output

  • Extractive models return top-N statements from the whole file
  • CodeT5 generates one English sentence per method (evaluation setup)
  • Per-model latency tracked for each run
  • Results compared in a single view
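
Per-model latency tracking can be as simple as wrapping each summarizer call in a timer. The sketch below assumes each model is exposed as a plain callable taking the source text; `run_all` and the result layout are hypothetical names, not the demo's API.

```python
import time
from typing import Callable


def run_all(summarizers: dict[str, Callable[[str], list[str]]], source: str) -> dict[str, dict]:
    """Run every summarizer on the same source and record wall-clock latency in ms."""
    results: dict[str, dict] = {}
    for name, summarize in summarizers.items():
        start = time.perf_counter()
        summary = summarize(source)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results[name] = {"summary": summary, "latency_ms": elapsed_ms}
    return results
```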
02 — Models

The four summarizers

Each model represents a different tier of prior knowledge. Each card below lists its step-by-step algorithm, strengths, and limitations.

TF-IDF · Extractive · Corpus-fitted extractive · Instant

Picks statements packed with rare, high-signal terms.

Scores each code statement by TF-IDF using IDF weights fitted on the CodeXGLUE Java train + validation corpus.

How it works

  1. Fit inverse-document-frequency (IDF) weights over the Java train + validation corpus.
  2. Tokenize each statement: split identifiers, lowercase, drop stopwords.
  3. Score every statement as the sum of term frequency × IDF over its terms.
  4. Return the top-N highest-scoring statements as the summary.
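
The scoring steps above can be sketched in a few lines. This assumes statements are already tokenized and an IDF table is available; the fallback weight for unseen terms and the decision to return the selected indices in source order are illustrative choices, not confirmed details of the demo.

```python
from collections import Counter


def tfidf_top_n(statements: list[list[str]], idf: dict[str, float],
                n: int = 3, default_idf: float = 1.0) -> list[int]:
    """Return indices of the n highest-scoring statements, in source order."""
    scores = []
    for tokens in statements:
        tf = Counter(tokens)
        # Statement score = sum over its terms of term frequency x IDF.
        scores.append(sum(count * idf.get(term, default_idf) for term, count in tf.items()))
    ranked = sorted(range(len(statements)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])
```

Because the score is a plain sum, longer statements tend to score higher; dividing by statement length is one possible adjustment if that bias matters.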

Strengths

  • Fast and fully offline
  • Interpretable scores
  • No GPU required

Limitations

  • Output is code-like, not prose
  • Ignores word order and context
Input: statement fragments
LexRank · Extractive · Corpus-fitted extractive · Fast

Selects statements most representative of the whole file.

Builds a similarity graph over statements and runs PageRank to pick the most central fragments.

How it works

  1. Build a TF-IDF vector for each statement using shared corpus IDF.
  2. Compute pairwise cosine similarity to form a statement graph.
  3. Threshold weak edges, then run PageRank over the graph.
  4. Return the most central (highest-ranked) statements.
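
A compact, dependency-free sketch of the graph-ranking steps above follows. The threshold value, damping factor, iteration count, and the use of unweighted edges after thresholding are assumptions; the page does not state which LexRank variant the demo uses.

```python
import math
from collections import Counter


def tfidf_vector(tokens: list[str], idf: dict[str, float], default_idf: float = 1.0) -> dict[str, float]:
    """Sparse TF-IDF vector for one tokenized statement."""
    return {term: count * idf.get(term, default_idf) for term, count in Counter(tokens).items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def lexrank(statements: list[list[str]], idf: dict[str, float], threshold: float = 0.1,
            damping: float = 0.85, iterations: int = 50) -> list[float]:
    """Return one centrality score per statement via PageRank power iteration."""
    vectors = [tfidf_vector(tokens, idf) for tokens in statements]
    n = len(vectors)
    if n == 0:
        return []
    # Keep an (unweighted) edge only where similarity clears the threshold.
    adj = [[1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] * scores[j] / degree[j] for j in range(n) if degree[j])
            for i in range(n)
        ]
    return scores
```

Statements with no surviving edges keep only the baseline `(1 - damping) / n` score, which is why the approach needs several related statements before the ranking becomes meaningful.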

Strengths

  • Captures redundancy / centrality
  • Offline and interpretable
  • Robust on longer files

Limitations

  • Needs several statements to rank
  • Still extractive, not generative
Input: statement fragments
Sentence-Transformer · Extractive · General-language pretrained · Moderate

Uses semantic meaning to find the most central statements.

Encodes statements with all-MiniLM-L6-v2 and selects those closest to the centroid embedding.

How it works

  1. Embed each statement with the all-MiniLM-L6-v2 transformer.
  2. Average the embeddings into a single centroid vector.
  3. Rank statements by cosine similarity to the centroid.
  4. Return the statements closest to the semantic center.
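
The centroid ranking in steps 2 to 4 can be sketched without any ML dependency by treating the embeddings as plain lists of floats. In the demo the vectors would presumably come from all-MiniLM-L6-v2 (for example through the sentence-transformers library's `SentenceTransformer.encode`); that call is left out so the example runs on its own, and `centroid_rank` is a hypothetical name.

```python
import math


def centroid_rank(embeddings: list[list[float]], n: int = 3) -> list[int]:
    """Return indices of the n embeddings most similar to their mean vector."""
    if not embeddings:
        return []
    dim = len(embeddings[0])
    centroid = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    sims = [cosine(vec, centroid) for vec in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: sims[i], reverse=True)[:n]
```

With the toy vectors `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]`, the second vector lies closest to the mean direction and ranks first.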

Strengths

  • Understands English semantics
  • Order-aware encoder
  • No corpus fitting needed

Limitations

  • Pretrained on prose, not code
  • Heavier than TF-IDF/LexRank
Input: statement fragments
CodeT5 · Abstractive · Code-specific fine-tuned · Slowest

Writes a fresh English sentence describing the code.

Generates natural-language summaries from raw Java source using a CodeT5 checkpoint fine-tuned on CodeXGLUE.

How it works

  1. Split the file into individual Java methods.
  2. Tokenize each method with byte-level BPE and keep the first 256 tokens.
  3. Decode with beam search to produce one English sentence per method, matching the evaluation setup.
  4. Show each method summary separately in the results view.
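
Step 1 is the only part that does not depend on the model library, so it is the one sketched here. This is a naive brace-depth heuristic written for illustration, not the demo's splitter: it ignores braces inside strings and comments, skips methods of nested classes, and would mistake a field initialized with an anonymous class for a method. Tokenization and beam search would then be applied to each returned snippet.

```python
def split_methods(source: str) -> list[str]:
    """Return the text of each top-level method or constructor in a single class."""
    methods: list[str] = []
    depth = 0            # current brace nesting; 1 means directly inside the class body
    header_start = 0     # index just past the last ; { or } seen at class-body depth
    method_start = None  # where the method currently being collected begins
    for i, ch in enumerate(source):
        if ch == "{":
            # A block opened in the class body whose header has a parameter
            # list is treated as a method or constructor.
            if depth == 1 and "(" in source[header_start:i]:
                method_start = header_start
            depth += 1
            if depth == 1:
                header_start = i + 1
        elif ch == "}":
            depth -= 1
            if depth == 1:
                if method_start is not None:
                    methods.append(source[method_start:i + 1].strip())
                    method_start = None
                header_start = i + 1
        elif ch == ";" and depth == 1:
            header_start = i + 1
    return methods
```

A production splitter would more likely rely on a real Java parser, which handles annotations, generics, and nested types correctly.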

Strengths

  • True natural-language output
  • Fine-tuned on Java code-comment pairs
  • Best summary quality of the four models

Limitations

  • Slow on CPU
  • 256-token input limit
  • Can hallucinate details
Input: per-method Java source (256-token window each)
03 — Try it

Summarize your Java file

Upload a .java file. Extractive models use the whole file; CodeT5 summarizes each method separately.

Drop a .java file here or click to browse. Files must be UTF-8 text; multi-method classes are supported.