The Grader's Assistant

Using LLMs for Effective Rubric Design and Grading

Introduction & Goals

Today's exercise is to use an LLM as a partner in the grading process. This involves two distinct, high-level skills:

  1. Prompt Engineering: Designing a high-quality rubric (which is a form of "mega prompt").
  2. Critical Analysis: Evaluating the AI's application of that rubric.

Part 1: Generating the Rubric (v1.0)

Before you can grade, you need a standard. The first step is to use an LLM as a collaborator to brainstorm and draft a comprehensive rubric for a research proposal.

Task 1: Brainstorm Criteria

Ask the LLM to help you define what makes a "good" project proposal.

Example Prompt
Act as an experienced professor. What are the key criteria you use to evaluate a graduate-level research project proposal? List the criteria and briefly explain what you look for in each one.

Task 2: Draft "Rubric v1.0"

Use the brainstormed list to have the LLM generate a structured, point-based rubric. Save this as "Rubric v1.0" in a new Google Doc in your class folder.

Example Prompt
Based on the criteria we just discussed (e.g., Novelty, Feasibility, Methodology, Clarity, Impact), generate a formal grading rubric. The rubric MUST be out of 100 points total. Assign a point value to each criterion, ensuring they all sum up to 100. For each criterion, create a 5-point scoring system (or similar) with clear descriptors for what a low, medium, and high score means.
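
Optional Code Sketch (Python)

LLMs often produce rubrics whose point values don't actually sum to 100 (a flaw the iteration prompt in Part 3 anticipates). If you want a quick mechanical check before grading, a few lines of Python are enough. This is only a sketch: the criterion names and point values below are placeholders for whatever your Rubric v1.0 actually contains, and you can re-run the same check on v2.0 later.

# Sanity-check that the rubric's point values sum to exactly 100.
# Placeholder criteria/values -- substitute the ones from your Rubric v1.0.
rubric_v1 = {
    "Novelty": 20,
    "Feasibility": 20,
    "Methodology": 30,
    "Clarity": 10,
    "Impact": 20,
}

total = sum(rubric_v1.values())
print(f"Total points: {total}")
assert total == 100, "Point values must sum to exactly 100 -- fix the rubric before grading."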

Part 2: Testing the Rubric (Applying v1.0)

A rubric is a "prompt" for a grader. Now, you will test how well the AI performs as the grader when given your rubric as its instructions.

Task: Grade Two Proposals

Go to the class folder and select two project proposals (not your own). For each proposal, start a new, fresh LLM session (to avoid context leaks) and use the prompt below.

Example Prompt
Act as a diligent teaching assistant. Your task is to grade the attached project proposal using the *exact* rubric I am providing.

[Attach your "Rubric v1.0" Google Doc here]

---

Now, grade the attached proposal. For each criterion in the rubric:
1. Provide a specific score (e.g., "Feasibility: 20/25").
2. Write a 2-3 sentence justification for that score, citing specific evidence or quotes from the proposal.
Then, for the proposal as a whole:
3. Conclude with a summary of the proposal's main strengths and weaknesses.
4. Provide a "Total Score" out of 100.

[Attach the project proposal here]
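
Optional Code Sketch (Python)

A "new, fresh session" per proposal is easy in the chat interface, but if you prefer scripting, the same idea is just a separate, stateless API call for each proposal. This is a minimal sketch, assuming the OpenAI Python SDK, an API key in your environment, and plain-text copies of your rubric and proposals; the file names and model name are placeholders, and any comparable chat API would work the same way.

# One independent API call per proposal = one fresh session per proposal.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
rubric = Path("rubric_v1.txt").read_text()

for proposal_file in ["proposal_a.txt", "proposal_b.txt"]:  # placeholder file names
    proposal = Path(proposal_file).read_text()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Act as a diligent teaching assistant. Grade the proposal "
                        "using the exact rubric provided, criterion by criterion."},
            {"role": "user", "content": f"RUBRIC:\n{rubric}\n\nPROPOSAL:\n{proposal}"},
        ],
    )
    print(f"--- {proposal_file} ---")
    print(response.choices[0].message.content)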

Part 3: Analysis & Refinement (Iterating to v2.0)

This is the most critical step. You must now act as the "professor" again, evaluating the "TA" (the AI) and the tool you gave it (the rubric).

Task 1: Analyze the AI-Generated Grades

Read the AI's graded outputs and compare them to the proposals. Ask yourself:

  • Validity: Are the scores fair? Does the AI's justification make sense, or is it superficial?
  • Hallucination: Did the AI cite evidence, such as direct quotes, that wasn't actually in the proposal? (A quick check for literal quotes is sketched after this list.)
  • Misinterpretation: Did the AI misunderstand a criterion? (e.g., "I said 'Impact,' but the AI only focused on 'Novelty'.")
  • Gaps: What did the AI *miss*? What important feedback is missing because the rubric didn't ask for it?
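
Optional Code Sketch (Python)

For the hallucination question above, fabricated quotes are the easiest kind to catch mechanically: if the AI's justification quotes the proposal, the quoted text should appear verbatim in the document. This sketch assumes you have a plain-text copy of the proposal and that you paste the AI's quoted phrases into the list below (both are placeholders); it will not catch fabricated paraphrases or misattributed claims, so you still need to read the justifications yourself.

# Crude check: does each quote the AI cited literally appear in the proposal?
from pathlib import Path

proposal = Path("proposal_a.txt").read_text().lower()  # placeholder file name
cited_quotes = [
    "we will recruit 40 participants",    # placeholder quotes copied from
    "the dataset is publicly available",  # the AI's justifications
]

for quote in cited_quotes:
    if quote.lower() in proposal:
        print(f'FOUND: "{quote}"')
    else:
        print(f'NOT FOUND (possible hallucination): "{quote}"')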

Task 2: Refine the Rubric (AI-Assisted Iteration)

Based on your analysis, your rubric has flaws. Use the LLM (e.g., in the Canvas) to fix them. Save the result as "Rubric v2.0" in your Google Doc.

Example Iteration Prompt
[Attach your "Rubric v1.0" Google Doc] --- You are a prompt engineer. I have attached a grading rubric I designed, but it has flaws. When I tested it, the AI was too lenient on 'Methodology'. Also, the points don't add up to 100. Please modify the rubric to create 'Rubric v2.0'. Specifically: 1. In the 'Methodology' section, add a sub-point that explicitly checks for "discussion of limitations." 2. Adjust the point values for all criteria (e.g., Methodology: 30, Feasibility: 20, etc.) so that they logically add up to exactly 100 points. 3. Add a new criterion called "Clarity of Writing" (worth 10 points) and adjust the other points accordingly.

Part 4: Final Application - Grading the Class

Your rubric is now refined and tested. The final step is to apply it to a larger set of proposals to evaluate its performance at scale and record the grades.

Task: Grade Five Proposals

Go to the class proposals folder and select five other proposals (not your own and not the ones you used for testing), plus your own submitted proposal. For each proposal, use your refined "Rubric v2.0" to generate a grade, and use a new, fresh LLM session for each one. Record each Total Score as you go (the Reflection asks you to compare them in the spreadsheet). Don't know which ones to grade? Try asking your LLM to pick 5 random ones for you.

Example Prompt
Act as a diligent teaching assistant. Your task is to grade the attached project proposal using the *exact* rubric I am providing.

[Attach your final "Rubric v2.0" Google Doc here]

---

Now, grade the attached proposal. For each criterion in the rubric:
1. Provide a specific score.
2. Write a 2-3 sentence justification for that score, citing evidence.
Then, for the proposal as a whole:
3. Conclude with a summary of strengths and weaknesses.
4. Provide a "Total Score" out of 100.

[Attach the project proposal here]
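
Optional Code Sketch (Python)

To make "record the grades" (and the spreadsheet check in the Reflection) less tedious, you can scrape each Total Score out of the saved grading outputs and collect them in a CSV. This sketch assumes you saved each AI output as a .txt file in a graded_outputs/ folder and that the outputs follow the prompt's "Total Score ... /100" wording; the folder name, file layout, and regex are all assumptions you may need to adjust.

# Collect "Total Score: NN/100" lines from saved outputs into grades_v2.csv.
import csv
import re
from pathlib import Path

score_pattern = re.compile(r"Total Score\D{0,5}(\d{1,3})\s*/\s*100", re.IGNORECASE)

rows = []
for output_file in sorted(Path("graded_outputs").glob("*.txt")):
    match = score_pattern.search(output_file.read_text())
    rows.append({"proposal": output_file.stem,
                 "total_score": match.group(1) if match else "NOT FOUND"})

with open("grades_v2.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["proposal", "total_score"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to grades_v2.csv")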

Reflection

For Class Discussion

Think about the following questions. We will use them as a basis for our class discussion.

  • What was the biggest flaw in your "Rubric v1.0" that you had to fix?
  • Where did the AI grader perform well? Where did it fail? (e.g., Was it good at spotting typos but bad at judging "novelty"?)
  • How consistent was the AI in applying your rubric across all five proposals in Part 4? (Check the spreadsheet).
  • What are the ethical implications of using an AI to grade subjective work like this?
  • What is the difference between using an AI to *grade* (give a score) versus *give feedback* (provide formative comments)? Which is more appropriate?