Module 3 — Data and Architecture
Exam tactic. This module is more technical than the others. The exam asks about data flow and LLM limitations. A good study habit: sketch the data flow on paper and memorize the names of every limitation.
L01 — Data flow and processing
What data does GitHub Copilot use?
Copilot collects context to build a prompt for the LLM. Sources of context:
| Source | Description |
|---|---|
| Active file | Code around the cursor |
| Open files | Other files open in the IDE (limited) |
| Recent edits | Recently edited files and changes in the current session |
| Comments | Comments adjacent to code |
| Imports / requires | Top of the file (language dependencies) |
| # references | Sources the user added manually |
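The sources above can be pictured as a simple structure the extension fills in before building the prompt. This is purely an illustrative sketch; the real extension's data structures are internal and not a public API.

```python
# Hypothetical sketch of the kinds of context Copilot gathers.
# The keys mirror the table above; the real gathering logic is internal.
context = {
    "active_file": "code around the cursor",
    "open_files": ["other open tabs (limited)"],
    "recent_edits": ["recently changed snippets"],
    "comments": ["adjacent comments"],
    "imports": ["import lines at the top of the file"],
    "references": ["files the user attached with # references"],
}

# The prompt context is assembled from these pieces
# (order and size limits are internal to Copilot).
prompt_context = "\n".join(
    part
    for value in context.values()
    for part in ([value] if isinstance(value, str) else value)
)
print(prompt_context.splitlines()[0])  # → code around the cursor
```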
How is the data shared?
- Copilot Free, Pro, and Pro+: GitHub may use prompts and suggestions to improve the model (opt-out available in user settings).
- Copilot Business: prompts and suggestions are not used for model training.
- Copilot Enterprise: same as Business, plus broader governance and privacy controls.
Key exam point. Business and Enterprise do not use data for training. This is one of the most frequently tested facts on GH-300.
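For revision, the plan rules above reduce to a small lookup table. The policy strings are paraphrased from this module; always check the current GitHub documentation, since plan terms can change.

```python
# Data-sharing rules per plan, paraphrased from the notes above.
TRAINING_POLICY = {
    "Free": "may be used to improve the model (opt-out in user settings)",
    "Pro": "may be used to improve the model (opt-out in user settings)",
    "Pro+": "may be used to improve the model (opt-out in user settings)",
    "Business": "never used for model training",
    "Enterprise": "never used for model training",
}

def used_for_training(plan: str) -> bool:
    """True if prompts/suggestions on this plan may feed model training."""
    return TRAINING_POLICY[plan].startswith("may")

# The key exam fact: Business and Enterprise never train on your data.
assert not used_for_training("Business")
assert not used_for_training("Enterprise")
assert used_for_training("Free")
```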
Prompt building pipeline
1. User types code or sends a Chat message
2. Copilot extension gathers context (active file, open files, history)
3. Context is tokenized (turned into numbers the LLM understands)
4. Prompt is assembled: system instructions + context + user input
5. Prompt is sent to the GitHub Copilot service over HTTPS
6. Service forwards the prompt to the LLM
7. LLM generates a response
8. Proxy filters the response (safety filters)
9. Filtered response is returned to the IDE
10. IDE displays the suggestion to the user
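The ten steps above can be sketched end to end in a few lines. Every function here is a stand-in: the real service, proxy, and model are GitHub-internal, and the transport steps (HTTPS in both directions) are omitted.

```python
# Toy walk-through of the prompt-building pipeline. All functions are
# placeholders; nothing here is Copilot's actual implementation.
def gather_context(ide_state: dict) -> str:
    # Steps 1-2: collect active file and open files from the IDE.
    return ide_state["active_file"] + "\n" + "\n".join(ide_state["open_files"])

def tokenize(text: str) -> list[int]:
    # Step 3: real tokenizers map subwords to ids; raw bytes stand in here.
    return list(text.encode("utf-8"))

def build_prompt(system: str, context: str, user_input: str) -> str:
    # Step 4: system instructions + context + user input.
    return f"{system}\n{context}\n{user_input}"

def llm(prompt: str) -> str:
    # Steps 6-7: placeholder model output.
    return "suggested_code()"

def post_filter(suggestion: str) -> str:
    # Step 8: duplication / security / content checks would run here.
    return suggestion

ide_state = {"active_file": "def f():", "open_files": ["import os"]}
context = gather_context(ide_state)
tokens = tokenize(context)
prompt = build_prompt("You are Copilot.", context, "complete f")
suggestion = post_filter(llm(prompt))
print(suggestion)  # step 10: displayed in the IDE → suggested_code()
```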
The context window is finite. Copilot prioritizes:
- Code closest to the cursor gets the most weight.
- Semantically relevant code is prioritized.
- Older or more distant code is dropped if needed.
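"Closest to the cursor wins" can be illustrated with a toy truncation pass: rank lines by distance from the cursor line and keep only what fits a character budget. The real prioritization also weighs semantic relevance and works on tokens, not characters; this sketch shows only the distance rule.

```python
# Hypothetical illustration of context-window truncation:
# nearest-to-cursor lines are kept, distant lines are dropped.
def fit_window(lines: list[str], cursor_line: int, budget: int) -> list[str]:
    ranked = sorted(range(len(lines)), key=lambda i: abs(i - cursor_line))
    kept, used = set(), 0
    for i in ranked:
        if used + len(lines[i]) > budget:
            break  # older / more distant code is dropped
        kept.add(i)
        used += len(lines[i])
    return [lines[i] for i in sorted(kept)]  # preserve original order

lines = ["import os", "def helper():", "    pass", "def main():", "    x = 1"]
print(fit_window(lines, cursor_line=4, budget=25))
# → ['def main():', '    x = 1']
```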
Proxy filtering
Pre-processing (before the LLM):
- Content Exclusions check — excluded files are stripped.
- Privacy filters — possible sensitive data is detected.
Post-processing (after the LLM):
- Duplication Detection — checks whether the suggestion resembles known public code.
- Security filters — obvious vulnerabilities are flagged.
- Content filters — inappropriate or harmful content is removed.
Data flow — memorize this diagram
IDE (context) → Tokenize → Prompt build → HTTPS → GitHub service
→ Pre-processing (exclusions, privacy) → LLM → Post-processing
(duplication, security) → HTTPS → IDE (suggestion)
L02 — Suggestion lifecycle and LLM limitations
Suggestion lifecycle, step by step
- Trigger — typing (inline), Chat message, or action (Agent / Plan Mode).
- Context gathering — extension collects code, comments, imports, open files.
- Tokenization — context becomes tokens (~3–4 characters each).
- Prompt send — over HTTPS, with auth and subscription validation.
- LLM inference — model generates a token response.
- Filtering — duplication, security, content filters.
- Display in IDE — inline grey text or chat message.
- User decision — accept (Tab) or reject (Esc); anonymous feedback may be collected (individual plans).
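The ~3-4 characters-per-token rule of thumb from the tokenization step gives a quick way to estimate how much context fits a window. Real tokenizers (BPE-based) vary by language and content, so treat this as a rough heuristic only.

```python
# Rough token estimate using the ~3-4 chars/token rule of thumb above.
def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    return max(1, round(len(text) / chars_per_token))

snippet = "def add(a, b):\n    return a + b\n"
print(estimate_tokens(snippet))  # → 9 (32 chars / 3.5)
```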
LLM limitations in the Copilot context
- Context window size. The LLM has a maximum prompt size. In a large project, not everything fits — Copilot operates on a partial view.
- Training data cutoff. Libraries, API changes, and security patches after the cutoff are unknown to the model.
- Non-determinism. Same prompt can produce different answers (influenced by the temperature parameter).
- No real understanding. The model produces statistically plausible code, not semantically validated code.
- No memory between sessions (by default). Without features like Spaces or persistent sessions, every new session starts blank.
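Non-determinism comes from sampling: the model draws the next token from a probability distribution, and the temperature parameter flattens or sharpens that distribution. The toy distribution below is invented for illustration; it only shows the mechanism, not Copilot's actual decoding settings.

```python
# Toy temperature sampling: the same "prompt" (fixed logits) yields
# different tokens across draws, which is why outputs vary.
import math
import random

def sample(logits: dict[str, float], temperature: float,
           rng: random.Random) -> str:
    # Scale logits by temperature, then softmax into probabilities.
    scaled = {t: v / temperature for t, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}
    r, acc = rng.random(), 0.0
    for token, p in probs.items():
        acc += p
        if r <= acc:
            return token
    return token  # numerical-edge fallback

logits = {"return": 2.0, "print": 1.0, "raise": 0.1}  # invented distribution
rng = random.Random(0)
samples = {sample(logits, temperature=1.5, rng=rng) for _ in range(50)}
print(len(samples) > 1)  # higher temperature → more varied tokens
```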
Copilot-specific limits
| Limit | Description |
|---|---|
| No internet access | Cannot fetch live data from the web |
| No database access | Cannot read databases directly (unless via MCP) |
| No permission checks | Does not verify whether the user is authorized to use specific APIs or data |
| No memory by default | No automatic memory across sessions |
| Variable language support | Some programming and markup languages have stronger support than others |
Exam-ready checklist (M03)
- You know the data Copilot collects into context.
- You can describe the data flow from IDE to LLM and back.
- You know which plans do not use data for model training: Business and Enterprise.
- You know all the steps of the suggestion lifecycle.
- You can name the three big LLM limits: context window, training cutoff, non-determinism.
