Module 3 — Data and Architecture
Exam tactic. This module is more technical than the others. The exam asks about data flow and LLM limitations. A good study habit: sketch the data flow on paper and memorize the names of every limitation.
L01 — Data flow and processing
What data does GitHub Copilot use?
Copilot collects context to build a prompt for the LLM. Sources of context:
| Source | Description |
|---|---|
| Active file | Code around the cursor |
| Open files | Other files open in the IDE (limited) |
| Recent edits | Recently edited files and changes in the current session |
| Comments | Comments adjacent to code |
| Imports / requires | Top of the file (language dependencies) |
| # references | Sources the user added manually |
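The sources above can be pictured as a simple structure the extension fills in before building the prompt. This is purely an illustrative sketch; the real extension's data structures are internal and not a public API.

```python
# Hypothetical sketch of the kinds of context Copilot gathers.
# The keys mirror the table above; the real gathering logic is internal.
context = {
    "active_file": "code around the cursor",
    "open_files": ["other open tabs (limited)"],
    "recent_edits": ["recently changed snippets"],
    "comments": ["adjacent comments"],
    "imports": ["import lines at the top of the file"],
    "references": ["files the user attached with # references"],
}

# The prompt context is assembled from these pieces
# (order and size limits are internal to Copilot).
prompt_context = "\n".join(
    part
    for value in context.values()
    for part in ([value] if isinstance(value, str) else value)
)
print(prompt_context.splitlines()[0])  # → code around the cursor
```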
How is the data shared?
- Copilot Free, Pro, and Pro+: GitHub may use prompts and suggestions to improve the model (opt-out available in user settings).
- Copilot Business: prompts and suggestions are not used for model training.
- Copilot Enterprise: same as Business, plus broader governance and privacy controls.
Key exam point. Business and Enterprise do not use data for training. This is one of the most frequently tested facts on GH-300.
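For revision, the plan rules above reduce to a small lookup table. The policy strings are paraphrased from this module; always check the current GitHub documentation, since plan terms can change.

```python
# Data-sharing rules per plan, paraphrased from the notes above.
TRAINING_POLICY = {
    "Free": "may be used to improve the model (opt-out in user settings)",
    "Pro": "may be used to improve the model (opt-out in user settings)",
    "Pro+": "may be used to improve the model (opt-out in user settings)",
    "Business": "never used for model training",
    "Enterprise": "never used for model training",
}

def used_for_training(plan: str) -> bool:
    """True if prompts/suggestions on this plan may feed model training."""
    return TRAINING_POLICY[plan].startswith("may")

# The key exam fact: Business and Enterprise never train on your data.
assert not used_for_training("Business")
assert not used_for_training("Enterprise")
assert used_for_training("Free")
```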
Prompt building pipeline
1. User types code or sends a Chat message
2. Copilot extension gathers context (active file, open files, history)
3. Context is tokenized (turned into numbers the LLM understands)
4. Prompt is assembled: system instructions + context + user input
5. Prompt is sent to the GitHub Copilot service over HTTPS
6. Service forwards the prompt to the LLM
7. LLM generates a response
8. Proxy filters the response (safety filters)
9. Filtered response is returned to the IDE
10. IDE displays the suggestion to the user
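The ten steps above can be sketched end to end in a few lines. Every function here is a stand-in: the real service, proxy, and model are GitHub-internal, and the transport steps (HTTPS in both directions) are omitted.

```python
# Toy walk-through of the prompt-building pipeline. All functions are
# placeholders; nothing here is Copilot's actual implementation.
def gather_context(ide_state: dict) -> str:
    # Steps 1-2: collect active file and open files from the IDE.
    return ide_state["active_file"] + "\n" + "\n".join(ide_state["open_files"])

def tokenize(text: str) -> list[int]:
    # Step 3: real tokenizers map subwords to ids; raw bytes stand in here.
    return list(text.encode("utf-8"))

def build_prompt(system: str, context: str, user_input: str) -> str:
    # Step 4: system instructions + context + user input.
    return f"{system}\n{context}\n{user_input}"

def llm(prompt: str) -> str:
    # Steps 6-7: placeholder model output.
    return "suggested_code()"

def post_filter(suggestion: str) -> str:
    # Step 8: duplication / security / content checks would run here.
    return suggestion

ide_state = {"active_file": "def f():", "open_files": ["import os"]}
context = gather_context(ide_state)
tokens = tokenize(context)
prompt = build_prompt("You are Copilot.", context, "complete f")
suggestion = post_filter(llm(prompt))
print(suggestion)  # step 10: displayed in the IDE → suggested_code()
```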
The context window is finite. Copilot prioritizes:
- Code closest to the cursor gets the most weight.
- Semantically relevant code is prioritized.
- Older or more distant code is dropped if needed.
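"Closest to the cursor wins" can be illustrated with a toy truncation pass: rank lines by distance from the cursor line and keep only what fits a character budget. The real prioritization also weighs semantic relevance and works on tokens, not characters; this sketch shows only the distance rule.

```python
# Hypothetical illustration of context-window truncation:
# nearest-to-cursor lines are kept, distant lines are dropped.
def fit_window(lines: list[str], cursor_line: int, budget: int) -> list[str]:
    ranked = sorted(range(len(lines)), key=lambda i: abs(i - cursor_line))
    kept, used = set(), 0
    for i in ranked:
        if used + len(lines[i]) > budget:
            break  # older / more distant code is dropped
        kept.add(i)
        used += len(lines[i])
    return [lines[i] for i in sorted(kept)]  # preserve original order

lines = ["import os", "def helper():", "    pass", "def main():", "    x = 1"]
print(fit_window(lines, cursor_line=4, budget=25))
# → ['def main():', '    x = 1']
```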
Proxy filtering
Pre-processing (before the LLM):
- Content Exclusions check — excluded files are stripped.
- Privacy filters — possible sensitive data is detected.
Post-processing (after the LLM):
- Duplication Detection — checks whether the suggestion resembles known public code.
- Security filters — obvious vulnerabilities are flagged.
- Content filters — inappropriate or harmful content is removed.
Data flow — memorize this diagram
IDE (context) → Tokenize → Prompt build → HTTPS → GitHub service
→ Pre-processing (exclusions, privacy) → LLM → Post-processing
(duplication, security) → HTTPS → IDE (suggestion)
L02 — Suggestion lifecycle and LLM limitations
Suggestion lifecycle, step by step
- Trigger — typing (inline), Chat message, or action (Agent / Plan Mode).
- Context gathering — extension collects code, comments, imports, open files.
- Tokenization — context becomes tokens (~3–4 characters each).
- Prompt send — over HTTPS, with auth and subscription validation.
- LLM inference — model generates a token response.
- Filtering — duplication, security, content filters.
- Display in IDE — inline grey text or chat message.
- User decision — accept (Tab) or reject (Esc); anonymous feedback may be collected (individual plans).
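The ~3-4 characters-per-token rule of thumb from the tokenization step gives a quick way to estimate how much context fits a window. Real tokenizers (BPE-based) vary by language and content, so treat this as a rough heuristic only.

```python
# Rough token estimate using the ~3-4 chars/token rule of thumb above.
def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    return max(1, round(len(text) / chars_per_token))

snippet = "def add(a, b):\n    return a + b\n"
print(estimate_tokens(snippet))  # → 9 (32 chars / 3.5)
```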
LLM limitations in the Copilot context
- Context window size. The LLM has a maximum prompt size. In a large project, not everything fits — Copilot operates on a partial view.
- Training data cutoff. Libraries, API changes, and security patches after the cutoff are unknown to the model.
- Non-determinism. Same prompt can produce different answers (influenced by the temperature parameter).
- No real understanding. The model produces statistically plausible code, not semantically validated code.
- No memory between sessions (by default). Without features like Spaces or persistent sessions, every new session starts blank.
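Non-determinism comes from sampling: the model draws the next token from a probability distribution, and the temperature parameter flattens or sharpens that distribution. The toy distribution below is invented for illustration; it only shows the mechanism, not Copilot's actual decoding settings.

```python
# Toy temperature sampling: the same "prompt" (fixed logits) yields
# different tokens across draws, which is why outputs vary.
import math
import random

def sample(logits: dict[str, float], temperature: float,
           rng: random.Random) -> str:
    # Scale logits by temperature, then softmax into probabilities.
    scaled = {t: v / temperature for t, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}
    r, acc = rng.random(), 0.0
    for token, p in probs.items():
        acc += p
        if r <= acc:
            return token
    return token  # numerical-edge fallback

logits = {"return": 2.0, "print": 1.0, "raise": 0.1}  # invented distribution
rng = random.Random(0)
samples = {sample(logits, temperature=1.5, rng=rng) for _ in range(50)}
print(len(samples) > 1)  # higher temperature → more varied tokens
```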
Copilot-specific limits
| Limit | Description |
|---|---|
| No internet access | Cannot fetch live data from the web |
| No database access | Cannot read databases directly (unless via MCP) |
| No permission checks | Does not verify whether the user is authorized to use specific APIs or data |
| No memory by default | No automatic memory across sessions |
| Variable language support | Some programming and markup languages have stronger support than others |
Exam-ready checklist (M03)
- You know the data Copilot collects into context.
- You can describe the data flow from IDE to LLM and back.
- You know which plans do not use data for model training: Business and Enterprise.
- You know all the steps of the suggestion lifecycle.
- You can name the three big LLM limits: context window, training cutoff, non-determinism.
