📌 Snapshot
- Data is the foundational concept here — what it is, why it matters, and the difference between raw data and processed information.
- Data is classified into structured and unstructured types, a distinction CUET directly tests through scenario-based identification questions.
- The data lifecycle — collection, storage, and processing — comes with real-world examples, making it a rich source of application-based MCQs.
- Five statistical techniques matter (Mean, Median, Mode, Range, Standard Deviation), each individually testable as a direct formula or scenario question.
- NTA favours "which statistical technique to use" scenario questions and definition-based discrimination between structured/unstructured data and measures of central tendency vs. variability.
📖 Detailed Notes
2.1 Core concepts
- Data defined: Data is a collection of characters, numbers, and other symbols that represent values of some situations or variables. The singular of "data" is "datum". Data need to be gathered, processed, and analysed for making decisions. (NCERT §5.1, p. 82)
- Knowledge base: A knowledge base is a store of information consisting of facts, assumptions, and rules which an AI system can use for decision making. (NCERT §5.1, p. 82 sidebar)
- Importance of data: Large amounts of data, when processed with a computer, reveal possibilities or hidden traits not otherwise visible. Examples include ATM transactions, meteorological satellite monitoring, dynamic pricing by airlines and cab apps, and market analysis by businesses. (NCERT §5.1.1, p. 82–83)
- Examples of data: Name/age/gender/contact details; banking and ticketing transaction data; images, graphics, audio, video; documents and web pages; online posts and messages; signals from sensors; satellite and meteorological data. (NCERT §5.1, p. 82)
- Structured data: Data organised in a well-defined format, usually stored in tabular (rows and columns) format where each column is an attribute/characteristic/variable and each row is an observation. Example: inventory table with columns ModelNo, ProductName, UnitPrice, Discount(%), Items_in_Inventory. (NCERT §5.1.2(A), p. 84)
- Unstructured data: Data not in a fixed row-and-column structure. Examples include web pages, text documents, business reports, books, audio/video files, social media messages. Unstructured data are sometimes described with the help of metadata. (NCERT §5.1.2(B), p. 85)
- Metadata: Data about data. For an email: subject, recipient, main body, attachment. For an image file: image size (KB/MB), image type (JPEG, PNG), image resolution. (NCERT §5.1.2(B), p. 85)
- Data collection: Identifying already-available data or collecting from appropriate sources. Data may exist in a diary/register (needs digitising), already in a digital file such as CSV, or may need a new software system to record it. (NCERT §5.2, p. 85–86)
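- CSV: Comma Separated Values, a plain-text file format in which each line holds one record and the values are separated by commas. NCERT cites it as an example of data that is already available in digital form and ready for use. (NCERT §5.2, p. 85)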
- Data storage: The process of storing data on storage devices so that data can be retrieved later. Common storage devices include Hard Disk Drive (HDD), Solid State Drive (SSD), CD/DVD, Tape Drive, Pen Drive, Memory Card. File processing limitations can be overcome through DBMS. (NCERT §5.3, p. 86)
- Data processing: Raw data (numbers/text/image) is transformed through processing into information (tables/charts/text). The Data Process Cycle has three stages — Input (Data Collection, Data Preparation, Data Entry), Processing (Store, Retrieve, Classify, Update), Output (Reports, Results, Processing System). (NCERT §5.4, p. 87, Figure 5.1)
- Measures of central tendency: A single value that gives an idea about the data. The three most common measures are Mean, Median, and Mode. (NCERT §5.5.1, p. 88)
- Mean: Average of numeric values of an attribute. Formula: sum of all n values divided by n. Mean is not suitable when there are outliers in the data. (NCERT §5.5.1(A), p. 88)
- Outlier: An exceptionally large or small value compared to other values; usually considered an error that can influence average calculations. (NCERT §5.5.1(A), p. 88 note)
- Median: When all values are sorted in ascending or descending order, the middle value is the Median. For an odd number of values it is the value at the middle position; for an even number of values it is the average of the two middle values. Median represents the central value at which the given data is equally divided into two parts. (NCERT §5.5.1(B), p. 89)
- Mode: The value that appears the most number of times in the given data. It is computed on the basis of frequency of occurrence. A dataset has no mode if each value occurs only once. There may be multiple modes if more than one value shares the highest frequency. Mode can be found for numeric as well as non-numeric data. (NCERT §5.5.1(C), p. 89)
- Measures of variability (dispersion): Refer to the spread or variation of values around the mean. Two common measures are Range and Standard Deviation. Two data sets can have the same mean/median/mode but completely different levels of dispersion. (NCERT §5.5.2, p. 89)
- Range: Difference between the maximum and minimum values (M − S, where M is the largest value and S is the smallest). Can be calculated only for numerical data. Badly influenced by outliers since it uses only the two extreme values. (NCERT §5.5.2(A), p. 90)
- Standard deviation (σ): Positive square root of the average of the squared difference of each value from the mean. Considers all given data values (unlike Range). Smaller σ means less spread; larger σ means more spread. Formula: σ = sqrt(Σ(xi − x̄)² / n). (NCERT §5.5.2(B), p. 90)
- Python for data analysis: Python has libraries specially built for data processing and analysis. It is one programming tool used for efficient analysis of large volumes of data using statistical techniques. (NCERT §5.5, p. 91)
- Why averages can mislead (NCERT §5.5.1(A), p. 88). If one student in a class of 30 scores 100 and the rest score around 50, the mean is pulled above 50 (to about 51.7 if the other 29 score exactly 50) even though almost every student is at 50. This is the classic outlier-distortion scenario, and the distortion grows with the size of the extreme value. Switch to the median in such cases. The mean is best when the data is symmetric and outlier-free.
- Mode for categorical data (NCERT §5.5.1(C), p. 89). The mode is the only one of the three measures of central tendency that meaningfully applies to categorical/non-numeric data, e.g., the most popular favourite colour in a class, the most viewed video category. Mean and median are inherently numeric.
- Range vs IQR (NCERT §5.5.2(A), p. 90). The NCERT only introduces range, not interquartile range; CUET sticks to NCERT scope. Range = max − min. Range can be zero when all values are identical (no spread).
- Standard deviation interpretation (NCERT §5.5.2(B), p. 90). Smaller σ means data points cluster around the mean; larger σ means high spread. σ uses every data point — that is why it is preferred over range as a measure of dispersion.
- Data processing examples (NCERT §5.4, Figure 5.2, p. 87). Three canonical cases: competitive-exam website (input = candidate details/test answers, output = score/rank report), bank ATM withdrawal (input = card + PIN + amount, output = receipt + cash), train ticket issue (input = passenger details/source/destination, output = ticket + seat assignment). These illustrate the Input-Processing-Output (IPO) cycle.
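The five statistical techniques in these notes can be tried out with Python's standard-library `statistics` module. The following is a minimal sketch rather than code from the NCERT text; the variable names are illustrative, and the values are the nine student heights quoted later in these notes for the standard deviation example.

```python
import statistics

# Heights (in cm) of nine students, the dataset quoted in these notes.
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

mean_value = statistics.mean(heights)      # sum of all values / number of values
median_value = statistics.median(heights)  # middle value after sorting
mode_value = statistics.mode(heights)      # most frequent value
range_value = max(heights) - min(heights)  # largest value minus smallest value
# pstdev divides by n (population formula), matching sqrt(sum((xi - mean)^2) / n)
sigma = statistics.pstdev(heights)

print(f"Mean               : {mean_value:.2f}")
print(f"Median             : {median_value}")
print(f"Mode               : {mode_value}")
print(f"Range              : {range_value}")
print(f"Standard deviation : {sigma:.1f}")
```

Note that `statistics.stdev` (without the `p`) divides by n − 1 instead of n, so it gives a slightly larger value than the formula used in these notes.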
2.2 Definitions to memorise
| Term | Definition | Page |
|---|---|---|
| Data | A collection of characters, numbers, and other symbols that represent values of some situations or variables | 82 |
| Datum | Singular of data | 82 |
| Knowledge base | A store of information consisting of facts, assumptions, and rules which an AI system can use for decision making | 82 |
| Structured data | Data organised in a well-defined, tabular (rows and columns) format | 84 |
| Unstructured data | Data not in a fixed row-and-column structure (e.g., web pages, audio/video, social media messages) | 85 |
| Metadata | Data about data (e.g., image size, image type, email subject/recipient) | 85 |
| CSV | Comma Separated Values; a common digital file format for storing data | 85 |
| Data storage | Process of storing data on storage devices so it can be retrieved later | 86 |
| DBMS | Database Management System; overcomes limitations of file processing | 86 |
| Data processing | Transformation of raw data into useful information through the Input-Processing-Output cycle | 87 |
| Mean | Average of numeric values; sum of all values divided by total number of values | 88 |
| Outlier | An exceptionally large or small value compared to other data values; influences average calculations | 88 |
| Median | Middle value when data is sorted in ascending or descending order | 89 |
| Mode | Value that appears the most number of times in the data; applicable to numeric and non-numeric data | 89 |
| Range | Difference between the maximum value (M) and the minimum value (S), i.e., M − S; measure of dispersion for numerical data only | 90 |
| Standard deviation (σ) | Positive square root of the average of the squared differences of each value from the mean | 90 |
| Information | Output of processing data; usable form such as tables, charts, or reports | 87 |
| Attribute | A column in structured data representing a characteristic/variable | 84 |
| Observation | A row in structured data representing a single record | 84 |
| Dispersion / Variability | How spread out data values are around the mean | 89 |
| Measure of central tendency | A single representative value for a dataset (mean, median, or mode) | 88 |
| Data Process Cycle | Input → Processing → Output sequence | 87 |
| HDD | Hard Disk Drive — magnetic secondary storage device | 86 |
| SSD | Solid State Drive — flash-based secondary storage | 86 |
| Tape Drive | Magnetic tape storage device used for large data backups | 86 |
| File processing | Storing data in flat files; limitations overcome by DBMS | 86 |
| Variance | The arithmetic mean of the squared deviations from the mean (σ²); its positive square root is the standard deviation | 90 |
| Frequency | Number of times a value appears in a dataset; basis for computing mode | 89 |
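To make the terms attribute, observation, and CSV concrete, here is a small sketch using Python's built-in `csv` module. The column names follow the inventory table described in these notes; the three rows and their values are invented purely for illustration, and the CSV text is held in a string so the example runs without any file on disk.

```python
import csv
import io

# Hypothetical CSV text: column names follow the inventory table in the notes,
# but the row values are made up for this illustration.
csv_text = """ModelNo,ProductName,UnitPrice,Discount(%),Items_in_Inventory
M001,Pressure Cooker,1200,5,12
M002,Frying Pan,650,10,30
M003,Kettle,900,0,8
"""

reader = csv.DictReader(io.StringIO(csv_text))
attributes = reader.fieldnames   # column names = attributes
observations = list(reader)      # each row = one observation

print("Attributes:", attributes)
print("Number of observations:", len(observations))

# Processing step: raw values (read as text) become information (a total).
total_value = sum(
    int(row["UnitPrice"]) * int(row["Items_in_Inventory"])
    for row in observations
)
print("Total inventory value:", total_value)
```

Every value read from a CSV file arrives as a string, which is why the sketch converts `UnitPrice` and `Items_in_Inventory` with `int()` before doing arithmetic.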
2.3 Diagrams / processes to remember
- Figure 5.1 — Steps in Data Processing (p. 87): Shows two diagrams. First: RAW DATA (Numbers/Text/Image) → Data Processing → INFORMATION (In the form of table/chart/text). Second: Data Process Cycle with Input (Data Collection, Data Preparation, Data Entry) → Processing (Store, Retrieve, Classify, Update) → Output (Reports, Results, Processing System).
- Figure 5.2 — Data Based Problem Statements (p. 87): Three real-world scenarios (competitive exam website, bank ATM withdrawal, train ticket issue) each broken into Problem Statement, Inputs, Processing steps, and Output. Useful for identifying input-processing-output in application questions.
- Table 5.1 — Structured data about kitchen items in a shop (p. 84): Columns are ModelNo, ProductName, UnitPrice, Discount(%), Items_in_Inventory. Illustrates the attribute (column) and observation (row) structure of structured data.
- Table 5.3 — Standard deviation of heights of 9 students (p. 91): Step-by-step calculation of σ for the height dataset [90, 102, 110, 115, 85, 90, 100, 110, 110] giving σ = 10.2 cm. Memorise the procedural steps: find the mean, subtract the mean from each value, square each difference, sum the squares, divide by n, take the square root.
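The procedural steps for standard deviation can be followed one at a time in plain Python. This is a sketch of the arithmetic only, assuming the population formula (division by n) stated in these notes; the variable names are illustrative.

```python
import math

# Heights (in cm) of nine students, as quoted in the notes.
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

n = len(heights)
mean = sum(heights) / n                     # Step 1: find the mean
deviations = [x - mean for x in heights]    # Step 2: subtract the mean from each value
squared = [d ** 2 for d in deviations]      # Step 3: square each difference
total = sum(squared)                        # Step 4: sum the squares
variance = total / n                        # Step 5: divide by n
sigma = math.sqrt(variance)                 # Step 6: take the positive square root

print(f"Mean                 : {mean:.2f}")
print(f"Sum of squared diffs : {total:.2f}")
print(f"Variance             : {variance:.2f}")
print(f"Standard deviation   : {sigma:.1f}")
```

The intermediate value in Step 5 is the variance, so the same steps cover both definitions.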
2.4 Common confusions / NTA trap points
- Mean vs. Median for outliers: Mean is sensitive to outliers (one extreme value distorts it); Median is not. NTA often presents a dataset with an extreme value and asks which measure is more appropriate — answer is Median.
- Range vs. Standard deviation: Both measure variability, but Range uses only two values (max and min) and is therefore badly affected by a single outlier. Standard deviation uses all values. A trap question asks which is a "better" or "more reliable" measure of dispersion — standard deviation is preferred.
- Mode for non-numeric data: Mean and Range can be calculated only for numeric data; Mode can be applied to both numeric and non-numeric data. A favourite NTA trap is to claim Mode is only for numbers.
- Structured vs. Unstructured misclassification: Students often label email body as structured data. The body is unstructured; the metadata (subject, recipient, attachment) gives it partial structure. Newspaper layout, tweets, audio files are all unstructured.
- No mode vs. multiple modes (NCERT §5.5.1(C), p. 89). A dataset where every value is unique has no mode. If two or more values share the highest frequency, the dataset has multiple modes. NTA may assert "a dataset always has exactly one mode" — this is false.
- Mean has no necessary connection to any actual data point (NCERT §5.5.1(A), p. 88). The mean might not appear anywhere in the dataset. NTA distractor: claims mean must be one of the observations.
- Range only on numerical data (NCERT §5.5.2(A), p. 90). Cannot compute range on categorical data. NTA tests this nuance.
- CSV is just a file format, not a data type (NCERT §5.2, p. 85). It is a structured representation but the values inside may be numbers or strings.
- Information ≠ Data (NCERT §5.4, p. 87). Information is the output of processing data — they are not interchangeable.
- DBMS overcomes file processing limitations (NCERT §5.3, p. 86). Flat files have redundancy/consistency issues; DBMS resolves them.
- Median splits the data 50/50 (NCERT §5.5.1(B), p. 89). Half of all values are ≤ median and half ≥ median.
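Several of these trap points can be checked numerically. The sketch below uses small made-up datasets (the marks, the outlier value 500, and the colour list are all hypothetical) and a helper named `modes`, written here for illustration, that follows the convention in these notes: no mode when every value occurs only once, several modes when the highest frequency is shared.

```python
from collections import Counter
import statistics


def modes(values):
    """Return every value with the highest frequency.

    Returns an empty list when each value occurs only once (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


# Hypothetical marks, then the same marks with one extreme value added.
marks = [48, 50, 52, 49, 51]
marks_with_outlier = marks + [500]

print(statistics.mean(marks), statistics.median(marks))
print(statistics.mean(marks_with_outlier), statistics.median(marks_with_outlier))
print(max(marks) - min(marks), max(marks_with_outlier) - min(marks_with_outlier))

# Mode works on non-numeric data; mean and range do not.
colours = ["red", "blue", "red", "green", "blue"]
print(modes(colours))    # two modes share the highest frequency
print(modes([1, 2, 3]))  # every value is unique, so there is no mode
```

One extreme value moves the mean from 50 to 125 and the range from 4 to 452, while the median only shifts from 50 to 50.5. Python's own `statistics.multimode` returns every value when all frequencies are equal, which is why the sketch uses a separate helper to match the "no mode" convention.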
🎯 Practice MCQs
Q1. Which of the following is the correct definition of "data" as given in the NCERT chapter?
Answer: B
Option B is the textbook definition of data. Option C describes a knowledge base, and option D defines median — two classic distractors drawn from the same chapter.
Q2. Consider the following statements about structured and unstructured data: **Statement I:** Structured data is stored in a tabular format where each column represents an attribute and each row represents an observation. **Statement II:** Unstructured data can sometimes be described with the help of metadata such as image size, image type, and image resolution. Which of the above statements is/are correct?
Answer: C
Both statements are directly stated in the NCERT text. Statement I captures the row-column structure definition; Statement II is explicitly given for image metadata. Both are correct, making C the answer.
Q3. A school teacher records the following marks obtained by 7 students in a unit test: **45, 60, 55, 60, 70, 45, 60** What is the Mode of this dataset?
Answer: C
60 appears 3 times, which is more than any other value (45 appears twice, all others once). By definition, mode is the value with the highest frequency of occurrence.
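The frequency count behind this answer can be reproduced with `collections.Counter` from Python's standard library. This is only an illustrative sketch; the variable names are not from the source.

```python
from collections import Counter

# Marks of the 7 students from Q3.
marks = [45, 60, 55, 60, 70, 45, 60]

frequency = Counter(marks)  # maps each value to its number of occurrences
for value, count in sorted(frequency.items()):
    print(value, "occurs", count, "time(s)")

# most_common(1) returns the (value, count) pair with the highest frequency.
mode_value, mode_count = frequency.most_common(1)[0]
print("Mode:", mode_value)
```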
Q4. Match the following storage devices in Column A with their correct abbreviations in Column B, as listed in NCERT Chapter 5:
| Column A | Column B |
|---|---|
| (i) Hard Disk Drive | (P) SSD |
| (ii) Solid State Drive | (Q) HDD |
| (iii) A removable flash storage device | (R) CD/DVD |
| (iv) Optical disc storage | (S) Pen Drive |
Answer: A
The NCERT lists HDD for Hard Disk Drive, SSD for Solid State Drive, Pen Drive as a removable flash storage device, and CD/DVD as optical disc storage. Only option A maps all four correctly.
Q5. **Assertion (A):** Mean is not a suitable measure of central tendency when there are outliers in the data. **Reason (R):** An outlier is an exceptionally large or small value that can influence the average or other statistical calculations based on the data.
Answer: A
Both statements are explicitly present in the NCERT text, and the definition of outlier directly explains why mean is distorted — R correctly and causally explains A.
Q6. The heights (in cm) of 9 students are: 90, 102, 110, 115, 85, 90, 100, 110, 110. A student wants to find a measure that is NOT affected by the extreme value of 85 cm. Which measure should be chosen, and what is its value for this dataset?
Answer: B
Median is the appropriate choice when outliers are present because it is not distorted by extreme values. Sorting the data in ascending order gives [85, 90, 90, 100, 102, 110, 110, 110, 115]; the middle (5th) value is 102 cm. Mean (option A) is distorted by outliers; Range and Standard Deviation are measures of variability, not central tendency.
Q7. The mean of [10, 20, 30, 40, 50]:
Answer: B
Q8. Median of the sorted dataset [3, 5, 7, 9, 11, 13]:
Answer: B
Even count: the median is the average of the two middle values, (7 + 9) / 2 = 8.
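The odd and even rules for the median can be written as a short function. This is a hand-written sketch for illustration (Python's `statistics.median` does the same job); it is applied to the datasets from Q8 and Q6.

```python
def median(values):
    """Return the middle value of the sorted data.

    Odd count: the single middle value.
    Even count: the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


q8_values = [3, 5, 7, 9, 11, 13]                          # even count (6 values)
q6_heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]   # odd count (9 values)

print(median(q8_values))   # 8.0
print(median(q6_heights))  # 102
```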
Q9. Which is a measure of variability?
Answer: D
Q10. Which file format is given as a typical ready-to-use digital file in §5.2?
Answer: B
Q11. The range of [12, 9, 20, 7, 18]:
Answer: B
20 − 7 = 13.
Q12. Which is NOT structured data?
Answer: C
Q13. The Python library mentioned for data analysis is:
Answer: B
Q14. Assertion (A): Standard deviation is preferred over range as a measure of dispersion. Reason (R): Standard deviation considers all data values, while range uses only the maximum and minimum.
Answer: A
Q15. Metadata for an image file includes:
Answer: B