📌 Snapshot
- Pandas DataFrame operations extend to advanced statistical analysis, data aggregation, sorting, reshaping, and database connectivity — all high-frequency CUET topics.
- Descriptive statistics (max, min, mean, median, mode, variance, std, quartile) are tested both as direct recall ("which function returns the median?") and as code-output tracing.
- GROUP BY, pivot, and pivot_table are conceptually rich topics that NTA tests through scenario-based questions requiring students to predict output or identify the correct function.
- Handling missing values (NaN) — detecting with isnull(), dropping with dropna(), and filling with fillna() — is a dedicated exam-ready subtopic.
- The MySQL import/export functions (read_sql_query, read_sql_table, to_sql) test knowledge of the pymysql/sqlalchemy connection workflow.
📖 Detailed Notes
2.1 Core concepts
- Introduction and Case Study: Pandas is a Python library for data manipulation and analysis. A marks DataFrame (
marksUT) of 4 students (Raman, Zuhaire, Ashravy, Mishti) across 3 unit tests and 5 subjects (Maths, Science, S.St, Hindi, Eng) is the running example throughout all sections. (NCERT §3.1, p. 63–64) - DataFrame.max(): Returns the maximum value of each column. By default includes all data types (Name column returns alphabetically maximum value). Pass
numeric_only=Trueto restrict to numeric columns. Passaxis=1to get row-wise maximum. (NCERT §3.2.1, p. 65–66) - DataFrame.min(): Returns the minimum value of each column or row. Works identically to max() — supports
numeric_only=Trueandaxis=1parameters. (NCERT §3.2.2, p. 67) - DataFrame.sum(): Returns the sum of all values per column, regardless of datatype. For text columns it concatenates strings. To get the sum of a specific column, specify the column name:
df['Maths'].sum(). For row-wise sum useaxis=1. (NCERT §3.2.3, p. 68–69) - DataFrame.count(): Returns the total number of non-null values in each column (default axis=0). Use
axis=1to count values per row. Unlike sum/mean, count works on all data types and does not skip NaN automatically — it counts only present values. (NCERT §3.2.4, p. 69–70) - DataFrame.mean(): Returns the mean (average) of numeric columns only.
df.mean()gives column-wise averages;axis=1gives row-wise averages. (NCERT §3.2.5, p. 70–71) - DataFrame.median(): Returns the middle value of numeric data. When the count of values is even, the median is the average of the two middle values. Only applicable to numeric values. (NCERT §3.2.6, p. 71–72)
- DataFrame.mode(): Returns the value that appears the most number of times in the data. The mode is defined as the most frequent value. Applicable per column. (NCERT §3.2.7, p. 72–73)
- DataFrame.quantile(): Divides data into four parts. By default outputs the second quartile (median, q=0.5). Use
q=0.25for Q1 (25th percentile) andq=0.75for Q3 (75th percentile). Can accept a list:quantile([0.25, 0.75]). (NCERT §3.2.8, p. 73–74) - DataFrame.var(): Returns the variance — the average of squared differences from the mean. Applicable to numeric columns only. (NCERT §3.2.9, p. 74–75)
- DataFrame.std(): Returns standard deviation, calculated as the square root of variance. (NCERT §3.2.10, p. 75)
- DataFrame.describe(): Displays all descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) in a single command for all numeric columns. (NCERT §3.2.10, p. 75)
- Note on axis parameter: In most Pandas statistical functions,
axis=0(default) gives column-wise output andaxis=1gives row-wise output. This is the reverse of most other Python operations where axis=0 is rows. (NCERT §3.2.1 Note, p. 67) - Data Aggregations (§3.3): Aggregation transforms a dataset into a single numeric value. The
aggregate()oragg()function applies one or more aggregate functions (max, min, sum, count, std, var) to columns. Multiple functions can be applied at once:df.aggregate(['max','count']). Theaxisparameter works the same way. (NCERT §3.3, p. 75–76) - Sorting a DataFrame (§3.4):
DataFrame.sort_values(by, axis=0, ascending=True)arranges data in ascending or descending order. Thebyparameter specifies the column(s) to sort on. Sorting on multiple columns: if two rows have the same value for the first sort column, the second column is used as a tiebreaker. Default is ascending order. (NCERT §3.4, p. 77–78) - GROUP BY Functions (§3.5):
DataFrame.GROUP BY(column)splits data into groups based on a criterion. It follows a split-apply-combine strategy: (1) Split the DataFrame into groups, (2) Apply a function to each group, (3) Combine results into a new DataFrame. Key methods on a GroupBy object:first(),size(),groups,get_group(name). Grouping by multiple columns is also possible. Aggregation functions can be applied on GroupBy objects usingagg(). (NCERT §3.5, p. 79–82) - Altering the Index (§3.6): By default, a numeric index starting from 0 is assigned. When a DataFrame is sliced, the index becomes non-continuous.
reset_index(inplace=True)creates a new continuous index while preserving the original. The original index column can be dropped withdrop(columns=['index'], inplace=True). A column can be set as the index usingset_index('column_name', inplace=True)and reverted usingreset_index('column_name', inplace=True). (NCERT §3.6, p. 82–84) - Reshaping Data — pivot() (§3.7.1A):
df.pivot(index=, columns=, values=)reshapes a DataFrame by using one column as the new row index, another as new column headers, and a third as cell values. Cells without matching entries are filled with NaN. Pivot fails with a ValueError if the index column has duplicate values. (NCERT §3.7.1A, p. 84–86) - Pivoting by Multiple Columns (§3.7.1B): Multiple column names can be passed to the
valuesparameter ofpivot(). Omittingvalueswill display all numeric columns in the pivoted form. (NCERT §3.7.1B, p. 86–87) - pivot_table() (§3.7.1C): Works like
pivot()but handles duplicate index entries by applying an aggregate function. Syntax:pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean'). The default aggregate function is mean. Theaggfuncparameter can accept a list of functions or a dictionary mapping columns to different functions. (NCERT §3.7.1C, p. 88–89) - Handling Missing Values (§3.8): A missing value is denoted by NaN. Two main strategies: (i) drop the row with missing value, (ii) fill or estimate the missing value. (NCERT §3.8, p. 89–90)
- Checking Missing Values (§3.8.1):
df.isnull()returns a boolean DataFrame — True where value is missing.df.isnull().any()returns True for each column that has at least one missing value.df.isnull().sum()gives the count of NaN per column.df.isnull().sum().sum()gives total NaN count in the entire DataFrame. (NCERT §3.8.1, p. 91–93) - Dropping Missing Values (§3.8.2):
df.dropna()removes any row containing at least one NaN. Useinplace=Trueto modify the original DataFrame.dropna(how='any')is the default (drop row if any NaN present). Using dropna reduces dataset size, so it should be used sparingly. (NCERT §3.8.2, p. 94–96) - Estimating/Filling Missing Values (§3.8.3):
df.fillna(num)replaces all NaN with the specified value.fillna(0)replaces with 0,fillna(1)with 1.fillna(method='pad')replaces a NaN with the value immediately before it (forward fill).fillna(method='bfill')replaces with the value immediately after it (backward fill). (NCERT §3.8.3, p. 96–97) - Import/Export between Pandas and MySQL (§3.9): Requires two libraries:
pymysql(database driver, install withpip install pymysql) andsqlalchemy(connection interface, install withpip install sqlalchemy). Thecreate_engine()function from sqlalchemy establishes a connection using the connection string:'mysql+pymysql://username:password@host:port/database_name'. Default MySQL port is 3306. (NCERT §3.9, p. 98–99) - Importing from MySQL to Pandas (§3.9.1): Three functions: (1)
pd.read_sql_query(query, sql_conn)— reads SQL query result into DataFrame; (2)pd.read_sql_table(table_name, sql_conn)— reads an entire table into DataFrame; (3)pd.read_sql(sql, sql_conn)— reads either a query or a table name. (NCERT §3.9.1, p. 99–100) - Exporting from Pandas to MySQL (§3.9.2):
DataFrame.to_sql(table, sql_conn, if_exists='fail', index=False/True). Theif_existsparameter accepts:'fail'(default — raises ValueError if table exists),'replace'(overwrites existing table),'append'(adds rows to existing table). Theindexparameter controls whether the DataFrame index is written to MySQL. (NCERT §3.9.2, p. 100–101)
2.2 Definitions to memorise
| Term | Definition | Page |
|---|---|---|
| Descriptive Statistics | Methods used to summarise data and get a basic idea about the dataset | 65 |
| max() | Returns the maximum value in each column (or row with axis=1) of a DataFrame | 65 |
| min() | Returns the minimum value in each column (or row with axis=1) of a DataFrame | 67 |
| sum() | Returns the sum of all values in each column of a DataFrame | 68 |
| count() | Returns the total number of non-null values in each column or row | 69 |
| mean() | Returns the arithmetic average of numeric values in each column | 70 |
| median() | Returns the middle value of sorted data; average of two middle values when count is even | 71 |
| mode() | Returns the value that appears the most number of times in the data | 72 |
| quantile() | Divides data into four parts; Q1=25%, Q2=50% (median), Q3=75% | 73 |
| variance (var()) | Average of squared differences from the mean | 74 |
| Standard Deviation (std()) | Square root of variance | 75 |
| describe() | Displays all descriptive statistics in a single command | 75 |
| Aggregation | Transforming a dataset to produce a single numeric value; functions: max, min, sum, count, std, var | 75 |
| sort_values() | Arranges DataFrame data in ascending or descending order by specified column(s) | 77 |
| GROUP BY | Splits data into groups based on criteria; follows split-apply-combine strategy | 79 |
| Altering the Index | Changing the row labels/indexes of a DataFrame using reset_index() or set_index() | 82 |
| Reshaping | Changing the shape (structure) of a DataFrame; done using pivot() or pivot_table() | 84 |
| pivot() | Reshapes DataFrame by using one column as index, another as column headers, and a third as values; fails with duplicate index entries | 85 |
| pivot_table() | Like pivot() but handles duplicate entries using an aggregate function (default: mean) | 88 |
| NaN | Not a Number — symbol used to denote a missing value in a DataFrame | 89 |
| isnull() | Returns a boolean DataFrame with True where values are missing (NaN) | 91 |
| dropna() | Removes entire rows containing any missing value from a DataFrame | 94 |
| fillna() | Replaces missing values with a specified number or using a method (pad/bfill) | 96 |
| pymysql | Python library (database driver) required to connect Python to MySQL | 98 |
| sqlalchemy | Python library that provides create_engine() for establishing MySQL connections | 98 |
| create_engine() | sqlalchemy function that establishes a connection to MySQL using a connection string | 98 |
| read_sql_query() | Reads result of an SQL query into a pandas DataFrame | 99 |
| read_sql_table() | Reads an entire MySQL table into a pandas DataFrame | 99 |
| to_sql() | Writes a pandas DataFrame to a MySQL table | 100 |
| Q1 (25th percentile) | First quartile — splits the lower 25% of data | 73 |
| Q3 (75th percentile) | Third quartile — splits the upper 25% of data | 73 |
numeric_only=True |
Argument restricting a statistical function to numeric columns | 65 |
Forward fill (pad) |
Fill NaN with the previous non-NaN value | 96 |
Backward fill (bfill) |
Fill NaN with the next non-NaN value | 96 |
set_index() |
DataFrame method that turns a column into the index | 83 |
reset_index() |
DataFrame method that creates a fresh integer index | 82 |
| split-apply-combine | The conceptual strategy behind GROUP BY | 79 |
agg() / aggregate() |
DataFrame method applying one or more aggregate functions | 75-76 |
| Default port (MySQL) | 3306 | 98 |
| Connection string | URL of form mysql+pymysql://user:pwd@host:port/db |
98 |
2.3 Diagrams / processes to remember
- Split-Apply-Combine diagram (Figure 3.1, p. 79): Shows a two-column DataFrame (key, data) being split into groups A, B, C; sum applied to each group; results combined into a new DataFrame. This is the conceptual backbone of GROUP BY.
- describe() output table (p. 75): Shows all statistics (count, mean, std, min, 25%, 50%, 75%, max) for every numeric column side by side — the only function that gives a full statistical summary in one command.
- pivot() vs pivot_table() difference (p. 85–88): pivot() fails on duplicate index values (raises ValueError), while pivot_table() handles duplicates through aggregation (default: mean).
- fillna() methods (p. 96–97):
fillna(0)replaces with zero;fillna(method='pad')replaces with the previous value;fillna(method='bfill')replaces with the next value — all three scenarios are tested in CUET.
2.4 Common confusions / NTA trap points
- axis=0 vs axis=1 in statistical functions: In Pandas statistical methods (max, min, mean, etc.), axis=0 (default) produces column-wise results and axis=1 produces row-wise results. NTA distractors often swap these — "axis=1 gives column-wise output" is a classic wrong option.
- pivot() vs pivot_table(): Students confuse when to use each. Key rule: if index column has duplicate values, pivot() raises ValueError and pivot_table() must be used. The default aggfunc in pivot_table() is mean, not sum.
- dropna() vs fillna(): dropna() removes the entire row (reduces dataset size), while fillna() preserves the row and substitutes a value. NTA sometimes presents scenarios where the wrong strategy is applied.
- isnull() vs any():
df.isnull()returns a full boolean DataFrame;df.isnull().any()returns one True/False per column;df.isnull().sum()returns count of NaN per column. Confusing these in code-output questions is a frequent error. - to_sql() if_exists parameter (NCERT §3.9.2, p. 100-101). 'fail' (default), 'replace', 'append'. 'overwrite' does NOT exist.
pivot_tabledefault aggfunc ismean(NCERT §3.7.1C, p. 88). Not sum. NTA distractor: claims sum is default.mode()returns a DataFrame, not a single value (NCERT §3.2.7, p. 72-73). Because data may be multimodal.std()≠var()(NCERT §3.2.9-10, p. 74-75). std is square root of variance.quantile()defaults to 0.5 (median) (NCERT §3.2.8, p. 73-74).describe()only summarises numeric columns (NCERT §3.2.10, p. 75).read_sql_queryvsread_sql_table(NCERT §3.9.1, p. 99). Former takes a query string; latter takes a table name.
🎯 Practice MCQs
First 3 questions free · create a free account to unlock the rest — answers & explanations included, no payment needed
Q1. Which of the following Pandas functions displays all descriptive statistical values (count, mean, std, min, quartiles, max) for all numeric columns in a single command?
▸ Show answer & explanation
Answer: C
`df.describe()` is the only built-in Pandas method that outputs all key descriptive statistics together. `df.aggregate()` applies specific functions you name; `df.summary()` and `df.stat()` do not exist in Pandas. ---
Q2. Consider a DataFrame `df`. What will `df.max(axis=1)` return?
▸ Show answer & explanation
Answer: B
In Pandas statistical functions, axis=1 means operate across columns to give a per-row result. axis=0 (default) gives per-column results. Option A describes the default axis=0 behaviour. ---
Q3. Which of the following statements about `DataFrame.mode()` is correct?
▸ Show answer & explanation
Answer: C
Mode is the most frequent value. Option A describes mean(), Option B describes median(). Option D is incorrect — mode() can operate on non-numeric columns as well. ---
🔒 12 more practice MCQs
Create a free account to unlock every MCQ in this chapter — answers and explanations included. No payment needed.
Already registered? Just log in and they'll all appear here.
Q4. A teacher wants to sort a DataFrame `dfUT3` (containing Unit Test 3 data) in ascending order of Science marks, but if two students have equal Science marks, sort them by Hindi marks in ascending order. Which statement achieves this?
▸ Show answer & explanation
Answer: B
Passing a list to the `by` parameter enables multi-column sorting. Default is ascending=True, so option B is correct. Option D would override the Science sort entirely with a fresh Hindi sort. ---
Q5. Read the following statements about GROUP BY in Pandas: **Statement I:** GROUP BY follows a split-apply-combine strategy. **Statement II:** The `get_group()` method is used to display the size of each group.
▸ Show answer & explanation
Answer: B
The split-apply-combine strategy is correctly described in Statement I. Statement II is wrong: `size()` displays the size (count of rows) of each group, while `get_group()` retrieves all rows belonging to a named group. ---
Q6. A DataFrame `df` has some missing values denoted by NaN. A student wants to find the total number of NaN values in the entire DataFrame. Which statement correctly gives this total count?
▸ Show answer & explanation
Answer: C
`df.isnull()` creates a boolean DataFrame; `.sum()` gives a Series with NaN count per column; a second `.sum()` adds all column counts into a single total. Option B gives per-column counts, not a single total. Option A gives True/False per column. ---
Q7. Match the following Pandas functions with their correct descriptions: | Function | Description | |---|---| | P. fillna(method='pad') | 1. Replaces missing value with the value that comes after it | | Q. fillna(0) | 2. Removes the entire row containing a missing value | | R. dropna() | 3. Replaces missing value with the value that comes before it | | S. fillna(method='bfill') | 4. Replaces missing value with zero |
▸ Show answer & explanation
Answer: A
P (pad) = forward fill = value before = 3; Q (0) = replace with zero = 4; R (dropna) = remove row = 2; S (bfill) = backward fill = value after = 1. ---
Q8. To export a pandas DataFrame `df` to a MySQL table named `'student_data'` such that if the table already exists its contents are completely replaced by the new DataFrame data, which statement is correct? (Assume `engine` is already created.)
▸ Show answer & explanation
Answer: C
`if_exists='replace'` overwrites the existing table. 'fail' raises ValueError if the table exists. 'append' adds rows without deleting existing ones. 'overwrite' is not a valid value for this parameter and does not exist in Pandas. ---
Q9. Default axis for `df.sum()` is:
▸ Show answer & explanation
Answer: A
Q10. Output: ```python import pandas as pd df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]}) print(df.sum(axis=1).tolist()) ```
▸ Show answer & explanation
Answer: B
Q11. Which function is required to reshape a DataFrame when duplicate index entries exist?
▸ Show answer & explanation
Answer: B
Q12. Which method REMOVES rows containing any NaN?
▸ Show answer & explanation
Answer: B
Q13. To connect Python to MySQL using Pandas, which two libraries are required?
▸ Show answer & explanation
Answer: A
Q14. Assertion (A): `df.fillna(method='pad')` propagates the last valid value forward. Reason (R): The pad method (also known as forward fill) replaces NaN with the value immediately preceding it.
▸ Show answer & explanation
Answer: A
Q15. Q1 of `[2, 4, 6, 8, 10, 12, 14, 16]` is:
▸ Show answer & explanation
Answer: B
Q1 is the 25th percentile — for 8 sorted values it falls between 4 and 6 (linear interp = 5).
📊 Previous-Year Questions
Practise with real CUET Computer Science previous-year papers — every question solved, with the correct answer and a step-by-step explanation.
View solved CUET PYQ papers →Ready to drill Computer Science?
Unlock all MCQs, chapter tests, mocks & PYQs for ₹199/year.
Get UniDrill Pro