Home / Computer Science / Class XII / Data Handling using Pandas - II
Data Handling using Pandas - II — CUET Computer Science hero
Class XII 💻 Computer Science ~8 MCQs/year Ch 16 of 20

Data Handling using Pandas - II

CUET unit: Data Handling using Pandas - II

📌 Snapshot

  • Pandas DataFrame operations extend to advanced statistical analysis, data aggregation, sorting, reshaping, and database connectivity — all high-frequency CUET topics.
  • Descriptive statistics (max, min, mean, median, mode, variance, std, quartile) are tested both as direct recall ("which function returns the median?") and as code-output tracing.
  • GROUP BY, pivot, and pivot_table are conceptually rich topics that NTA tests through scenario-based questions requiring students to predict output or identify the correct function.
  • Handling missing values (NaN) — detecting with isnull(), dropping with dropna(), and filling with fillna() — is a dedicated exam-ready subtopic.
  • The MySQL import/export functions (read_sql_query, read_sql_table, to_sql) test knowledge of the pymysql/sqlalchemy connection workflow.

📖 Detailed Notes

2.1 Core concepts

  • Introduction and Case Study: Pandas is a Python library for data manipulation and analysis. A marks DataFrame (marksUT) of 4 students (Raman, Zuhaire, Ashravy, Mishti) across 3 unit tests and 5 subjects (Maths, Science, S.St, Hindi, Eng) is the running example throughout all sections. (NCERT §3.1, p. 63–64)
  • DataFrame.max(): Returns the maximum value of each column. By default includes all data types (Name column returns alphabetically maximum value). Pass numeric_only=True to restrict to numeric columns. Pass axis=1 to get row-wise maximum. (NCERT §3.2.1, p. 65–66)
  • DataFrame.min(): Returns the minimum value of each column or row. Works identically to max() — supports numeric_only=True and axis=1 parameters. (NCERT §3.2.2, p. 67)
  • DataFrame.sum(): Returns the sum of all values per column, regardless of datatype. For text columns it concatenates strings. To get the sum of a specific column, specify the column name: df['Maths'].sum(). For row-wise sum use axis=1. (NCERT §3.2.3, p. 68–69)
  • DataFrame.count(): Returns the total number of non-null values in each column (default axis=0). Use axis=1 to count values per row. Unlike sum/mean, count works on all data types and does not skip NaN automatically — it counts only present values. (NCERT §3.2.4, p. 69–70)
  • DataFrame.mean(): Returns the mean (average) of numeric columns only. df.mean() gives column-wise averages; axis=1 gives row-wise averages. (NCERT §3.2.5, p. 70–71)
  • DataFrame.median(): Returns the middle value of numeric data. When the count of values is even, the median is the average of the two middle values. Only applicable to numeric values. (NCERT §3.2.6, p. 71–72)
  • DataFrame.mode(): Returns the value that appears the most number of times in the data. The mode is defined as the most frequent value. Applicable per column. (NCERT §3.2.7, p. 72–73)
  • DataFrame.quantile(): Divides data into four parts. By default outputs the second quartile (median, q=0.5). Use q=0.25 for Q1 (25th percentile) and q=0.75 for Q3 (75th percentile). Can accept a list: quantile([0.25, 0.75]). (NCERT §3.2.8, p. 73–74)
  • DataFrame.var(): Returns the variance — the average of squared differences from the mean. Applicable to numeric columns only. (NCERT §3.2.9, p. 74–75)
  • DataFrame.std(): Returns standard deviation, calculated as the square root of variance. (NCERT §3.2.10, p. 75)
  • DataFrame.describe(): Displays all descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) in a single command for all numeric columns. (NCERT §3.2.10, p. 75)
  • Note on axis parameter: In most Pandas statistical functions, axis=0 (default) gives column-wise output and axis=1 gives row-wise output. This is the reverse of most other Python operations where axis=0 is rows. (NCERT §3.2.1 Note, p. 67)
  • Data Aggregations (§3.3): Aggregation transforms a dataset into a single numeric value. The aggregate() or agg() function applies one or more aggregate functions (max, min, sum, count, std, var) to columns. Multiple functions can be applied at once: df.aggregate(['max','count']). The axis parameter works the same way. (NCERT §3.3, p. 75–76)
  • Sorting a DataFrame (§3.4): DataFrame.sort_values(by, axis=0, ascending=True) arranges data in ascending or descending order. The by parameter specifies the column(s) to sort on. Sorting on multiple columns: if two rows have the same value for the first sort column, the second column is used as a tiebreaker. Default is ascending order. (NCERT §3.4, p. 77–78)
  • GROUP BY Functions (§3.5): DataFrame.GROUP BY(column) splits data into groups based on a criterion. It follows a split-apply-combine strategy: (1) Split the DataFrame into groups, (2) Apply a function to each group, (3) Combine results into a new DataFrame. Key methods on a GroupBy object: first(), size(), groups, get_group(name). Grouping by multiple columns is also possible. Aggregation functions can be applied on GroupBy objects using agg(). (NCERT §3.5, p. 79–82)
  • Altering the Index (§3.6): By default, a numeric index starting from 0 is assigned. When a DataFrame is sliced, the index becomes non-continuous. reset_index(inplace=True) creates a new continuous index while preserving the original. The original index column can be dropped with drop(columns=['index'], inplace=True). A column can be set as the index using set_index('column_name', inplace=True) and reverted using reset_index('column_name', inplace=True). (NCERT §3.6, p. 82–84)
  • Reshaping Data — pivot() (§3.7.1A): df.pivot(index=, columns=, values=) reshapes a DataFrame by using one column as the new row index, another as new column headers, and a third as cell values. Cells without matching entries are filled with NaN. Pivot fails with a ValueError if the index column has duplicate values. (NCERT §3.7.1A, p. 84–86)
  • Pivoting by Multiple Columns (§3.7.1B): Multiple column names can be passed to the values parameter of pivot(). Omitting values will display all numeric columns in the pivoted form. (NCERT §3.7.1B, p. 86–87)
  • pivot_table() (§3.7.1C): Works like pivot() but handles duplicate index entries by applying an aggregate function. Syntax: pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean'). The default aggregate function is mean. The aggfunc parameter can accept a list of functions or a dictionary mapping columns to different functions. (NCERT §3.7.1C, p. 88–89)
  • Handling Missing Values (§3.8): A missing value is denoted by NaN. Two main strategies: (i) drop the row with missing value, (ii) fill or estimate the missing value. (NCERT §3.8, p. 89–90)
  • Checking Missing Values (§3.8.1): df.isnull() returns a boolean DataFrame — True where value is missing. df.isnull().any() returns True for each column that has at least one missing value. df.isnull().sum() gives the count of NaN per column. df.isnull().sum().sum() gives total NaN count in the entire DataFrame. (NCERT §3.8.1, p. 91–93)
  • Dropping Missing Values (§3.8.2): df.dropna() removes any row containing at least one NaN. Use inplace=True to modify the original DataFrame. dropna(how='any') is the default (drop row if any NaN present). Using dropna reduces dataset size, so it should be used sparingly. (NCERT §3.8.2, p. 94–96)
  • Estimating/Filling Missing Values (§3.8.3): df.fillna(num) replaces all NaN with the specified value. fillna(0) replaces with 0, fillna(1) with 1. fillna(method='pad') replaces a NaN with the value immediately before it (forward fill). fillna(method='bfill') replaces with the value immediately after it (backward fill). (NCERT §3.8.3, p. 96–97)
  • Import/Export between Pandas and MySQL (§3.9): Requires two libraries: pymysql (database driver, install with pip install pymysql) and sqlalchemy (connection interface, install with pip install sqlalchemy). The create_engine() function from sqlalchemy establishes a connection using the connection string: 'mysql+pymysql://username:password@host:port/database_name'. Default MySQL port is 3306. (NCERT §3.9, p. 98–99)
  • Importing from MySQL to Pandas (§3.9.1): Three functions: (1) pd.read_sql_query(query, sql_conn) — reads SQL query result into DataFrame; (2) pd.read_sql_table(table_name, sql_conn) — reads an entire table into DataFrame; (3) pd.read_sql(sql, sql_conn) — reads either a query or a table name. (NCERT §3.9.1, p. 99–100)
  • Exporting from Pandas to MySQL (§3.9.2): DataFrame.to_sql(table, sql_conn, if_exists='fail', index=False/True). The if_exists parameter accepts: 'fail' (default — raises ValueError if table exists), 'replace' (overwrites existing table), 'append' (adds rows to existing table). The index parameter controls whether the DataFrame index is written to MySQL. (NCERT §3.9.2, p. 100–101)

2.2 Definitions to memorise

Term Definition Page
Descriptive Statistics Methods used to summarise data and get a basic idea about the dataset 65
max() Returns the maximum value in each column (or row with axis=1) of a DataFrame 65
min() Returns the minimum value in each column (or row with axis=1) of a DataFrame 67
sum() Returns the sum of all values in each column of a DataFrame 68
count() Returns the total number of non-null values in each column or row 69
mean() Returns the arithmetic average of numeric values in each column 70
median() Returns the middle value of sorted data; average of two middle values when count is even 71
mode() Returns the value that appears the most number of times in the data 72
quantile() Divides data into four parts; Q1=25%, Q2=50% (median), Q3=75% 73
variance (var()) Average of squared differences from the mean 74
Standard Deviation (std()) Square root of variance 75
describe() Displays all descriptive statistics in a single command 75
Aggregation Transforming a dataset to produce a single numeric value; functions: max, min, sum, count, std, var 75
sort_values() Arranges DataFrame data in ascending or descending order by specified column(s) 77
GROUP BY Splits data into groups based on criteria; follows split-apply-combine strategy 79
Altering the Index Changing the row labels/indexes of a DataFrame using reset_index() or set_index() 82
Reshaping Changing the shape (structure) of a DataFrame; done using pivot() or pivot_table() 84
pivot() Reshapes DataFrame by using one column as index, another as column headers, and a third as values; fails with duplicate index entries 85
pivot_table() Like pivot() but handles duplicate entries using an aggregate function (default: mean) 88
NaN Not a Number — symbol used to denote a missing value in a DataFrame 89
isnull() Returns a boolean DataFrame with True where values are missing (NaN) 91
dropna() Removes entire rows containing any missing value from a DataFrame 94
fillna() Replaces missing values with a specified number or using a method (pad/bfill) 96
pymysql Python library (database driver) required to connect Python to MySQL 98
sqlalchemy Python library that provides create_engine() for establishing MySQL connections 98
create_engine() sqlalchemy function that establishes a connection to MySQL using a connection string 98
read_sql_query() Reads result of an SQL query into a pandas DataFrame 99
read_sql_table() Reads an entire MySQL table into a pandas DataFrame 99
to_sql() Writes a pandas DataFrame to a MySQL table 100
Q1 (25th percentile) First quartile — splits the lower 25% of data 73
Q3 (75th percentile) Third quartile — splits the upper 25% of data 73
numeric_only=True Argument restricting a statistical function to numeric columns 65
Forward fill (pad) Fill NaN with the previous non-NaN value 96
Backward fill (bfill) Fill NaN with the next non-NaN value 96
set_index() DataFrame method that turns a column into the index 83
reset_index() DataFrame method that creates a fresh integer index 82
split-apply-combine The conceptual strategy behind GROUP BY 79
agg() / aggregate() DataFrame method applying one or more aggregate functions 75-76
Default port (MySQL) 3306 98
Connection string URL of form mysql+pymysql://user:pwd@host:port/db 98

2.3 Diagrams / processes to remember

  • Split-Apply-Combine diagram (Figure 3.1, p. 79): Shows a two-column DataFrame (key, data) being split into groups A, B, C; sum applied to each group; results combined into a new DataFrame. This is the conceptual backbone of GROUP BY.
  • describe() output table (p. 75): Shows all statistics (count, mean, std, min, 25%, 50%, 75%, max) for every numeric column side by side — the only function that gives a full statistical summary in one command.
  • pivot() vs pivot_table() difference (p. 85–88): pivot() fails on duplicate index values (raises ValueError), while pivot_table() handles duplicates through aggregation (default: mean).
  • fillna() methods (p. 96–97): fillna(0) replaces with zero; fillna(method='pad') replaces with the previous value; fillna(method='bfill') replaces with the next value — all three scenarios are tested in CUET.

2.4 Common confusions / NTA trap points

  • axis=0 vs axis=1 in statistical functions: In Pandas statistical methods (max, min, mean, etc.), axis=0 (default) produces column-wise results and axis=1 produces row-wise results. NTA distractors often swap these — "axis=1 gives column-wise output" is a classic wrong option.
  • pivot() vs pivot_table(): Students confuse when to use each. Key rule: if index column has duplicate values, pivot() raises ValueError and pivot_table() must be used. The default aggfunc in pivot_table() is mean, not sum.
  • dropna() vs fillna(): dropna() removes the entire row (reduces dataset size), while fillna() preserves the row and substitutes a value. NTA sometimes presents scenarios where the wrong strategy is applied.
  • isnull() vs any(): df.isnull() returns a full boolean DataFrame; df.isnull().any() returns one True/False per column; df.isnull().sum() returns count of NaN per column. Confusing these in code-output questions is a frequent error.
  • to_sql() if_exists parameter (NCERT §3.9.2, p. 100-101). 'fail' (default), 'replace', 'append'. 'overwrite' does NOT exist.
  • pivot_table default aggfunc is mean (NCERT §3.7.1C, p. 88). Not sum. NTA distractor: claims sum is default.
  • mode() returns a DataFrame, not a single value (NCERT §3.2.7, p. 72-73). Because data may be multimodal.
  • std()var() (NCERT §3.2.9-10, p. 74-75). std is square root of variance.
  • quantile() defaults to 0.5 (median) (NCERT §3.2.8, p. 73-74).
  • describe() only summarises numeric columns (NCERT §3.2.10, p. 75).
  • read_sql_query vs read_sql_table (NCERT §3.9.1, p. 99). Former takes a query string; latter takes a table name.

🎯 Practice MCQs

First 3 questions free · create a free account to unlock the rest — answers & explanations included, no payment needed

Q1. Which of the following Pandas functions displays all descriptive statistical values (count, mean, std, min, quartiles, max) for all numeric columns in a single command?

▸ Show answer & explanation

Answer: C

`df.describe()` is the only built-in Pandas method that outputs all key descriptive statistics together. `df.aggregate()` applies specific functions you name; `df.summary()` and `df.stat()` do not exist in Pandas. ---

Q2. Consider a DataFrame `df`. What will `df.max(axis=1)` return?

▸ Show answer & explanation

Answer: B

In Pandas statistical functions, axis=1 means operate across columns to give a per-row result. axis=0 (default) gives per-column results. Option A describes the default axis=0 behaviour. ---

Q3. Which of the following statements about `DataFrame.mode()` is correct?

▸ Show answer & explanation

Answer: C

Mode is the most frequent value. Option A describes mean(), Option B describes median(). Option D is incorrect — mode() can operate on non-numeric columns as well. ---

🔒 12 more practice MCQs

Create a free account to unlock every MCQ in this chapter — answers and explanations included. No payment needed.

Already registered? Just log in and they'll all appear here.

📊 Previous-Year Questions

Practise with real CUET Computer Science previous-year papers — every question solved, with the correct answer and a step-by-step explanation.

View solved CUET PYQ papers →

Ready to drill Computer Science?

Unlock all MCQs, chapter tests, mocks & PYQs for ₹199/year.

Get UniDrill Pro