Home

Tutoring

Subjects

Live Classes

Study Coach

Essay Review

On-Demand Courses

Colleges

Games

AP Computer Science Principles

AP Computer Science Principles Quiz: Extracting Information From Data

Practice Extracting Information From Data in AP Computer Science Principles with focused quiz questions that help you check what you know, review explanations, and build confidence with test-style prompts.

What this quiz covers

This quiz focuses on Extracting Information From Data, giving you a quick way to practice the rules, question types, and explanations that matter most for AP Computer Science Principles.

How to use this quiz

Try each quiz question before looking at the correct answer. Use the explanations to review missed ideas, then come back to similar questions until the pattern feels familiar.

Question 1

A musician has a large digital library of several thousand audio files on their computer. The musician wants to quickly create a playlist of all songs released in a specific year.

How does metadata make this task easier?

Metadata allows the audio data to be compressed, reducing the file size and making the files load faster for playback.
Metadata, such as the release year stored with each file, can be used by software to filter and organize the entire collection.
Metadata changes the primary audio data to include a spoken announcement of the year at the beginning of each song file.
Metadata provides an encrypted key that allows the musician to prove they are the legal owner of the audio files.

Explanation: The correct answer is B. Metadata includes fields like artist, album, and release year. Music library software can read this structured information to allow users to efficiently search, sort, and filter their collection, making it easy to find all songs from a specific year without analyzing the audio itself. The other options describe different concepts like compression, data alteration, and digital rights management.
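
As a rough illustration of the filtering described above, the sketch below builds a playlist from release-year metadata alone. The library contents, field names, and the `playlist_for_year` helper are all hypothetical; real music software reads such fields from tags stored with each file.

```python
# Hypothetical library: each entry holds only metadata, not audio data.
library = [
    {"title": "Song A", "artist": "Artist 1", "year": 2019},
    {"title": "Song B", "artist": "Artist 2", "year": 2021},
    {"title": "Song C", "artist": "Artist 1", "year": 2019},
]


def playlist_for_year(songs, year):
    """Return the titles of songs whose release-year metadata matches."""
    return [song["title"] for song in songs if song["year"] == year]


playlist_2019 = playlist_for_year(library, 2019)
print(playlist_2019)
```

The audio itself is never inspected; the filter only compares one metadata field per song.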

Question 2

A user changes the filename of a digital audio file from track01.mp3 to favorite_song.mp3 using their computer's file manager.

Which of the following best describes the effect of this action?

The primary audio data is altered, which will change the sound of the music when the file is played.
The file's metadata is changed, but the primary audio data that represents the music remains unchanged.
Both the metadata and the primary audio data are compressed to reduce the file's overall storage size.
A new piece of metadata is added to the file to track the history of all of its previous filenames for recovery.

Explanation: The correct answer is B. The filename is a piece of metadata used by the operating system to identify and organize the file. Changing it does not alter the underlying sequence of bits that represent the music itself (the primary data). Therefore, the music will sound identical after the file is renamed.
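
One way to check this claim is to hash a file's contents before and after renaming it. The sketch below uses placeholder bytes in a temporary folder rather than a real audio file, and the `sha256_of` helper is a name chosen for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 digest of a file's contents (its primary data)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    old_path = os.path.join(folder, "track01.mp3")
    new_path = os.path.join(folder, "favorite_song.mp3")

    # Placeholder bytes standing in for real audio data.
    with open(old_path, "wb") as f:
        f.write(b"placeholder bytes standing in for audio data")

    digest_before = sha256_of(old_path)
    os.rename(old_path, new_path)  # changes only the filename
    digest_after = sha256_of(new_path)

print(digest_before == digest_after)
```

Matching digests indicate that the stored bytes are identical, so only the name changed.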

Question 3

A city government has a dataset of all public park locations and a separate dataset of all reported crime incidents with their specific locations. A data analyst combines these two datasets.

What new knowledge could be generated by combining these two data sources that is not available from either source alone?

The total number of parks in the city and the total number of crimes reported during the year.
The names of the parks that are largest in size and the most common type of reported crime.
An identification of which parks have the highest or lowest crime rates, by linking crimes to park locations.
The average response time for police to arrive at a crime scene located inside a public park.

Explanation: The correct answer is C. Combining datasets allows for the discovery of relationships between the data points. By linking crime locations to park locations, the analyst can create new information—the crime rate within parks—which was not present in either of the original datasets alone. The other options describe information that can be found in the individual datasets or would require a third, different dataset.
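
The linking step can be sketched as follows, assuming each park is reduced to a center point on a simple grid and a crime counts toward a park if it falls within a fixed radius. The park names, coordinates, and radius are made up for illustration; a real analysis would use actual park boundaries and would divide by area or visitor counts to get a rate.

```python
# Hypothetical data: park centers and crime locations on a simple x/y grid.
parks = {
    "Riverside": (0.0, 0.0),
    "Hilltop": (10.0, 10.0),
}
crimes = [(0.5, 0.2), (0.1, -0.3), (9.8, 10.1), (50.0, 50.0)]
RADIUS = 1.0  # assumed distance within which a crime is linked to a park


def crimes_per_park(parks, crimes, radius):
    """Count the crimes that fall within `radius` of each park center."""
    counts = {name: 0 for name in parks}
    for cx, cy in crimes:
        for name, (px, py) in parks.items():
            if (cx - px) ** 2 + (cy - py) ** 2 <= radius ** 2:
                counts[name] += 1
    return counts


counts = crimes_per_park(parks, crimes, RADIUS)
print(counts)
```

Neither dataset alone contains a per-park count; it only appears once locations from both are compared.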

Question 4

A tech company develops a new application and wants to gather user feedback. They collect data by sending a survey link exclusively to a list of technology bloggers and journalists who have previously written about similar apps.

Which of the following is the most likely source of bias in the collected data?

The data collection method over-represents individuals who are technology experts, whose opinions may not reflect the general user population.
The survey questions might be excessively long, leading to incomplete data from some of the respondents who start but do not finish.
The company is collecting too much data from each respondent, which will make it difficult to process and analyze the results effectively.
The data are collected anonymously, which prevents the company from asking follow-up questions to the respondents to clarify their answers.

Explanation: The correct answer is A. The data is being collected from a specific, non-representative group (tech bloggers and journalists). This method introduces sampling bias because this group's experience, expectations, and technical skills may differ significantly from those of an average user, leading to skewed feedback.

Question 5

A web form asks users to enter their birth year. A data analyst later finds entries such as "1995", "95", and "Two Thousand". The analyst is trying to calculate the average age of users.

Which of the following data processing challenges is best illustrated by this scenario?

The need to handle invalid and non-uniform data through a data cleaning process before performing calculations.
The issue of scalability, as the dataset is too large for a single person to analyze manually without assistance.
The ethical concern of collecting personally identifiable information such as birth year without explicit user consent.
The problem of finding correlations in the data that do not indicate a real causal relationship between variables.

Explanation: The correct answer is A. The data contains entries that are not in a consistent, numerical format ('Two Thousand', '95'). These are invalid or non-uniform entries for the purpose of a numerical calculation. Before the average age can be calculated, the data must be cleaned by standardizing the valid entries into a single format (e.g., four-digit year) and handling or removing the invalid ones.
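
A minimal cleaning sketch for the entries in this question is shown below. It assumes that two-digit entries mean 19xx and that anything else non-numeric is discarded; both are illustrative choices, not the only reasonable ones, and `clean_birth_year` is a hypothetical helper.

```python
def clean_birth_year(raw):
    """Return a four-digit year as an int, or None if the entry is unusable."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # e.g. "Two Thousand"
    if len(text) == 4:
        return int(text)
    if len(text) == 2:
        # Assumption for this sketch: two-digit years belong to the 1900s.
        return 1900 + int(text)
    return None


entries = ["1995", "95", "Two Thousand"]
cleaned = [clean_birth_year(entry) for entry in entries]
valid_years = [year for year in cleaned if year is not None]
average_year = sum(valid_years) / len(valid_years)
print(cleaned, average_year)
```

Only after this step does a numerical calculation such as an average become possible.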

Question 6

A marketing analyst wants to combine a customer database from their company with a publicly available demographic dataset. In the company database, customer locations are stored by ZIP code. In the public dataset, locations are stored by city and state name.

Which of the following describes a significant challenge the analyst will face when trying to combine these data sources?

The public dataset is likely too large to be processed without using specialized parallel computing hardware.
The two datasets use different formats for location data, requiring a transformation or lookup table to link records.
The company's customer data contains personally identifiable information, which presents a significant security risk.
The data in both sources might contain a strong correlation that does not actually indicate a causal relationship.

Explanation: The correct answer is B. A major challenge in combining data from different sources is reconciling differences in how data is formatted or represented. To link a customer record by ZIP code to demographic data by city/state, the analyst would need a way to map ZIP codes to city/state names. This data transformation is a necessary step before the two datasets can be effectively combined.
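
The lookup-table approach might look like the sketch below. The ZIP codes, place names, and demographic values are placeholders invented for illustration, and `link_records` is a hypothetical helper; a real lookup table would come from an authoritative source.

```python
# Hypothetical lookup table mapping ZIP codes to (city, state).
zip_to_city_state = {
    "00001": ("Exampleville", "EX"),
    "00002": ("Sampleton", "EX"),
}

# Hypothetical company records, keyed by ZIP code.
customers = [
    {"name": "Customer 1", "zip": "00001"},
    {"name": "Customer 2", "zip": "00002"},
    {"name": "Customer 3", "zip": "99999"},  # ZIP missing from the lookup
]

# Hypothetical public dataset, keyed by (city, state).
demographics = {
    ("Exampleville", "EX"): {"median_age": 34},
    ("Sampleton", "EX"): {"median_age": 41},
}


def link_records(customers, lookup, demographics):
    """Attach demographic fields to each customer via the ZIP lookup."""
    linked, unmatched = [], []
    for customer in customers:
        place = lookup.get(customer["zip"])
        if place is None or place not in demographics:
            unmatched.append(customer["name"])
            continue
        city, state = place
        linked.append({**customer, "city": city, "state": state,
                       **demographics[place]})
    return linked, unmatched


linked, unmatched = link_records(customers, zip_to_city_state, demographics)
print(linked)
print(unmatched)
```

Records that cannot be mapped are reported separately rather than silently dropped, so the analyst can see how much data the transformation loses.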

Question 7

A city's transportation department collects data on the number of passengers using public transit each day. After analyzing several years of data, they notice that ridership consistently increases by 15% during the months of June, July, and August.

What does this analysis of the data primarily demonstrate?

A trend of higher ridership during summer months.
A causal relationship between summer weather and the decision to use public transit.
A data-cleaning process that removed outlier passenger counts for certain months.
An anomaly in the data collection method during one specific year of the study.

Explanation: The correct answer is A. The consistent, repeated increase in ridership during the same months each year is a pattern or trend. Option B is incorrect because while there is a correlation, the data provided does not prove that summer weather is the direct cause of the increase. Option C is incorrect because the description is about a finding from analysis, not the process of cleaning the data beforehand. Option D is incorrect because an anomaly would be a one-time or unusual event, whereas the scenario describes a consistent pattern over several years.

Question 8

A team of astronomers is processing image data from a new space telescope. The volume of data is so large that it cannot be stored or analyzed on a single, high-performance desktop computer.

Which of the following is a challenge specifically associated with processing such a large dataset?

The data must be cleaned to remove invalid entries or artifacts before it can be used for any scientific analysis.
The data may contain biases introduced by the telescope's specific location and limited viewing angle.
The computational capacity of a single machine is insufficient, likely requiring parallel or distributed systems for processing.
The data must be converted from an analog signal to a digital format before it can be analyzed by computer programs.

Explanation: The correct answer is C. Datasets can become so large that they exceed the memory, storage, or processing power of a single computer. This is a scalability challenge. The solution often involves using parallel systems (multiple processors on one machine) or distributed systems (multiple machines working together) to handle the storage and computation. While other options can be challenges, this one is specifically related to the large size of the dataset.

Question 9

A user takes a digital photograph with a modern smartphone. The photograph file contains both the image data itself and additional information about the image.

Which of the following is an example of metadata that would be associated with the digital photograph file?

The total number of pixels that are predominantly blue in the photograph's image.
The specific time and GPS coordinates where the photograph was taken.
A compressed version of the image data used for creating a thumbnail preview.
The names of any individuals identified in the photograph through facial recognition.

Explanation: The correct answer is B. Metadata is data about data. The image itself is the primary data. The time and location where the photo was taken are descriptive pieces of information about the photo file, not part of the visual content itself, making them metadata. The other options describe information derived or extracted from the primary image data, not metadata about the file's creation or properties.

Question 10

A retail chain stores sales transactions to track monthly performance across product categories. Each receipt becomes one record with the fields Month (Jan–Apr), Category (Electronics), and Amount (USD), and the amounts are summed by month for reports. Data is collected automatically from the point-of-sale system and used to plan inventory. Monthly Electronics sales totals: Jan $42,000; Feb $39,000; Mar $45,000; Apr $51,000.

Based on this dataset, what trend can be observed in monthly Electronics sales totals?

Sales peak in February and then steadily decline.
Sales dip in February, then rise through April.
Sales remain the same across all four months.
Sales are highest in January and lowest in April.

Explanation: This question tests AP Computer Science Principles skills, specifically extracting and interpreting information from data. Data extraction involves identifying relevant patterns and trends from structured datasets, essential for making informed decisions. In the dataset provided, Electronics sales show a dip from $42,000 in January to $39,000 in February, followed by consistent growth through March ($45,000) and April ($51,000), highlighting a V-shaped recovery pattern. Choice B is correct because the data clearly indicates sales decrease in February then rise steadily through April, demonstrating understanding of non-linear trends in time series data. Choice A is incorrect because it claims sales peak in February when February actually shows the lowest sales figure at $39,000, misinterpreting the data completely. This error often occurs when students read data too quickly or confuse months. To help students: Teach them to create simple line graphs to visualize trends over time. Practice identifying turning points and recovery patterns in business data. Watch for: students who only compare adjacent months rather than viewing the complete trend, or those who assume all trends must be linear.
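
The month-to-month comparison can be checked with a few lines of code. The sketch below uses the four totals given in the question; the variable names are chosen for illustration.

```python
# Monthly Electronics totals from the question, in USD.
totals = {"Jan": 42000, "Feb": 39000, "Mar": 45000, "Apr": 51000}

months = list(totals)
# Change from each month to the next (negative means a decline).
changes = {
    f"{earlier}->{later}": totals[later] - totals[earlier]
    for earlier, later in zip(months, months[1:])
}
lowest_month = min(totals, key=totals.get)

print(changes)
print(lowest_month)
```

One negative change followed by two positive ones matches the dip-then-rise pattern in choice B.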

Question 11

A market research firm analyzes data from a large number of cities and finds a strong correlation between the number of ice cream shops in a city and the number of public swimming pools in that city.

Which of the following is the most reasonable conclusion that can be drawn from this correlation?

Opening more ice cream shops in a city will cause the city to build more swimming pools.
Building more swimming pools in a city will cause more ice cream shops to open.
Cities with more ice cream shops are also likely to have more swimming pools, possibly due to a third factor like a warmer climate.
The data collected by the firm must be biased, as there is no logical connection between ice cream shops and swimming pools.

Explanation: The correct answer is C. The data shows a correlation, meaning the two variables tend to increase together. However, correlation does not imply causation. A third, unmeasured variable (a 'lurking variable') such as a city's warmer climate or larger population is a likely reason for both more pools and more ice cream shops. Options A and B incorrectly assume a causal relationship. Option D is incorrect because a correlation can exist even if the causal link isn't direct; dismissing the data as biased is not the correct interpretation.

Question 12

A health researcher wants to investigate the relationship between air quality and the incidence of respiratory illness in a specific region. The researcher has access to a dataset from local hospitals containing patient diagnoses and admission dates.

To achieve their goal, which of the following additional datasets would be most useful to combine with the hospital data?

A dataset of local weather patterns, including daily temperature and precipitation for the region.
A dataset from environmental agencies that contains daily air pollutant levels for the same region.
A dataset of census information showing the population density and demographics of the region.
A dataset of sales records from local pharmacies for over-the-counter cold remedies in the region.

Explanation: The correct answer is B. To study the relationship between air quality and respiratory illness, the researcher needs data on both variables. The hospital data provides information on illness, so combining it with a dataset on air pollutant levels directly provides the other necessary variable for the analysis. The other datasets might be useful for a broader study but are not as central to the stated goal.

Question 13

A company collects customer addresses through an online form. When analyzing the data, they find entries for the state of California recorded as "CA", "Calif.", and "California". To effectively analyze the data by state, this issue must be resolved.

Which of the following data processing tasks is required to solve this problem?

Data filtering, which involves removing all records that do not have a standard state abbreviation.
Data compression, which involves reducing the storage space required for the state information.
Data cleaning, which involves making the data uniform by converting all variations to a single, consistent format.
Data scaling, which involves adjusting the range of numerical data to fit within a common scale for analysis.

Explanation: The correct answer is C. Data cleaning is the process that makes data uniform without changing its meaning. In this scenario, converting the different representations of California to a single, standard format (like 'CA') is a classic data cleaning task to ensure consistency for analysis. Filtering would result in data loss, compression deals with size, and scaling applies to numerical data.
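
A small sketch of this kind of cleaning is shown below, using an alias table that covers only the three variants from the question. The `STATE_ALIASES` table and `normalize_state` helper are hypothetical; a real table would need entries for every state and spelling the form allows.

```python
# Known variants mapped to one standard abbreviation (illustrative only).
STATE_ALIASES = {
    "ca": "CA",
    "calif.": "CA",
    "california": "CA",
}


def normalize_state(raw):
    """Return the standard abbreviation, or the trimmed input if unknown."""
    text = raw.strip()
    return STATE_ALIASES.get(text.lower(), text)


entries = ["CA", "Calif.", "California"]
normalized = [normalize_state(entry) for entry in entries]
print(normalized)
```

Unlike filtering, no records are removed; every variant is kept and rewritten into the same form.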

Question 14

A social media platform analyzes user activity data, including posts, likes, and connections between users. By processing this data, the platform can identify groups of users who frequently interact with each other and share common interests.

What kind of information is being extracted from the data in this scenario?

Metadata describing the file types and creation dates of user-uploaded images and videos.
Patterns of community structure and user relationships within the platform's network.
A causal link proving that using the platform leads to forming new real-world friendships.
The raw text and image content from every individual user post made on the platform.

Explanation: The correct answer is B. The process described involves analyzing raw user activity data to discover higher-level patterns—specifically, how users are connected and form communities. This is an example of extracting new information and insight (patterns) from existing data. Option D is the raw data itself, not the extracted information. Option A is metadata, not interaction patterns. Option C assumes causation, which the data does not prove.
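
As a simplified sketch of extracting group structure, the code below treats each interaction as a link between two users and collects users who are connected to one another, directly or through others. The usernames and the `find_groups` helper are hypothetical, and real platforms use far more sophisticated methods than this connected-groups approach.

```python
from collections import defaultdict

# Hypothetical interaction records: each pair interacted at least once.
interactions = [("ana", "ben"), ("ben", "cy"), ("dee", "eli")]


def find_groups(pairs):
    """Group users who are linked directly or through other users."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, groups = set(), []
    for user in sorted(neighbors):
        if user in seen:
            continue
        group, stack = set(), [user]
        while stack:  # walk outward from this user along interactions
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(neighbors[current] - seen)
        groups.append(sorted(group))
    return groups


groups = find_groups(interactions)
print(groups)
```

The groups are not stored anywhere in the raw records; they emerge only from processing the interactions together.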

Question 15

An e-book file contains the text of a novel as its primary data.

For which of the following tasks would analyzing the e-book's metadata be more useful than analyzing its primary data?

Determining the frequency of a specific character's name appearing throughout the novel.
Finding the total number of words contained in the entire novel.
Identifying the publisher and the original publication date of the e-book.
Creating an automated summary of the novel's main plot points.

Explanation: The correct answer is C. The publisher and publication date are facts about the novel file, not part of the story's content. This information is typically stored as metadata, making it easy to access without having to process the entire text of the book. The other options all require analysis of the primary data (the text of the novel).

Question 16

A scientist is analyzing a dataset of weather observations where the temperature sensor occasionally failed to record a value, leaving some entries blank.

What is the most significant challenge this presents for the analysis?

The blank entries introduce a systematic bias, making the calculated average temperature appear higher than it actually is.
The incomplete data may prevent certain calculations, like a daily average, or require estimation methods to fill in missing values.
The blank entries are likely part of a lossless compression scheme used to store the data more efficiently.
The dataset is considered too large to be processed on a single computer due to the presence of the blank entries.

Explanation: The correct answer is B. Incomplete data is a common challenge in data processing. Missing values, like the blank temperature readings, can prevent or skew calculations such as an average. Analysts must decide how to handle them, for example, by removing the incomplete records or by using statistical methods to estimate the missing values, which adds complexity to the analysis.
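
The two handling strategies mentioned above can be sketched as follows, with blanks represented as `None`. The readings are made up, and filling gaps with the mean of the recorded values is just one simple estimation method among many.

```python
# Hypothetical temperature readings; None marks a failed sensor reading.
readings = [20.0, None, 22.0, 24.0, None]

# Strategy 1: drop the incomplete entries before averaging.
recorded = [value for value in readings if value is not None]
mean_recorded = sum(recorded) / len(recorded)

# Strategy 2: estimate each missing value, here using the recorded mean.
filled = [value if value is not None else mean_recorded for value in readings]

print(mean_recorded)
print(filled)
```

Calling `sum(readings)` directly would raise a `TypeError` because of the `None` entries, which is the kind of blocked calculation the explanation describes.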

Question 17

An online retail website collects data on customers' purchase histories and browsing activities. The website's software analyzes this data to suggest other products the customer might like.

This functionality is primarily an example of using data to do which of the following?

Identify trends in overall sales from one month to the next.
Make connections between different products based on user behavior.
Clean the customer data to ensure all shipping addresses are valid.
Compress the purchase history data to save storage space on servers.

Explanation: The correct answer is B. Recommendation engines work by finding connections and patterns in data. In this case, the system identifies that customers who buy or view product X also tend to buy or view product Y, and it uses this discovered connection to make a recommendation to a new customer looking at product X.
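
A toy version of this idea counts how often two products appear in the same order and suggests the most frequent companions. The orders and helper names below are hypothetical, and production recommendation systems are considerably more elaborate.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of products per order.
orders = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"novel"},
]


def co_purchase_counts(orders):
    """Count how many orders contain each pair of products."""
    counts = Counter()
    for order in orders:
        for a, b in combinations(sorted(order), 2):
            counts[(a, b)] += 1
    return counts


def recommend(product, counts):
    """List products bought together with `product`, most frequent first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [item for item, _ in scores.most_common()]


pair_counts = co_purchase_counts(orders)
suggestions = recommend("tent", pair_counts)
print(suggestions)
```

The connection between products comes entirely from observed behavior; nothing about the products themselves is compared.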

Question 18

To study the average sleep habits of teenagers, a researcher surveys students who are arriving at a high school at 7:00 AM on a school day.

How might this data collection method introduce bias into the study's findings?

The method is likely to under-represent teenagers who get more sleep, as they might arrive at school at a later time.
The data is invalid because the researcher did not survey every teenager in the entire world for the study.
The data is incomplete because it does not include information about what the teenagers ate for breakfast that morning.
The method relies on self-reported data from students, which is a common form of professional data cleaning.

Explanation: The correct answer is A. The sample is limited to students who arrive at school very early. This group may have different sleep habits (e.g., they may wake up earlier and get less sleep) than the general teenage population. This sampling bias would likely skew the results to show a lower average amount of sleep than is actually the case for all teenagers.

Question 19

A research institute maintains a large database of scientific articles. Each article is stored as a PDF file, and the database also stores the title, authors, publication year, and keywords for each article in separate fields.

Which of the following is the most likely reason the database stores this additional information separately from the content of the PDF files?

This information is metadata that allows for efficient searching and filtering without needing to open and read every PDF file.
This information is considered the primary data by the institute, while the PDF file is just a backup copy for reference.
Storing this information separately provides a form of lossless data compression for the large PDF files.
It is a security measure to prevent the PDF files from being altered, as changing the metadata would invalidate the file.

Explanation: The correct answer is A. Storing structured information like title, author, and keywords as metadata allows the database system to perform very fast searches and queries. If the system had to open and scan the full text of every PDF for each search, it would be extremely inefficient. The metadata provides a structured, easily searchable index to the collection of primary data (the articles themselves).
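
The searchable index described above could be sketched like this, assuming each article's metadata is a small record and the PDF is referenced only by filename. The records, filenames, and `build_keyword_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata records; the PDFs themselves are never opened.
articles = [
    {"file": "paper_001.pdf", "year": 2020, "keywords": ["climate", "ocean"]},
    {"file": "paper_002.pdf", "year": 2021, "keywords": ["Climate", "policy"]},
    {"file": "paper_003.pdf", "year": 2021, "keywords": ["genetics"]},
]


def build_keyword_index(articles):
    """Map each lowercase keyword to the set of files tagged with it."""
    index = defaultdict(set)
    for article in articles:
        for keyword in article["keywords"]:
            index[keyword.lower()].add(article["file"])
    return index


keyword_index = build_keyword_index(articles)
climate_files = sorted(keyword_index["climate"])
files_from_2021 = [a["file"] for a in articles if a["year"] == 2021]

print(climate_files)
print(files_from_2021)
```

Both queries touch only the small metadata records, which is why they stay fast even when the PDFs are large.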

Question 20

A company claims their new productivity app helps students improve their grades. They support this claim with data showing that 80% of students who use their app for at least five hours a week also report getting A's and B's in their classes.

Why might this data be insufficient to conclude that the app causes grades to improve?

The data only shows a correlation; it is possible that students who are already high-achieving are more likely to use a productivity app.
The data is inherently biased because it was collected by the same company that makes and sells the app.
The data is incomplete because it does not include students who use the app for less than five hours a week.
The data sample is too small to be meaningful, as it only includes students who report getting A's and B's.

Explanation: The correct answer is A. This is a classic example of correlation not implying causation. While there is a connection (correlation) between using the app and getting good grades, it's possible that a third factor (like being a motivated or organized student) causes both outcomes. The data provided does not rule out this alternative explanation, so a causal claim cannot be made.