Data Organization to Drive Decisions
August 01, 2024
mtcars
SalesData.xlsx
No class on September 2nd, Monday.
Grading Schedule: I plan to grade your submissions every Monday after our class.
Attendance and Participation: All Attendance and Participation assignments will be equally weighted at the end of the semester.
Business Analytics and Information Management (BAIM) Major: Plan of study.
BAIM Association: Learn how you can get involved.
Instagram: @baim.association
Structured Data: Data that is organized in a defined format, such as rows and columns in a database (e.g., an Excel spreadsheet).
Unstructured Data: Data that does not have a predefined structure, such as text, emails, social media posts, videos, and images.
Semi-Structured Data: Data that does not conform to a strict structure but contains tags or markers to separate elements (e.g., XML or JSON files).
Business data refers to the information gathered by an organization, such as customer data, financial data, sales data, employee data, and more. Business data can come from a wide variety of sources - from customers’ purchase transactions and social media activities to market research and financial reports.
Edgar F. Codd: Proposed the relational model in 1970, which became the foundation for modern databases.
RDBMS: A Relational Database Management System (RDBMS) is used to maintain relational databases.
SQL: Structured Query Language (SQL) was developed to interact (query and update) with relational databases.
Adoption: The relational model became dominant in the 1980s, with systems like Oracle, IBM DB2, and Microsoft SQL Server emerging. Nowadays, open-source systems like MySQL are used by big companies to handle their relational data.
Limitations of RDBMS: Traditional relational databases struggled with the scale and complexity of modern web applications.
NoSQL Databases: Emerged to address these challenges. They offer flexibility, scalability, and performance improvements. They are designed to scale horizontally (adding more servers), making them better suited for handling such large-scale data across distributed systems.
Types: Document (e.g., MongoDB), Key-Value (e.g., Redis), Column-Family (e.g., Cassandra), and Graph (e.g., Neo4j).
Use Cases: Ideal for big data, handling unstructured data, real-time web applications, and distributed systems.
NewSQL: Combines the scalability of NoSQL with the Atomicity, Consistency, Isolation, and Durability (ACID) guarantees of traditional relational databases (e.g., Google Spanner).
Cloud Databases: The adoption of cloud computing has led to the rise of managed database services (e.g., Amazon RDS, Google Cloud SQL).
Data Lakes: A storage repository that holds vast amounts of raw data in its native format (e.g., AWS S3, Azure Data Lake).
A relational database links data tables through pre-defined and shared fields in various data tables, establishing relationships.
This permits more efficient organization and utilization of data across multiple tables.
Moreover, a relational database serves as a potent tool for handling extensive data volumes and managing complex data structures.
Query 1: Flights from a Specific Carrier
Query 2: Weather Conditions at the Time of a Specific Flight
SELECT flights.flight, flights.origin,
flights.dest, weather.*
FROM flights
JOIN weather
ON flights.year = weather.year AND
flights.month = weather.month AND
flights.day = weather.day AND
flights.hour = weather.hour AND
flights.origin = weather.origin
WHERE flights.flight = 'AA123';
-- Replace 'AA123' with the desired flight number
A Data Warehouse is a large and comprehensive storage system that consolidates data from various sources, including relational databases, into a centralized repository, much like a university campus that encompasses buildings of various functions.
The primary goal of an Information Management System (IMS) is to ensure that accurate, timely, and relevant information is generated and available to the right people at the right time, enabling efficient and informed decision-making processes.
Finance Data: Information related to financial transactions, budgeting, and accounting.
CRM Data: Customer Relationship Management data, including customer interactions, sales, and service records.
Operations Data: Data concerning the day-to-day operations of a business, such as supply chain, inventory, and production.
More Data: Any additional data sources that contribute to the organization’s information ecosystem.
Extract: Data is collected from various sources, such as finance systems, CRM systems, and operations databases.
Transform: The extracted data is cleaned, aggregated, and formatted to fit the data warehouse schema.
Load: The transformed data is loaded into the Data Warehouse, where it is stored and made available for analysis.
Data Warehouse: A centralized repository that stores integrated data from multiple sources, optimized for query and analysis.
OLAP Engine (Online Analytical Processing): Tools that allow for complex analytical queries and multi-dimensional data analysis.
Analytic Reporting Engine: Produces reports and dashboards for users to visualize and understand the data.
Advanced Analytics: Includes data mining, predictive modeling, and other sophisticated analytical techniques to uncover hidden patterns and insights.
Users: Business analysts, managers, and executives who use the IMS to make informed decisions.
Outcome: The IMS enables data-driven decision-making, improving efficiency, reducing risks, and enhancing overall business performance.
Metadata is essentially information about structured data. It can include details like the date and time a database or file was created, who created it, and what types of information it contains.
A data dictionary is a more specific type of metadata that describes the structure, content, and format of a dataset. It’s like a guidebook and a codebook that provides a comprehensive list of all the variables or columns in a dataset, along with their definitions, data types, and other attributes.
mtcars
The mtcars
data file provides information on various features of different brands of cars, including their engine size, horsepower, and fuel efficiency. The dataset is structured as a table, where each row represents a different car, and each column represents a different variable or feature of the cars.
mtcars
?mtcars
SalesData.xlsx
This data table is collected and prepared by the IT department of a company, OfficeChain, about its eCommerce sales channel.
SalesData.xlsx
There are plenty of college courses to choose (course titles may vary by schools):
Database Management Systems: This course focuses on the design, implementation, and management of databases, teaching students how to organize and manage data effectively.
Information Security and Privacy: This course covers the principles and practices of securing information and ensuring data privacy, preparing students to handle data security challenges.
Data Governance and Management: This course explores the governance and management of data assets, including data quality, data integration, and data lifecycle management.
Information Systems Analysis and Design: This course teaches students how to analyze business requirements and design information systems to meet organizational needs.
Structured Data: Highly organized and formatted data that is easily searchable (e.g., tables with rows and columns).
Databases: Used to store and manage structured data efficiently.
Relational Databases: Organizes data into tables that can be linked by shared keys.
Non-Relational Databases: NoSQL databases designed for unstructured data and scalability.
- Advantages: Handle large-scale data across distributed systems more effectively than traditional RDBMS.
Meta Data: Information describing other data, providing context and making it easier to understand.
Data Dictionary: Detailed description of dataset variables, ensuring consistent data usage.
Business Case Studies (mtcars
and SalesData.xlsx
)
mtcars
Dataset: Structured data on car specifications, useful for learning data manipulation and statistical analysis.SalesData.xlsx
: eCommerce sales data, suitable for business analytics case studies.Data Mining Lab