First Steps in Modeling

Getting Your Hands Dirty: Your First Model

Tiny predictions shape daily life. They steer your route, remind you to grab an umbrella, and help a cashier decide if you look old enough for a movie. Data-driven guesses keep things flowing.

Computers make the same calls but faster and with mountains of information your brain can’t hold. They rely on clear rules and don’t get tired, so their results stay steady.

Meet Your Data: Getting Ready to Model

Your computer can’t guess from nothing. It needs carefully collected examples—rows in a table that capture past situations.

Each column is a feature that describes the row, such as rooms, square footage, or age of a house. One column is the target, the value you hope to predict, like price.
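To make the feature and target split concrete, here is a minimal sketch on a tiny hand-made table. The column names and values are invented for illustration and are not taken from any real file.

```python
import pandas as pd

# Hypothetical toy table; names and numbers are made up for illustration.
houses = pd.DataFrame({
    "rooms": [3, 4, 2, 5],
    "square_feet": [1200, 1800, 850, 2400],
    "age_years": [30, 12, 45, 5],
    "price": [210_000, 340_000, 150_000, 480_000],
})

# Features: every column except the target.
X = houses.drop(columns=["price"])
# Target: the single column we want to predict.
y = houses["price"]

print(X.shape)  # (4, 3): four rows, three features
print(y.shape)  # (4,): one target value per row
```

By convention the feature table is called X and the target y, which is how they appear in the later snippets.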

You load data with pandas in Python.

import pandas as pd

data = pd.read_csv("house_prices.csv")  # pretend you have this file
print(data.head())

Calling head() shows the first five rows by default, so you can spot issues such as missing values or mislabeled columns early.
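A few more quick checks can help at this stage. The sketch below reads a small CSV from an in-memory string instead of a file, so it runs anywhere; the contents are hypothetical, and one price is left blank on purpose.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
# The second data row has no price.
csv_text = """rooms,square_feet,age_years,price
3,1200,30,210000
4,1800,12,
2,850,45,150000
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())    # first rows of the table
print(sample.dtypes)    # data type of each column

# Count missing values per column.
missing = sample.isna().sum()
print(missing)
```

Here the price column reports one missing value, which you would need to fill in or drop before fitting a model.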

Splitting Up: Training and Testing Sets

A model should never judge itself on data it already saw. Fresh questions reveal real skill.

So you divide your table. One slice trains the model, the other measures its performance on unseen rows.

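from sklearn.model_selection import train_test_split

# Assumes the target column is named "price"; adjust to match your file.
X = data.drop(columns=["price"])  # features
y = data["price"]                 # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)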

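Here test_size=0.2 holds out 20% of the rows for testing, and random_state=42 makes the shuffle reproducible. A 20% share is a common starting point; whether it leaves enough rows to trust the score depends on how large your table is.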
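As a small illustration of what the split produces, the sketch below runs train_test_split on made-up arrays with ten rows and checks the resulting sizes. The data is hypothetical and chosen only to make the row counts easy to follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state gives the same split.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)  # True
```

No row lands in both slices, so the test rows stay unseen during training.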

Your First Model: Fitting and Predicting

A model is a learned equation built from your training slice. Fitting means teaching this equation to match the patterns.
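One way to see the "learned equation" idea is to fit a linear model on data whose rule you already know. The sketch below uses scikit-learn's LinearRegression on made-up points that follow y = 2x + 1 exactly, then reads back the slope and intercept the model found.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from the rule y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)  # fitting: find the slope and intercept that match the data

slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)  # approximately 2.0 and 1.0

# Predict for an input the model has not seen.
new_prediction = model.predict(np.array([[4.0]]))[0]
print(new_prediction)  # approximately 9.0
```

Real data is noisy, so the recovered numbers will rarely be this clean, but the mechanics are the same.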

For prices, pick LinearRegression. For labels like spam or not spam, choose LogisticRegression.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

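Then check the root mean squared error (RMSE). It measures the typical size of a prediction error in the same units as the target, so lower is better.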

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, predictions) ** 0.5)
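To show what that number means, the sketch below computes RMSE by hand on three made-up prices and compares it with the scikit-learn route used above. The values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true prices and predictions.
y_true = np.array([200_000, 300_000, 400_000])
y_hat = np.array([210_000, 290_000, 430_000])

# Step by step: errors, squared, averaged, then square-rooted.
errors = y_hat - y_true                      # 10000, -10000, 30000
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn.
rmse_sklearn = mean_squared_error(y_true, y_hat) ** 0.5

print(rmse_manual, rmse_sklearn)  # both roughly 19,149
```

Squaring makes the single 30,000 miss weigh more than the two 10,000 misses, which is why the result sits above the average absolute error of about 16,667.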

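Switch to classification metrics when the target is a category. The snippet below assumes X_train, X_test, y_train, and y_test come from a dataset whose target holds class labels, such as spam or not spam, rather than house prices. It reports accuracy, the share of test rows labeled correctly.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)      # y_train must contain class labels
y_pred = clf.predict(X_test)   # predicted label for each test row
print((y_pred == y_test).mean())  # accuracy: fraction of correct labels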
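For a version you can run without a labeled file, the sketch below generates a synthetic two-class dataset with scikit-learn's make_classification. The data is artificial and the sizes are arbitrary choices. It also checks that the manual accuracy matches accuracy_score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 4 features, two classes (0 and 1).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy two ways: by hand and with the library helper.
acc_manual = (y_pred == y_test).mean()
acc_sklearn = accuracy_score(y_test, y_pred)
print(acc_manual, acc_sklearn)
```

Accuracy is easy to read but can mislead when one class is rare, so it is worth checking the class balance before relying on it.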