Discover the Julia Machine Learning Ecosystem

A Comprehensive Overview

Daniel Molina Cabrera

December 4, 2024

Source of the material

Disclaimer: It contains personal opinions

About me

  • Assistant Professor at the University of Granada.
  • Researcher in Artificial Intelligence and Intelligent Optimization.
  • Teaching ML for about 5 years using Python and R.
  • Using Julia in research, and in a project with a major company.

Is Julia a good alternative for ML?

Some preliminary considerations

  • You cannot go against the technological platform of the project.

  • If you are the only person who knows Julia, nobody else can maintain the code.

  • The runtime improvement can be important, but it depends on the project.

  • It works best when it is integrated into Python packages.

Quixote against windmills

ML Ecosystem from a global point of view

  • Matrices: packages to manage matrices on CPU or GPU (NumPy, PyTorch, or JAX).

  • DataFrames: packages to manage dataframes and to read/write CSV, Excel, and other formats. They can be very complex (Pandas or Polars).

  • Visualization: from packages that visualize DataFrames easily (Seaborn) down to the lower-level package to tweak details (Matplotlib).

  • Machine Learning: packages with the ML models that can be trained and applied.

  • Deep Learning: packages to create DL models, train them, and run inference.

Visualization of common packages

Packages comparison
Functionality      R             Python               Julia
Matrices           Matrix        NumPy, JAX           Standard, CUDA.jl
DataFrames         dplyr         Pandas, Polars       DataFrames
Visualization      ggplot        Seaborn/Matplotlib   ggplot, AlgebraOfGraphics/Makie
Machine Learning   caret, mlr3   scikit-learn         MLJ
Deep Learning      Keras         PyTorch, Keras       Metalhead/Flux

Matrices

  • Vectors and matrices are integrated into the standard library.

  • The code is compiled before it is run, so looping is not slow.

  • You can use vectorized operations (sum, filter, map, …) for conciseness, not for performance.

function distEuc(sol1, sol2)
    return sqrt(sum((sol1 - sol2).^2 ))
end

sol1 = 1:1_000
sol2 = ones(length(sol1))
distEuc(sol1, sol2)
18243.72494859534
  • Any function can be vectorized by adding a dot after its name (or before an operator).
poly(x)=3*x+5
sol = 1:5
poly.(sol)
5-element Vector{Int64}:
  8
 11
 14
 17
 20

The performance is nearly the same:

using BenchmarkTools

function distEuc2(sol1, sol2)
   value = 0.0
   
    for i in eachindex(sol1, sol2)
        value += (sol1[i] - sol2[i])^2
    end
    
    return sqrt(value)
end

# Check output is the same
distEuc2(sol1, sol2) ≈ distEuc(sol1, sol2)
true
@btime distEuc(sol1, sol2)
@btime distEuc2(sol1, sol2)
  1.401 μs (7 allocations: 15.77 KiB)
  864.550 ns (1 allocation: 16 bytes)
18243.72494859534
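For reference, there is a middle ground: passing a function to sum avoids the temporary arrays while keeping the code concise. A minimal sketch (distEuc3 is not from the talk):

# Concise and allocation-free: sum(f, itr) evaluates f on the fly,
# so no intermediate arrays are created
distEuc3(sol1, sol2) = sqrt(sum(i -> (sol1[i] - sol2[i])^2, eachindex(sol1, sol2)))

distEuc3(sol1, sol2) ≈ distEuc(sol1, sol2)  # true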

DataFrames: DataFrames.jl


DataFrames

  • It is one of the best packages.

  • It has a great API in comparison with Pandas (easier to use).

using CSV, DataFrames

df = CSV.read("starwars/planets.csv", DataFrame; missingstring="NA")
first(df, 3)
3×9 DataFrame
 Row │ name      rotation_period  orbital_period  diameter  climate              gravity       terrain                             surface_water  population
     │ String15? Int64?           Int64?          Int64?    String31?            String?       String?                             Float64?       Int64?
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Alderaan  24               364             12500     temperate            1 standard    grasslands, mountains               40.0           2000000000
   2 │ Yavin IV  24               4818            10200     temperate, tropical  1 standard    jungle, rainforests                 8.0            1000
   3 │ Hoth      23               549             7200      frozen               1.1 standard  tundra, ice caves, mountain ranges  100.0          missing

The package DataFramesMeta, by the same author, makes it easy to combine operations:

using Statistics, DataFramesMeta

df_mean = @chain df begin
    # Keep rows with little surface water
    @subset(:surface_water .< 40)
    # Ignore rows with missing diameter
    @subset(.! ismissing.(:diameter))
    # Group by climate
    @groupby(:climate)
    # Compute the mean diameter per climate
    @combine(:diameter_mean = mean(:diameter))
end
first(df_mean, 5)
5×2 DataFrame
 Row │ climate                 diameter_mean
     │ String31?               Float64
─────┼───────────────────────────────────────
   1 │ temperate, tropical           10200.0
   2 │ murky                          8900.0
   3 │ temperate                     32574.0
   4 │ temperate, arid               11370.0
   5 │ temperate, arid, windy        12900.0

Extra coming from R: TidierData.jl

  • It copies the R interface, which is more intuitive for people with R experience.
using Tidier

@chain df begin
    @filter(surface_water < 40)
    @filter(!ismissing(diameter))
    @group_by(climate)
    @summarize(diameter_mean = mean(diameter))
    @slice(1:5)
end
5×2 DataFrame
 Row │ climate                 diameter_mean
     │ String31?               Float64
─────┼───────────────────────────────────────
   1 │ temperate, tropical           10200.0
   2 │ murky                          8900.0
   3 │ temperate                     32574.0
   4 │ temperate, arid               11370.0
   5 │ temperate, arid, windy        12900.0

Advantages of DataFrames.jl

  1. Efficient: Built on top of Julia’s powerful and efficient array-based data structures.

  2. Flexible: It supports various data sources, including CSV, TSV, Excel, and SQL databases.

  3. User-friendly: In combination with DataFramesMeta.jl.

  4. Compatible: Integrated with other Julia packages for data analysis and visualization.

  5. Strong typing: Data manipulation operations are type-stable, yielding efficient code execution.

  6. Missing data handling: It provides robust handling of missing or null data values.
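On the last point, a tiny illustrative sketch of how missing values behave (not from the talk):

v = [1, missing, 3]
sum(v)                 # missing: propagates by default
sum(skipmissing(v))    # 4: skip the missing values explicitly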

Personal example

In a research project, I needed to combine several columns in a specific way to create new ones.

  • In Pandas it was not easy at all.

  • And none of the options was efficient.

  • In DataFrames it was very easy and efficient:

df2 = transform(df_p, names(df_p, r".*[0-9]$") => ByRow(translate) => ["MERGED_label1", "MERGED_label2", "MERGED_label3",
                                                                       "MERGED_prob1", "MERGED_prob2", "MERGED_prob3"])

This detects all columns ending with a number, calls a specific function on each row, and incorporates the results as new columns.
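The same pattern in a self-contained toy form (the data and the translate function below are illustrative stand-ins, not the project's):

using DataFrames

# Toy stand-in for the project's function: it receives the values of the
# matched columns for one row and returns one value per new column
translate(a, b) = (a + b, a * b)

df_p = DataFrame(x1 = 1:3, x2 = 4:6, note = ["a", "b", "c"])

# Select the columns ending with a digit, apply translate row by row,
# and store the results as two new columns
df2 = transform(df_p, names(df_p, r"[0-9]$") => ByRow(translate) => ["MERGED_sum", "MERGED_prod"])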

Conclusions about DataFrames.jl

  • Great documentation.

  • Very complex.

  • The author is very involved in the package and in the community.

  • If only all packages were as mature as this one…

Conclusion: Mature enough to use in your preprocessing workflow.

Visualization: Makie and AlgebraOfGraphics

Visualization

  • My experience in Python is using Altair and Seaborn to visualize dataframes.

  • Matplotlib is only used when it is required.

  • Makie is a complete alternative to Matplotlib, but it does not work directly with DataFrames.

  • AlgebraOfGraphics is a tool on top of Makie, inspired by ggplot2.

  • There are other packages: JuliaPlots, …

Example: visualizing penguins

First, with Python:

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins")

# Setting the dimensions of the plot with aspect
g = sns.relplot(x="bill_length_mm", y="bill_depth_mm", data=penguins,
                hue="species", aspect=2)
plt.show()

Example in Julia

using PalmerPenguins, DataFrames
using AlgebraOfGraphics
using CairoMakie

penguins = DataFrame(PalmerPenguins.load()) |> dropmissing

plt = data(penguins) * mapping(:bill_length_mm, :bill_depth_mm, color = :species)

axis = (width = 500, height = 250)
draw(plt; axis = axis)

There is more:

  • It can split the figure using col and row, as in Seaborn.
  • Different styles.
  • Select the visual type with specific parameters.
plt2 = plt * mapping(col=:island) * visual(Scatter, markersize=15)

draw(plt2; axis = (width = 500, height = 500))

Extra coming from R: TidierPlot.jl

  • Similar to TidierData, it copies the ggplot interface.
using Tidier

ggplot(penguins, @aes(x=bill_length_mm, y=bill_depth_mm, color=island)) + geom_point()

geom_point
data: inherits from plot
x: inherits from plot 
y: inherits from plot 

Conclusions about Visualization in Julia

  • It is good.

  • Makie has great documentation.

  • AlgebraOfGraphics is nice, but its documentation is worse.

  • TidierPlots works, but some features are missing (like facet_wrap).

Conclusion: Makie is mature and well documented; AlgebraOfGraphics should improve.

Machine Learning: MLJ

Library MLJ.jl

  • It tries to be the scikit-learn for Julia.

  • It is a global wrapper API; it does not implement the algorithms itself.

  • Models are implemented in different packages: DecisionTree.jl, MLJFlux.jl, ParallelKMeans.jl, …

  • It also includes models from scikit-learn, via MLJScikitLearnInterface.jl.

  • Hyperparameter tuning.

  • Compatible packages for imbalanced data.

MLJ Ecosystem

MLJ Example

using MLJ

iris = load_iris();
selectrows(iris, 1:3) |> pretty
┌──────────────┬─────────────┬──────────────┬─────────────┬──────────────────────────────────┐
│ sepal_length │ sepal_width │ petal_length │ petal_width │ target                           │
│ Float64      │ Float64     │ Float64      │ Float64     │ CategoricalValue{String, UInt32} │
│ Continuous   │ Continuous  │ Continuous   │ Continuous  │ Multiclass{3}                    │
├──────────────┼─────────────┼──────────────┼─────────────┼──────────────────────────────────┤
│ 5.1          │ 3.5         │ 1.4          │ 0.2         │ setosa                           │
│ 4.9          │ 3.0         │ 1.4          │ 0.2         │ setosa                           │
│ 4.7          │ 3.2         │ 1.3          │ 0.2         │ setosa                           │
└──────────────┴─────────────┴──────────────┴─────────────┴──────────────────────────────────┘
schema(iris)
┌──────────────┬───────────────┬──────────────────────────────────┐
│ names        │ scitypes      │ types                            │
├──────────────┼───────────────┼──────────────────────────────────┤
│ sepal_length │ Continuous    │ Float64                          │
│ sepal_width  │ Continuous    │ Float64                          │
│ petal_length │ Continuous    │ Float64                          │
│ petal_width  │ Continuous    │ Float64                          │
│ target       │ Multiclass{3} │ CategoricalValue{String, UInt32} │
└──────────────┴───────────────┴──────────────────────────────────┘

The schema shows the scientific types (scitypes); you can convert them when needed.
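For instance, a minimal sketch of converting scitypes with coerce (toy data, not the iris example):

# A toy table whose scitypes need fixing: counts stored as Float64,
# labels as plain strings (scitype Textual)
tbl = (count = [1.0, 2.0, 3.0], class = ["a", "b", "a"])

# coerce converts the scientific types so models interpret the data correctly
tbl2 = coerce(tbl, :count => Count, :class => Multiclass)
schema(tbl2)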

Checking the possible models

y, X = unpack(iris, ==(:target); rng=123);
# Check the possible models for these data
models(matching(X,y))
54-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :target_in_fit, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = AdaBoostClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = BaggingClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = BayesianLDA, package_name = MLJScikitLearnInterface, ... )
 (name = BayesianLDA, package_name = MultivariateStats, ... )
 (name = BayesianQDA, package_name = MLJScikitLearnInterface, ... )
 (name = BayesianSubspaceLDA, package_name = MultivariateStats, ... )
 (name = CatBoostClassifier, package_name = CatBoost, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DecisionTreeClassifier, package_name = DecisionTree, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DummyClassifier, package_name = MLJScikitLearnInterface, ... )
 ⋮
 (name = RandomForestClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = RidgeCVClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = RidgeClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = SGDClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = SVC, package_name = LIBSVM, ... )
 (name = SVMClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = SVMLinearClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = SVMNuClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = StableForestClassifier, package_name = SIRUS, ... )
 (name = StableRulesClassifier, package_name = SIRUS, ... )
 (name = SubspaceLDA, package_name = MultivariateStats, ... )
 (name = XGBoostClassifier, package_name = XGBoost, ... )

Applying a specific model

# Load the class from the package
Tree = @load DecisionTreeClassifier pkg=DecisionTree

model = Tree()
# Apply cross-validation
evaluate(model, X, y,
                resampling=CV(nfolds=5, shuffle=true),
                measures=[log_loss, accuracy],
                verbosity=0)
import MLJDecisionTreeInterface ✔
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│   │ measure              │ operation    │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss(             │ predict      │ 2.4         │
│   │   tol = 2.22045e-16) │              │             │
│ B │ Accuracy()           │ predict_mode │ 0.933       │
└───┴──────────────────────┴──────────────┴─────────────┘
┌───┬─────────────────────────────────┬─────────┐
│   │ per_fold                        │ 1.96*SE │
├───┼─────────────────────────────────┼─────────┤
│ A │ [2.4, 2.22e-16, 1.2, 7.21, 1.2] │ 2.76    │
│ B │ [0.933, 1.0, 0.967, 0.8, 0.967] │ 0.0766  │
└───┴─────────────────────────────────┴─────────┘

Comparing several models

RF = @load RandomForestClassifier pkg=DecisionTree

for (name, model) in [("DT", Tree()), ("RandomForest", RF())]
    sal = evaluate(model, X, y,
                resampling=CV(nfolds=5, shuffle=true),
                measures=[log_loss, accuracy],
                verbosity=0)
    log, accu = sal.measurement
    println("$(name): $log, $accu")
end
import MLJDecisionTreeInterface ✔
DT: 2.1626192033470293, 0.9400000000000001
RandomForest: 0.11242771720738147, 0.9533333333333334

Training and then applying a model

using MLJBase

RF = @load RandomForestClassifier pkg=DecisionTree

model = RF()
# holdout = Holdout(fraction_train=0.7, shuffle=true, rng=35)
train_index, test_index = partition(eachindex(y), 0.7, shuffle=true, rng=35)
mach = machine(model, X, y)
# Training
fit!(mach, rows=train_index)
# Predict
first(predict(mach, rows=test_index), 5)
import MLJDecisionTreeInterface ✔
5-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.99, virginica=>0.01)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.31, virginica=>0.69)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.94, virginica=>0.06)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.31, virginica=>0.69)
 UnivariateFinite{Multiclass{3}}(setosa=>0.99, versicolor=>0.01, virginica=>0.0)
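The predictions are probability distributions; to obtain the most likely class, predict_mode can be used instead (a short sketch continuing the code above):

# Point predictions (most likely class) instead of distributions
yhat = predict_mode(mach, rows=test_index)
first(yhat, 5)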

Conclusions about MLJ

  • It is extremely well documented.

  • Under continuous improvement.

  • There is a learning curve coming from scikit-learn, especially regarding the scitypes.

  • It includes TunedModel for tuning (see the sketch below), but it is not as comfortable as scikit-learn.

  • It can use scikit-learn models, but the error messages in that case are not good.
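A minimal sketch of tuning with TunedModel (assuming the Tree model and the iris X, y from above; the range and grid are arbitrary):

# Tune max_depth of the decision tree with a grid search over 5-fold CV
tree = Tree()
r = range(tree, :max_depth, lower=2, upper=10)
tuned = TunedModel(model=tree, tuning=Grid(resolution=5),
                   resampling=CV(nfolds=5), range=r, measure=log_loss)
mach_tuned = machine(tuned, X, y)
fit!(mach_tuned, verbosity=0)
fitted_params(mach_tuned).best_model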

Conclusion: A good tool that pays off the learning effort, but still worse than scikit-learn.

Deep Learning: Flux.jl and Lux.jl

Deep Learning: Flux.jl and Tools

  • Flux.jl is the reference package for Deep Learning.

  • Lux is another one, without mutation.

  • Both work on GPU, with worse performance than PyTorch.

  • Everything is implemented in Julia, so the source code is easy to read.

  • Installation without problems.

  • The API is very easy (see the sketch below), but the errors are not very intuitive.

  • Metalhead.jl includes convolutional models, but few come with pre-trained weights.

  • An implementation of FastAI makes it easier to work with Flux models.
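To give a flavour of the Flux API, a minimal sketch (toy data and arbitrary layer sizes, not from the talk):

using Flux

# A small multi-layer perceptron: 4 input features, 3 classes
model = Chain(Dense(4 => 16, relu), Dense(16 => 3), softmax)

# Dummy data: 100 samples with 4 features, and random one-hot labels
Xs = rand(Float32, 4, 100)
ys = Flux.onehotbatch(rand(1:3, 100), 1:3)

# Standard training loop: Adam optimiser and cross-entropy loss
opt_state = Flux.setup(Adam(0.01), model)
for epoch in 1:100
    grads = Flux.gradient(m -> Flux.crossentropy(m(Xs), ys), model)
    Flux.update!(opt_state, model, grads[1])
end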

Conclusion: Still not mature enough for complex DL usage.

Symbolic Regression

Symbolic Regression

  • SymbolicRegression.jl is a great package.

  • It can be used from Python through the PySR wrapper.

  • Successfully used in an enterprise optimization problem.
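A minimal sketch of its MLJ-style interface, SRRegressor (toy data and settings, not the enterprise problem):

using MLJ, SymbolicRegression

# Toy data: recover y = 2*x1 - x2^2 from samples
X = (x1 = randn(100), x2 = randn(100))
y = 2 .* X.x1 .- X.x2 .^ 2

model = SRRegressor(binary_operators=[+, -, *],
                    unary_operators=[cos],
                    niterations=30)
mach = machine(model, X, y)
fit!(mach)
report(mach).equations   # candidate expressions, ordered by complexity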

Conclusion: In my opinion, the best SR package in Julia/Python. The Python wrapper makes it easier to incorporate.

Conclusions

  • Julia is a great language for scientific computation.

  • Julia is ready for preprocessing, data manipulation, and visualization.

  • MLJ is fairly ready for ML problems, but it requires some adaptation.

  • Deep Learning in Julia is interesting, but it is still not mature enough for complex problems.

  • Wrapping Julia packages in Python packages seems a good strategy.

Conclusion: You should consider Julia for preprocessing, visualization, and for several excellent packages.

Thank you for your attention